I have a bunch of files stored in S3 in CSV format (no header), and in many cases there is only one record per file. For example:
"6ad0638e-e7d3-4c33-8271-5b3972c6155f",1532653200000
When I run a Glue crawler over these files, it creates a separate table for each file.
Question(s): Why are these single-record files not combined into one table, and how can I get the crawler to classify them correctly?
Thanks
I contacted AWS Support, and here are the details:
The problem is caused by the files that contain a single record. By default, the Glue crawler uses LazySimpleSerDe to classify CSV files, and LazySimpleSerDe needs at least one newline character to identify a file as CSV; that is a limitation of the SerDe.
The right way to solve this issue is to use a custom classifier with a Grok pattern.
To confirm this, I tested some scenarios on my end with your data and a custom pattern. I created three files: file1.csv with one record, file2.csv with two records, and file3.csv with one record. Note that a proper Grok pattern should also account for the end of the line with $, i.e.
%{QUOTEDSTRING:rid:string},%{NUMBER:ts:long}$
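For anyone scripting this, here is a minimal sketch of registering that pattern as a custom Grok classifier with boto3; the classifier name and region are assumptions, and the pattern itself is the one AWS Support suggested above:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Register a custom Grok classifier that matches one quoted UUID followed by
# an epoch-millisecond timestamp, anchored to the end of the line with $.
glue.create_classifier(
    GrokClassifier={
        "Name": "single-record-csv",  # hypothetical classifier name
        "Classification": "csv",
        "GrokPattern": '%{QUOTEDSTRING:rid:string},%{NUMBER:ts:long}$',
    }
)
```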
Based on my observations, the problem may also be due to the crawler caching the older classification details, so I'd recommend creating a new crawler and pointing it to a new database in the catalog.
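A sketch of that suggestion in boto3, assuming the classifier created above; the database name, crawler name, IAM role, and S3 path are all placeholders:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# A fresh catalog database, so no cached classification details are reused.
glue.create_database(DatabaseInput={"Name": "my_new_db"})  # hypothetical name

# A new crawler that points at the S3 data and uses the custom classifier.
glue.create_crawler(
    Name="single-record-csv-crawler",  # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="my_new_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/csv-data/"}]},  # placeholder path
    Classifiers=["single-record-csv"],  # custom Grok classifier from earlier
)

glue.start_crawler(Name="single-record-csv-crawler")
```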