I have a bunch of files stored in S3 in CSV format (no header), and in many cases there is only one record per file. For example:
"6ad0638e-e7d3-4c33-8271-5b3972c6155f",1532653200000
When I run a Glue crawler over these files, it creates a separate table for each file.
Question(s): Why are these single-record files not combined into one table, and how can I get the crawler to classify them correctly?
Thanks
I contacted AWS Support, and here are the details:
The problem is caused by the files that contain a single record. By default, the Glue crawler uses LazySimpleSerDe to classify CSV files, and LazySimpleSerDe needs at least one newline character to identify a file as CSV; that is a limitation of the SerDe.
The right way to solve this issue is to use a custom classifier with a Grok pattern.
To confirm this, I tested some scenarios on my end with your data and a custom pattern. I created three files: file1.csv with one record, file2.csv with two records, and file3.csv with one record. Note that a proper Grok pattern should also account for the end of the line with $, i.e.
%{QUOTEDSTRING:rid:string},%{NUMBER:ts:long}$
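For anyone scripting this, here is a minimal sketch of registering that pattern as a custom Grok classifier with boto3; the classifier name and region are assumptions, and the pattern itself is the one AWS Support suggested above:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Register a custom Grok classifier that matches one quoted UUID followed by
# an epoch-millisecond timestamp, anchored to the end of the line with $.
glue.create_classifier(
    GrokClassifier={
        "Name": "single-record-csv",  # hypothetical classifier name
        "Classification": "csv",
        "GrokPattern": '%{QUOTEDSTRING:rid:string},%{NUMBER:ts:long}$',
    }
)
```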
Based on my observations, the problem may also be due to the crawler caching the older classification details, so I'd recommend creating a new crawler and pointing it to a new database in the catalog.
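A sketch of that suggestion in boto3, assuming the classifier created above; the database name, crawler name, IAM role, and S3 path are all placeholders:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# A fresh catalog database, so no cached classification details are reused.
glue.create_database(DatabaseInput={"Name": "my_new_db"})  # hypothetical name

# A new crawler that points at the S3 data and uses the custom classifier.
glue.create_crawler(
    Name="single-record-csv-crawler",  # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="my_new_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/csv-data/"}]},  # placeholder path
    Classifiers=["single-record-csv"],  # custom Grok classifier from earlier
)

glue.start_crawler(Name="single-record-csv-crawler")
```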