I have a pretty basic s3 setup that I would like to query against using Athena. The data is all stored in one bucket, organized into year/month/day/hour folders.
|--data
|   |--2018
|   |   |--01
|   |   |   |--01
|   |   |   |   |--01
|   |   |   |   |   |--file1.json
|   |   |   |   |   |--file2.json
|   |   |   |   |--02
|   |   |   |   |   |--file3.json
|   |   |   |   |   |--file4.json
...
I then setup an AWS Glue Crawler to crawl s3://bucket/data. The schema in all files is identical. I would expect that I would get one database table, with partitions on the year, month, day, etc. 
What I get instead are tens of thousands of tables. There is a table for each file, and a table for each parent partition as well. So far as I can tell, separate tables were created for each file/folder, without a single overarching one where I can query across a large date range.
I followed instructions https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html to the best of my ability, but cannot figure out how to structure my partitions/scanning such that I don't get this huge, mostly worthless dump of data.
The AWS Glue crawler creates multiple tables when your source data files don't use the same: Format (such as CSV, Parquet, or JSON) Compression type (such as SNAPPY, gzip, or bzip2)
When an AWS Glue crawler scans Amazon S3 and detects multiple folders in a bucket, it determines the root of a table in the folder structure and which folders are partitions of a table. The name of the table is based on the Amazon S3 prefix or folder name.
You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog.
Table partitions. An AWS Glue table definition of an Amazon Simple Storage Service (Amazon S3) folder can describe a partitioned table. For example, to improve query performance, a partitioned table might separate monthly data into different files using the name of the month as a key.
Glue Crawler leaves a lot to be desired. It's promises to solve a lot of situations, but is really limited in what it actually supports. If your data is stored in directories and does not use Hive-style partitioning (e.g. year=2019/month=02/file.json) it will more often than not mess up. It's especially frustrating when the data is produced by other AWS products, like Kinesis Firehose, which it looks like your data could be.
Depending on how much data you have I might start by just creating an unpartitioned Athena table that pointed to the root of the structure. It's only once your data grows beyond multiple gigabytes or thousands of files that partitioning becomes important.
Another strategy you could employ is to add a Lambda function that gets triggered by an S3 notification whenever a new object lands in your bucket. The function could look at the key and figure out which partition it belongs to and use the Glue API to add that partition to the table. Adding a partition that already exists will return an error from the API, but as long as your function catches it and ignores it you will be fine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With