I'm trying to use AWS Athena to run queries on JSON files we have stored in S3. I managed to create a simple schema and everything seemed fine until I noticed that some of my files are not accounted for.
The keys of the files are user ids, and some of those start with _. All of those are missing in Athena. They exist in S3, I can fetch them, and they look just like the other files, but Athena does not see them.
Apparently it does not like underscores at the beginning of keys. Is there a way around this other than renaming all the files? Underscores elsewhere in the key do not seem to be an issue.
My schema (I simplified it by removing fields):
CREATE EXTERNAL TABLE IF NOT EXISTS db.table (
`user_id` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://xyz/myfiles/'
TBLPROPERTIES ('has_encrypted_data'='false');
When you query a table, Amazon Athena uses Presto under the hood. Starting with Presto version 0.60, Presto ignores files whose names start with an underscore (_) or a dot (.). This matches the behavior of Hadoop MapReduce / Hive.
https://prestodb.io/docs/current/release/release-0.60.html
Presto filters out these hidden files using org.apache.hadoop.hive.common.FileUtils.HIDDEN_FILES_PATH_FILTER. Because the filter comes from Hive, the same applies to Hive tables: they also ignore such files in the table's location.
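To see which of your keys are affected, the rule described above can be approximated in a few lines of Python. This is only a sketch, assuming the filter looks at the final path component and rejects names starting with _ or . as the answer describes; is_hidden, split_keys and the sample keys are made-up names for illustration and are not part of Presto, Hive or Athena.

```python
from posixpath import basename

# Prefixes treated as "hidden", per the rule described in the answer above.
HIDDEN_PREFIXES = ("_", ".")


def is_hidden(key: str) -> bool:
    """Return True if the last path component of the key starts with '_' or '.'."""
    return basename(key).startswith(HIDDEN_PREFIXES)


def split_keys(keys):
    """Split keys into (visible, hidden) lists according to is_hidden."""
    visible = [k for k in keys if not is_hidden(k)]
    hidden = [k for k in keys if is_hidden(k)]
    return visible, hidden


if __name__ == "__main__":
    # Hypothetical object keys under the table location.
    sample_keys = [
        "myfiles/abc123.json",
        "myfiles/_abc123.json",
        "myfiles/user_42.json",
        "myfiles/.tmp.json",
    ]
    visible, hidden = split_keys(sample_keys)
    print("would be read:   ", visible)
    print("would be skipped:", hidden)
```

Feeding it the key listing of your bucket gives you the list of files that Athena would silently skip; an underscore in the middle of a name, as in user_42.json, does not trigger the filter.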