Rename S3 folder structure for Athena partitioning

Question

I have access to an S3 bucket structured like bucket_name/year/month/day/file.gz with hundreds of files per day. I know that to define a partitioned Athena table over this data it would need to be named as bucket_name/year=year/month=month/day=day. Short of writing a shell script spelling out every day (so, a series of

aws cp --recursive s3://old_bucket/YYYY/MM/DD/* s3://new_bucket/year=YYYY/month=MM/day=DD/

for each value of YYYY/MM/DD in the dataset)

is there a simpler way to approach this? I know about ALTER TABLE ADD PARTITION but again it seems to require me to specify each partition individually.

Theo · Accepted Answer

You don't need to rename the files at all. While it's true that most examples use the Hive-style naming convention Athena does not require it.

There are many ways to add partitions to an Athena table. In your case I would either go with partition projection, which would make new data partitions available immediately. Alternatively you can add partitions manually with ALTER TABLE … ADD PARTITION.

To create a table configured with partition projection you can use this as a starting point:

CREATE EXTERNAL TABLE my_table (
  …
)
PARTITIONED BY (
  `date` string
)
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.date.type" = "date",
  "projection.date.range" = "2020/01/01,NOW",
  "projection.date.format" = "yyyy/MM/dd",
  "storage.location.template" = "s3://bucket_name/${date}/"
)

You can then query your table with

SELECT *
FROM my_table
WHERE "date" = '2020/10/24'

Note that the date column/partition key is a string and not a DATE. Athena will take the string and interpolate it into the URI given by storage.location.template. Partition projection is pretty clever, and I encourage you to read the docs to find out what the ….range property does, for example.

Also note that date is a reserved word, and to use it in DDL you must quote it in backticks, but in queries it needs to instead be quoted in double quotes. If you want to avoid having to always quote you can name it something else, but if you do you need to change the name both in the PARTITIONED BY part and the TBLPROPERTIES part.

The alternative to partition projection, which is a fairly new feature, is to add partitions manually. You can do this with the Glue Data Catalog API, which is preferable when writing code in my opinion, or you can do it with DDL, which is more compact and easier to fit into a Stack Overflow answer.

Assuming you have a table partitioned by date like above (but without the TBLPROPERTIES since those are partition projection-specific), you can add partitions like this:

ALTER TABLE my_table ADD IF NOT EXISTS
PARTITION (`date` = '2020-10-22') LOCATION 's3://bucket_name/2020/10/22/'
PARTITION (`date` = '2020-10-23') LOCATION 's3://bucket_name/2020/10/23/'
PARTITION (`date` = '2020-10-24') LOCATION 's3://bucket_name/2020/10/24/'

You can then query your table like this:

SELECT *
FROM my_table
WHERE "date" = '2020-10-24'

Note here that I add the partitions with values for the partition keys that don't correspond exactly to how the dates are represented in the S3 URIs (I format the dates with dashes in the standard ISO way, instead of slashes). When adding partitions manually there doesn't have to be any correspondence at all between the partition keys' values and the S3 URI.

Some people will tell you that you must use Hive-style partitioning with Athena, and that you should then add partitions with MSCK REPAIR TABLE. This is not the case, as I hope I've shown above, and using that command to add partitions is not a good idea, it works for a few partitions, but eventually it will start timing out.

Rename S3 folder structure for Athena partitioning

Tags:

amazon-s3

amazon-athena

Silenced Temporarily

1 Answers

Theo

Recent Activity

Donate For Us

Rename S3 folder structure for Athena partitioning

Tags:

amazon-s3

amazon-athena

Silenced Temporarily

1 Answers

Theo

Related questions

Recent Activity

Donate For Us