Does anyone know whether it's possible to tell the Glue writer to keep the column you're partitioning on in the actual dataframe?
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
Here, $outpath is a placeholder for the base output path in S3. The partitionKeys parameter can also be specified in Python in the connection_options dict:
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
When you execute this write, the type field is removed from the individual records and is encoded in the directory structure.
I would like to keep the type field in the individual record.
I am not 100% sure whether it is possible to tell Glue to keep the column, but in the meantime you could use this workaround: duplicate the column and partition on the copy.
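from awsglue.dynamicframe import DynamicFrame

# DynamicFrame has no withColumn, so add the copy on a Spark DataFrame and convert back.
df = projectedEvents.toDF()
df = df.withColumn("type_partition", df["type"])
projectedEvents = DynamicFrame.fromDF(df, glue_context, "projectedEvents")
glue_context.write_dynamic_frame.from_options(
    frame=projectedEvents,
    connection_options={"path": "$outpath", "partitionKeys": ["type_partition"]},
    format="parquet"
)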
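To see why partitioning on a copy keeps the original field, here is a minimal pure-Python sketch of the behaviour the blog post describes (the partition key is dropped from each record and encoded in the directory name). It is not Glue code: `partition_records` and the sample `events` are hypothetical names made up for illustration, and the real writer produces Parquet files under `$outpath` rather than an in-memory dict.

```python
from collections import defaultdict


def partition_records(records, partition_key):
    """Group records Hive-style: the key moves into the directory name
    and is removed from each stored row (hypothetical helper)."""
    files = defaultdict(list)
    for record in records:
        row = dict(record)                 # copy so the input is not mutated
        value = row.pop(partition_key)     # key leaves the row ...
        files[f"{partition_key}={value}"].append(row)  # ... and lands in the path
    return dict(files)


# Made-up sample data standing in for projectedEvents.
events = [{"id": 1, "type": "push"}, {"id": 2, "type": "fork"}]

# Partitioning directly on "type" drops it from the stored rows.
plain = partition_records(events, "type")

# Duplicating the column first means only the copy is dropped,
# so "type" survives in every stored row.
duplicated = [{**e, "type_partition": e["type"]} for e in events]
kept = partition_records(duplicated, "type_partition")
```

The cost of the workaround is that readers see both `type` in the records and `type_partition` as the partition column, so downstream queries should filter on `type_partition` if they want partition pruning.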