I have a column called target_col_a in my dataframe with Timestamp values that have been cast to String, e.g. 2020-05-27 08:00:00.
I then partitionBy this column as shown below:
target_dataset \
    .write.mode('overwrite') \
    .format('parquet') \
    .partitionBy('target_col_a') \
    .save('s3://my-bucket/my-path')
However, my S3 path turns out like s3://my-bucket/my-path/target_col_a=2020-05-27 08%3A00%3A00/part-0-file1.snappy.parquet
Is there a way to output the partition without the %3A and retain :?
Note: when I use Glue's native DynamicFrame to write to S3, or Redshift UNLOAD to S3, the partitioning comes out as desired (with : and without the %3A), e.g.
glueContext.write_dynamic_frame.from_options(
    frame = target_dataset,
    connection_type = "s3",
    connection_options = {
        "path": "s3://my-bucket/my-path/",
        "partitionKeys": ["target_col_a"]
    },
    format = "parquet",
    transformation_ctx = "datasink2"
)
The short answer is no, you can't.
PySpark uses the Hadoop client libraries for input and output. These libraries build paths with the Java URI class. Spaces and colons are not valid URI characters, so they are URL-encoded before writing. PySpark handles the decoding automatically when the dataset is read back, but if you want to access the data outside of Spark or Hadoop, you'll need to URL-decode the partition values yourself.
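For example, if you list the objects with boto3 or any other client, a minimal sketch of recovering the original value (key taken from the question) could look like this:
from urllib.parse import unquote

# Example key as Spark/Hadoop wrote it (colon URL-encoded as %3A):
key = 'my-path/target_col_a=2020-05-27 08%3A00%3A00/part-0-file1.snappy.parquet'

partition_dir = key.split('/')[1]        # 'target_col_a=2020-05-27 08%3A00%3A00'
col, encoded = partition_dir.split('=', 1)
print(col, unquote(encoded))             # target_col_a 2020-05-27 08:00:00
Reading the data back with spark.read.parquet('s3://my-bucket/my-path') already shows target_col_a as 2020-05-27 08:00:00, so this decoding is only needed for tools that look at the raw keys.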
Special characters like spaces and : cannot be part of an S3 URI.
Even if you somehow manage to create one, you will face difficulties every time you use it.
It is better to replace these characters with URI-acceptable ones (see the sketch after the guidelines below).
You should follow the key naming conventions described in the Object Key Guidelines section of the Amazon S3 documentation.
The following character sets are generally safe for use in key names:
Alphanumeric characters [0-9a-zA-Z]
Special characters !, -, _, ., *, ', (, and )
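If you control the write side, one workaround (a sketch only, reusing target_dataset and target_col_a from the question) is to rewrite the partition value so it contains only characters from the safe set above before calling partitionBy, for example replacing spaces and colons with -:
from pyspark.sql import functions as F

# Hypothetical reformatting: '2020-05-27 08:00:00' -> '2020-05-27-08-00-00'
safe_dataset = target_dataset.withColumn(
    'target_col_a',
    F.regexp_replace('target_col_a', '[ :]', '-')
)

safe_dataset \
    .write.mode('overwrite') \
    .format('parquet') \
    .partitionBy('target_col_a') \
    .save('s3://my-bucket/my-path')
The partition directory should then come out as target_col_a=2020-05-27-08-00-00, with nothing left for Hadoop to URL-encode.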