I have recently started working on a new project where we use Spark to read and write data in Parquet format. The project is evolving rapidly, so we need to change the schema of the Parquet files regularly. I am currently struggling with versioning both the data and the code.
We use a version control system for our codebase, but it is very hard (at least in my opinion) to do the same for the data itself. I also have a migration script, which I use to migrate data from the old schema to the new schema, but along the way I lose the information about what the schema of a Parquet file was before running the migration. Knowing the original schema is a priority for me.
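To make it concrete, this is roughly the kind of information I would like to keep before each migration (a simplified PySpark sketch; the paths and names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table while it still has the old schema (made-up path)
old_df = spark.read.parquet("/data/events")

# Persist the pre-migration schema so it is not lost after migrating
with open("/tmp/events_schema_before_migration.json", "w") as f:
    f.write(old_df.schema.json())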
So my questions would be:
You can use Delta Lake. It has features for overwriting the schema and maintaining previous versions of the data.
Delta Lake is basically a collection of Parquet files together with a delta log (a commit log).
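To try the snippets below, you need a Spark session with the Delta Lake extensions enabled and a DataFrame to write. A minimal sketch, assuming the delta-spark pip package is installed and using toy data:

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

# configure_spark_with_delta_pip adds the matching Delta jars to the session
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Toy data used in the examples below; replace with your own DataFrame
data = spark.range(0, 5)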
data.write.format("parquet").mode("overwrite").save("/tmp/delta-table")
The code snippet above overwrites a normal Parquet table, which means the previous data is lost.
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
The snippet above is a Delta Lake overwrite: Delta consults the delta log and writes the new data as a new version with a timestamp (version 1, if the previous data was version 0), rather than destroying the previous version. We can also time travel (read previous versions of the data) in Delta Lake, as shown further below.
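If the new data also has a different schema, a plain Delta overwrite will typically refuse the schema change unless you mark it as intentional with the overwriteSchema option. A minimal sketch, assuming the data DataFrame from above (the added column is purely illustrative):

# New version of the data with a changed schema (illustrative column)
new_data = data.withColumn("label", data["id"].cast("string"))

# overwriteSchema marks the schema change as intentional; the old data
# and its schema stay readable as an earlier version of the table
(new_data.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save("/tmp/delta-table"))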
df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
The code above can be used to read version 0 (the zeroth version) of the data.
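Each version keeps the schema it was written with, so reading an old version also recovers its original schema. To see which versions exist, you can inspect the table history, and you can time travel by timestamp as well. A short sketch (the timestamp is just an example value):

from delta.tables import DeltaTable

# df was loaded with versionAsOf=0 above, so its schema is the original one
print(df.schema.simpleString())

# List the versions Delta knows about, with their timestamps and operations
delta_table = DeltaTable.forPath(spark, "/tmp/delta-table")
delta_table.history().select("version", "timestamp", "operation").show(truncate=False)

# Time travel by timestamp instead of version number
old_df = (spark.read.format("delta")
          .option("timestampAsOf", "2024-01-01 00:00:00")
          .load("/tmp/delta-table"))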