Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parquet schema management

I have recently started working on a new project where we use Spark to write/read data in Parquet format. The project is changing rapidly and here and there we do need to change the schema of parquet files regularly. I am currently struggling with versioning data and code.

We use versioning system for our codebase but its very hard (at least in my opinion) to do it for data itself. I also have migration script, which I use to migrate data from old schema to the new schema but along the way I loose the information about what was the schema of a parquet file before running the migration. It is my priority to know the original schema as well.

So my questions would be

  • How do you keep track of parquet files which have schema inconsistencies in your HDFS? I have several Terabytes of parquet files.
  • After running migration script to convert your current schema(original) to the new schema, how do you keep track of original schema?
  • Is there any existing tool to achieve this or I have to write something of my own?
like image 841
unknown Avatar asked Oct 16 '25 11:10

unknown


1 Answers

You can use delta lake it has feature of overwriting schema and maintaining previous versions of data

delta lake is basically a bunch of parquet files with a delta log(commit log)

data.write.format("parquet").mode("overwrite").save("/tmp/delta-table")

The above code snippet overwrites normal parquet file which means that previous data will be overwritten

data.write.format("delta").mode("overwrite").save("/tmp/delta-table")

The above is delta lake overwrite it go and check the delta log and overwrite a new version of data in the delta lake as version 1 with time stamp(if previous data was version zero) we can also time travel(read previous versions of data) in delta lake

df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")

this code can be used read zeroth version of the data

like image 68
Shantanu singh Avatar answered Oct 19 '25 10:10

Shantanu singh



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!