I have recently started working on a new project where we use Spark to read and write data in Parquet format. The project is evolving rapidly, so we need to change the schema of the Parquet files regularly. I am currently struggling with versioning both the data and the code.
We use a version control system for our codebase, but it is very hard (at least in my opinion) to do the same for the data itself. I also have a migration script, which I use to migrate data from the old schema to the new schema, but along the way I lose the information about what the schema of a Parquet file was before running the migration. Knowing the original schema is a priority for me.
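To make it concrete, this is roughly the kind of information I would like to keep before each migration (a simplified PySpark sketch; the paths and names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table while it still has the old schema (made-up path)
old_df = spark.read.parquet("/data/events")

# Persist the pre-migration schema so it is not lost after migrating
with open("/tmp/events_schema_before_migration.json", "w") as f:
    f.write(old_df.schema.json())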
So my questions would be:
You can use Delta Lake. It has features for overwriting the schema and maintaining previous versions of the data.
Delta Lake is basically a collection of Parquet files together with a delta log (a commit log).
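To try the snippets below, you need a Spark session with the Delta Lake extensions enabled and a DataFrame to write. A minimal sketch, assuming the delta-spark pip package is installed and using toy data:

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

# configure_spark_with_delta_pip adds the matching Delta jars to the session
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Toy data used in the examples below; replace with your own DataFrame
data = spark.range(0, 5)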
data.write.format("parquet").mode("overwrite").save("/tmp/delta-table")
The code snippet above overwrites a normal Parquet table, which means the previous data is lost.
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
The snippet above is a Delta Lake overwrite: Delta consults the delta log and writes the new data as a new version with a timestamp (version 1, if the previous data was version 0), rather than destroying the previous version. We can also time travel (read previous versions of the data) in Delta Lake, as shown further below.
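If the new data also has a different schema, a plain Delta overwrite will typically refuse the schema change unless you mark it as intentional with the overwriteSchema option. A minimal sketch, assuming the data DataFrame from above (the added column is purely illustrative):

# New version of the data with a changed schema (illustrative column)
new_data = data.withColumn("label", data["id"].cast("string"))

# overwriteSchema marks the schema change as intentional; the old data
# and its schema stay readable as an earlier version of the table
(new_data.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save("/tmp/delta-table"))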
df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
The code above can be used to read version 0 (the zeroth version) of the data.
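Each version keeps the schema it was written with, so reading an old version also recovers its original schema. To see which versions exist, you can inspect the table history, and you can time travel by timestamp as well. A short sketch (the timestamp is just an example value):

from delta.tables import DeltaTable

# df was loaded with versionAsOf=0 above, so its schema is the original one
print(df.schema.simpleString())

# List the versions Delta knows about, with their timestamps and operations
delta_table = DeltaTable.forPath(spark, "/tmp/delta-table")
delta_table.history().select("version", "timestamp", "operation").show(truncate=False)

# Time travel by timestamp instead of version number
old_df = (spark.read.format("delta")
          .option("timestampAsOf", "2024-01-01 00:00:00")
          .load("/tmp/delta-table"))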