
Behavior of the overwrite in spark

I am regularly uploading data to a parquet file, which I use for my data analysis with Spark, and I want to ensure that the data in my parquet file is not duplicated. The command I use to do this is:

df.write.parquet('my_directory/', mode='overwrite')

Does this ensure that none of my non-duplicated data will be accidentally deleted at some point?

Cheers

Robin Nicole asked Dec 13 '25


1 Answer

Overwrite, as the name implies, rewrites the whole dataset at the path you specify: the data in the df is written to the path after removing any old files already present there. So you can consider this a DELETE and LOAD scenario, where you read all the records from the datasource, let's say Oracle, do your transformations, delete the existing parquet files, and write the new content of the dataframe.

The Dataframe.write supports a list of modes to write the content to the target.

mode –

specifies the behavior of the save operation when data already exists.

append: Append contents of this DataFrame to existing data.

overwrite: Overwrite existing data.

ignore: Silently ignore this operation if data already exists.

error or errorifexists (default case): Throw an exception if data already
exists.

If your intention is to add new data to the parquet file then you have to use append, but this brings in a new challenge of duplicates if you are dealing with changing data.

DataWrangler answered Dec 16 '25


