
Behavior of the overwrite in spark

I am regularly uploading data to a parquet file, which I use for my data analysis with Spark, and I want to ensure that the data in my parquet file is not duplicated. The command I use to do this is:

df.write.parquet('my_directory/', mode='overwrite')

Does this ensure that none of my non-duplicated data will be accidentally deleted at some point?

Cheers

Robin Nicole asked Dec 13 '25


1 Answer

Overwrite, as the name implies, rewrites the whole dataset at the path you specify: the data in the df is written to the path after removing any old files already present there. So you can consider this a DELETE and LOAD scenario, where you read all the records from the datasource, let's say Oracle, do your transformations, delete the existing parquet files, and write the new content of the dataframe.

The Dataframe.write supports a list of modes to write the content to the target.

mode –

specifies the behavior of the save operation when data already exists.

append: Append contents of this DataFrame to existing data.

overwrite: Overwrite existing data.

ignore: Silently ignore this operation if data already exists.

error or errorifexists (default case): Throw an exception if data already
exists.

If your intention is to add new data to the parquet file then you have to use append, but this brings in a new challenge of duplicates if you are dealing with changing data.

DataWrangler answered Dec 16 '25


