I am reading a bunch of CSV files from a specific path:
spark.read.format('csv').load('/mnt/path/')
I am caching my dataframe in order to access corrupt records.
data_frame.cache()
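For reference, here is a minimal sketch of that read-and-cache step. The column names in the schema are hypothetical (they are not given in the question); the corrupt-record column name assumes the default _corrupt_record. Caching is what allows the corrupt-record column to be queried on its own, since Spark 2.3 disallows queries against the raw CSV source that reference only that internal column unless the parsed result is cached or saved first.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: the real columns are not in the question.
schema = StructType([
    StructField('col_a', StringType(), True),
    StructField('col_b', StringType(), True),
    StructField('_corrupt_record', StringType(), True),
])

data_frame = (
    spark.read.format('csv')
    .schema(schema)
    .option('columnNameOfCorruptRecord', '_corrupt_record')
    .load('/mnt/path/')
)

# Cache the parsed result so the corrupt-record column can be filtered on.
data_frame.cache()
bad_rows = data_frame.filter(data_frame['_corrupt_record'].isNotNull())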
At the end of my notebook, I want to remove this path from the cache by calling data_frame.unpersist().
Then I change the underlying data, for example by deleting or adding files at the path.
But if I read the CSVs again with spark.read.format('csv').load('/mnt/path/'), Spark does not pick up the latest changes; it still shows the cached data. This makes me think the dataframe is not really uncached.
The only thing that works is restarting the cluster.
I don't want to use spark.catalog.clearCache(), as that would clear the cache for all jobs running on the cluster. I only want to uncache this specific dataframe from the current notebook.
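One way to check whether that specific dataframe is still cached, without touching anything else on the cluster, is to look at its storage level (available on DataFrame since Spark 2.1). This is only a sketch for verification; the printed levels below assume the default MEMORY_AND_DISK level used by DataFrame.cache().

# Inspect only this DataFrame's cache state; nothing else is affected.
print(data_frame.storageLevel)
# e.g. StorageLevel(True, True, False, True, 1) while cached

data_frame.unpersist()

print(data_frame.storageLevel)
# StorageLevel(False, False, False, False, 1) once the cache entry is gone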
Any suggestion or observation would be much appreciated.
Edit:
I was not assigning the result back to my dataframe. It looks like there is a difference between
data_frame = data_frame.unpersist() and data_frame.unpersist()
Try setting the blocking flag to true so that the call waits until the cached data is actually removed (the Scala signature is def unpersist(blocking: Boolean)):
data_frame.unpersist(blocking=True)
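In PySpark that looks like the sketch below, reusing the path from the question; the re-read afterwards is an assumption about how the notebook continues.

# Block until the cached blocks are actually removed, rather than letting
# the removal happen asynchronously in the background.
data_frame.unpersist(blocking=True)

# Reading the path again now builds a fresh plan that is no longer served
# from the old cached data, so added or deleted files should show up.
data_frame = spark.read.format('csv').load('/mnt/path/')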