I am reading a bunch of CSV files from a specific path:
spark.read.format('csv').load('/mnt/path/')
I am caching my dataframe in order to access corrupt records.
data_frame.cache()
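For reference, here is a minimal sketch of that read-and-cache step. The column names in the schema are hypothetical (they are not given in the question); the corrupt-record column name assumes the default _corrupt_record. Caching is what allows the corrupt-record column to be queried on its own, since Spark 2.3 disallows queries against the raw CSV source that reference only that internal column unless the parsed result is cached or saved first.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: the real columns are not in the question.
schema = StructType([
    StructField('col_a', StringType(), True),
    StructField('col_b', StringType(), True),
    StructField('_corrupt_record', StringType(), True),
])

data_frame = (
    spark.read.format('csv')
    .schema(schema)
    .option('columnNameOfCorruptRecord', '_corrupt_record')
    .load('/mnt/path/')
)

# Cache the parsed result so the corrupt-record column can be filtered on.
data_frame.cache()
bad_rows = data_frame.filter(data_frame['_corrupt_record'].isNotNull())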
At the end of my notebook, I want to remove this path from the cache by calling data_frame.unpersist().
Then I change the underlying data, for example by deleting or adding files at the path.
But if I read the CSVs again with spark.read.format('csv').load('/mnt/path/'), Spark does not pick up the latest changes; it still shows the cached data. This makes me think the dataframe is not really uncached.
The only thing that works is restarting the cluster.
I don't want to use spark.catalog.clearCache(), as that would clear the cache for all jobs running on the cluster. I only want to uncache this specific dataframe from the current notebook.
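One way to check whether that specific dataframe is still cached, without touching anything else on the cluster, is to look at its storage level (available on DataFrame since Spark 2.1). This is only a sketch for verification; the printed levels below assume the default MEMORY_AND_DISK level used by DataFrame.cache().

# Inspect only this DataFrame's cache state; nothing else is affected.
print(data_frame.storageLevel)
# e.g. StorageLevel(True, True, False, True, 1) while cached

data_frame.unpersist()

print(data_frame.storageLevel)
# StorageLevel(False, False, False, False, 1) once the cache entry is gone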
Any suggestion or observation would be much appreciated.
Edit:
I was not assigning the result back to my dataframe. It looks like there is a difference between
data_frame = data_frame.unpersist() and data_frame.unpersist()
Try setting the blocking flag to true so that the call waits until the cached data is actually removed (the Scala signature is def unpersist(blocking: Boolean)):
data_frame.unpersist(blocking=True)
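In PySpark that looks like the sketch below, reusing the path from the question; the re-read afterwards is an assumption about how the notebook continues.

# Block until the cached blocks are actually removed, rather than letting
# the removal happen asynchronously in the background.
data_frame.unpersist(blocking=True)

# Reading the path again now builds a fresh plan that is no longer served
# from the old cached data, so added or deleted files should show up.
data_frame = spark.read.format('csv').load('/mnt/path/')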