I am using Spark to do some calculations. Every 5 minutes I get a new DataFrame, which I put into a dict called dict_1_hour like this:
dict_1_hour[timestamp] = dataframe
New DataFrames come into the dict and old DataFrames are popped out, so only 12 DataFrames are kept in it, i.e. the data for the most recent hour (roughly like the sketch below).
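A simplified sketch of what I do (the 12-entry limit and names like new_df are just my setup):

from collections import OrderedDict

dict_1_hour = OrderedDict()  # timestamp -> DataFrame, oldest first

def add_frame(timestamp, new_df):
    # keep only the 12 most recent frames (1 hour at 5-minute intervals)
    dict_1_hour[timestamp] = new_df
    while len(dict_1_hour) > 12:
        dict_1_hour.popitem(last=False)  # drop the oldest entry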
So my question is: how should I release those DataFrames to make sure there is no memory leak?
One DataFrame API seems able to do it (though I do not know what the parameter is for):
unpersist(blocking=True)
Marks the DataFrame as non-persistent, and remove all blocks for it from memory and disk.
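I guess the usage would be something like this (just my guess; I assume blocking=True makes the call wait until the blocks are actually removed):

# my guess at the usage, on one of the DataFrames stored in the dict
dict_1_hour[timestamp].unpersist(blocking=True)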
The other way I can think of is to just pop the DataFrame from the dict:
dict_1_hour.pop(timestamp)
Python should automatically release the unused object, but I do not know if that is enough here. I worry that Spark may keep the DataFrame's data around if I do not release it explicitly.
So please advise which way I should use.
First of all, DataFrame, similar to RDD, is just a local recursive data structure. It goes through the same garbage collection cycle as any other object, both on the Python and the JVM side.
The second thing you have to consider is persisted data (cache, persist, cacheTable, shuffle files, etc.). This is in general handled internally by Spark and, excluding unpersist, you don't have much control over its lifetime.
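For example, a cached DataFrame keeps its blocks until you unpersist it (or it is garbage collected). A minimal sketch, reusing the same sqlContext as in the snippets below:

# blocks are materialized on the first action and kept until unpersist
df = sqlContext.range(0, 1000).cache()
df.count()                     # materializes the cached blocks
df.unpersist(blocking=True)    # blocking=True waits until the blocks are actually removed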
Keeping these two things in mind, there is not much that can be done beyond a simple del on the object:
try:
del dict_1_hour[timestamp]
except KeyError:
pass
Still, if the DataFrame has been registered as a temporary table, make sure to unregister it first:
from py4j.protocol import Py4JError
try:
sqlContext.dropTempTable("df")
except Py4JError:
pass
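Putting it together for your rolling one-hour window, the eviction step could look roughly like this (a sketch; the temp_table argument and the unpersist call only matter if you actually registered or cached the frame):

from py4j.protocol import Py4JError

def evict(timestamp, temp_table=None):
    df = dict_1_hour.pop(timestamp, None)
    if df is None:
        return
    if temp_table is not None:
        try:
            sqlContext.dropTempTable(temp_table)   # unregister first, if it was registered
        except Py4JError:
            pass
    df.unpersist(blocking=True)  # no-op unless the frame was cached/persisted
    del df                       # drop the last local reference; Python/JVM GC does the rest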