How to release a DataFrame in Spark?

I am using Spark to do some calculations. Every 5 minutes I get a new DataFrame and put it into a dict called dict_1_hour like this:

dict_1_hour[timestamp] = dataframe

New DataFrames come into the dict and old ones are popped out, so only 12 DataFrames are kept at any time, i.e. the data for the most recent hour.
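
For illustration, a simplified sketch of how I maintain that rolling one-hour window (the helper name add_frame and the use of OrderedDict are just how I sketch it here, nothing Spark-specific):

from collections import OrderedDict

dict_1_hour = OrderedDict()

def add_frame(timestamp, dataframe):
    # Keep only the 12 most recent DataFrames (one hour of 5-minute batches).
    dict_1_hour[timestamp] = dataframe
    while len(dict_1_hour) > 12:
        dict_1_hour.popitem(last=False)  # drops the oldest entry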

So my question is: how should I release those DataFrames to make sure there is no memory leak?

One DataFrame API seems to be able to do it (although I do not know what the parameter is for):

unpersist(blocking=True)
Marks the DataFrame as non-persistent, and remove all blocks for it from memory and disk.

The other way I can think of is to just pop the DataFrame from the dict:

dict_1_hour.pop(timestamp)

Python should automatically release the unused object, but I do not know if that is appropriate here. I am worried that Spark may keep the DataFrame around if you do not release it explicitly.

So please advise on which way I should use.

asked by Kramer Li


1 Answer

First of all, a DataFrame, similar to an RDD, is just a local recursive data structure. It goes through the same garbage collection cycle as any other object, both on the Python and the JVM side.

The second part you have to consider is persisted data (cache, persist, cacheTable, shuffle files, etc.). This is in general handled internally by Spark and, with the exception of unpersist, you don't have much control over its lifetime.
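
For example, if a DataFrame in the dict was explicitly cached, a minimal sketch of releasing it before dropping the reference could look like this (df and dict_1_hour refer to the names from the question; blocking=True just makes the call wait until the blocks are actually removed):

    df = dict_1_hour[timestamp]

    # Release cached blocks from both memory and disk before dropping the reference.
    df.unpersist(blocking=True)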

Keeping these two things in mind, there is not much that can be done beyond a simple del on the object:

try:
    del dict_1_hour[timestamp]
except KeyError:
    pass

Still, if the DataFrame has been registered as a temporary table, make sure to unregister it first:

from py4j.protocol import Py4JError

try:
    sqlContext.dropTempTable("df")
except Py4JError:
    pass
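
Putting both steps together, a rough sketch of an eviction routine for the dict from the question might look like this (the function name evict_oldest and the table_name argument are assumptions for illustration, not part of the original code):

    from py4j.protocol import Py4JError

    def evict_oldest(dict_1_hour, timestamp, sqlContext, table_name=None):
        # Unregister the temporary table first, if one was created for this DataFrame.
        if table_name is not None:
            try:
                sqlContext.dropTempTable(table_name)
            except Py4JError:
                pass

        # Release any cached blocks, then drop the Python reference.
        try:
            df = dict_1_hour.pop(timestamp)
            df.unpersist(blocking=True)
        except KeyError:
            pass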
answered by zero323