Creating dictionary from Pyspark dataframe showing OutOfMemoryError: Java heap space

Question

I have seen and tried many existing StackOverflow posts regarding this issue but none work. I guess my JAVA heap space is not as large as expected for my large dataset, My dataset contains 6.5M rows. My Linux instance contains 64GB Ram with 4 cores. As per this suggestion I need to fix my code but I think making a dictionary from pyspark dataframe should not be very costly. Please advise me if any other way to compute that.

I just want to make a python dictionary from my pyspark dataframe, this is the content of my pyspark dataframe,

property_sql_df.show() shows,

+--------------+------------+--------------------+--------------------+
|            id|country_code|       name|          hash_of_cc_pn_li|
+--------------+------------+--------------------+--------------------+
|  BOND-9129450|          US|Scotron Home w/Ga...|90cb0946cf4139e12...|
|  BOND-1742850|          US|Sited in the Mead...|d5c301f00e9966483...|
|  BOND-3211356|          US|NEW LISTING - Com...|811fa26e240d726ec...|
|  BOND-7630290|          US|EC277- 9 Bedroom ...|d5c301f00e9966483...|
|  BOND-7175508|          US|East Hampton Retr...|90cb0946cf4139e12...|
+--------------+------------+--------------------+--------------------+

What I want is to make a dictionary with hash_of_cc_pn_li as key and id as a list value.

Expected Output

{
  "90cb0946cf4139e12": ["BOND-9129450", "BOND-7175508"]
  "d5c301f00e9966483": ["BOND-1742850","BOND-7630290"]
}

What I have tried so far,

%%time
duplicate_property_list = {}
for ind in property_sql_df.collect(): 
     hashed_value = ind.hash_of_cc_pn_li
     property_id = ind.id
     if hashed_value in duplicate_property_list:
         duplicate_property_list[hashed_value].append(property_id) 
     else:
         duplicate_property_list[hashed_value] = [property_id]

What I get now on the console:

java.lang.OutOfMemoryError: Java heap space

and showing this error on Jupyter notebook output

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33097)

Eric Doi · Accepted Answer

making a dictionary from pyspark dataframe should not be very costly

This is true in terms of runtime, but this will easily take up a lot of space. Especially if you're doing property_sql_df.collect(), at which point you're loading your entire dataframe into driver memory. At 6.5M rows, you'll already hit 65GB if each row has 10KB, or 10K characters, and we haven't even gotten to the dictionary yet.

First, you can collect just the columns you need (e.g. not name). Second, you can do the aggregation upstream in Spark, which will save some space depending on how many ids there are per hash_of_cc_pn_li:

rows = property_sql_df.groupBy("hash_of_cc_pn_li") \
  .agg(collect_set("id").alias("ids")) \
  .collect()

duplicate_property_list = { row.hash_of_cc_pn_li: row.ids for row in rows }

Napoleon Borntoparty · Answer

Adding accepted answer from linked post for posterity. The answer solves the problem by leveraging write.json method and preventing the collection of too-large dataset to the Driver here:

https://stackoverflow.com/a/63111765/12378881

Creating dictionary from Pyspark dataframe showing OutOfMemoryError: Java heap space

Tags:

java

python

apache-spark

pyspark

Always Sunny

2 Answers

Eric Doi

Napoleon Borntoparty

Recent Activity

Donate For Us

Creating dictionary from Pyspark dataframe showing OutOfMemoryError: Java heap space

Tags:

java

python

apache-spark

pyspark

Always Sunny

2 Answers

Eric Doi

Napoleon Borntoparty

Related questions

Recent Activity

Donate For Us