Unable to load 25GB dataset in PySpark local mode with 56GB RAM free

Question

I am having trouble loading and processing a 25GB Parquet dataset (of stackoverflow.com posts) on a single beefy machine in local mode with 12 cores/64GB of RAM.

I have more memory on my machine that is free and allocated to pyspark than the size of a Parquet dataset (let alone two columns of the dataset), and yet I am unable to run any operations on the DataFrame once I load it. This is confusing, and I can't figure out what to do.

Specifically, I have a Parquet dataset that is 25GB:

$ du -sh data/stackoverflow/parquet/Posts.df.parquet

25G data/stackoverflow/parquet/Posts.df.parquet

I have a machine with 56GB of free RAM:

$ free -h

              total        used        free      shared  buff/cache   
available
Mem:            62G        4.7G         56G         23M        1.7G         
57G
Swap:           63G          0B         63G

I have configured PySpark to use 50GB of RAM (have tried adusting maxResultSize to no effect).

My configuration looks like this:

$ cat ~/spark/conf/spark-defaults.conf

spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
spark.driver.memory 50g
spark.jars ...
spark.executor.cores 12
spark.driver.maxResultSize 20g

My environment looks like this:

$ cat ~/spark/conf/spark-env.sh

PYSPARK_PYTHON=python3
PYSPARK_DRIVER_PYTHON=python3
SPARK_WORKER_DIR=/nvm/spark/work
SPARK_LOCAL_DIRS=/nvm/spark/local
SPARK_WORKER_MEMORY=50g
SPARK_WORKER_CORES=12

I load the data like this:

$ pyspark

>>> posts = spark.read.parquet('data/stackoverflow/parquet/Posts.df.parquet')

It loads ok, but any operation - including if I run a limit(10) on the DataFrame first - results in an out of heap space error.

>>> posts.limit(10)\
    .select('_ParentId','_Body')\
    .filter(posts._ParentId == 9915705)\
    .show()

[Stage 1:>                                                       (0 + 12) / 195]19/06/30 17:26:13 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 8)
java.lang.OutOfMemoryError: Java heap space
19/06/30 17:26:13 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
java.lang.OutOfMemoryError: Java heap space
19/06/30 17:26:13 ERROR Executor: Exception in task 5.0 in stage 1.0 (TID 6)
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.parquet.bytes.HeapByteBufferAllocator.allocate(HeapByteBufferAllocator.java:32)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1166)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/06/30 17:26:13 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 11)
java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
    at org.apache.parquet.bytes.HeapByteBufferAllocator.allocate(HeapByteBufferAllocator.java:32)
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1166)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
19/06/30 17:26:13 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 7)
java.lang.OutOfMemoryError: Java heap space
19/06/30 17:26:13 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 7,5,main]
java.lang.OutOfMemoryError: Java heap space
19/06/30 17:26:13 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 11,5,main]
java.lang.OutOfMemoryError: Java heap space
...

The following will run, suggesting the problem is the _Body field (obviously the largest):

>>> posts.limit(10).select('_Id').show()

+---+
|_Id|
+---+
|  4|
|  6|
|  7|
|  9|
| 11|
| 12|
| 13|
| 14|
| 16|
| 17|
+---+

What am I to do? I could use EMR, but I would like to be able to load this dataset locally and that seems an entirely reasonable thing to be able to do in this situation.

congyh · Accepted Answer

The default memory fraction for Spark's storage and computation is 0.6. Under your config it will be 0.6 * 50GB = 30GB. But the representation of data in memory may consume more space than the serialized disk version.

Please check the section of Memory Management to get more details.

Unable to load 25GB dataset in PySpark local mode with 56GB RAM free

Tags:

java

python

heap-memory

apache-spark

pyspark

rjurney

1 Answers

congyh

Recent Activity

Donate For Us

Unable to load 25GB dataset in PySpark local mode with 56GB RAM free

Tags:

java

python

heap-memory

apache-spark

pyspark

rjurney

1 Answers

congyh

Related questions

Recent Activity

Donate For Us