Setting spark.local.dir in Pyspark/Jupyter

I'm using Pyspark from a Jupyter notebook and attempting to write a large parquet dataset to S3.
I get a 'no space left on device' error. I searched around and learned that it's because /tmp is filling up.
I now want to set spark.local.dir to point to a directory that has space.
How can I set this parameter?
Most solutions I found suggest setting it via spark-submit, but I am not using spark-submit; I'm running my code from a Jupyter notebook.

Edit: I'm using Sparkmagic to work with an EMR backend. I think spark.local.dir needs to be set in the config JSON, but I am not sure how to specify it there.
I tried adding it in session_configs but it didn't work.

1 Answer

The answer depends on where your SparkContext comes from.

If you are starting Jupyter with pyspark:

PYSPARK_DRIVER_PYTHON='jupyter' \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
PYSPARK_PYTHON="python" \
pyspark

then your SparkContext is already initialized by the time your Python kernel starts in Jupyter. In that case, pass the setting to pyspark itself by appending --conf spark.local.dir=... to the command above.
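
For example, the full launch command might look like this (the directory path is a placeholder; substitute any location with enough free space):

PYSPARK_DRIVER_PYTHON='jupyter' \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
PYSPARK_PYTHON="python" \
pyspark --conf spark.local.dir=/path/with/space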

If you are constructing a SparkContext in Python:

If you have code in your notebook like:

import pyspark
sc = pyspark.SparkContext()

then you can configure the Spark context before creating it:

import pyspark
conf = pyspark.SparkConf()
conf.set('spark.local.dir', '...')
sc = pyspark.SparkContext(conf=conf)
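
As a quick sanity check, you can read the value back from the running context; note that spark.local.dir must be set before the SparkContext is created, since it is not picked up afterwards:

print(sc.getConf().get('spark.local.dir'))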

Configuring Spark in spark-defaults.conf:

It's also possible to configure Spark by editing its defaults file, ${SPARK_HOME}/conf/spark-defaults.conf. You can append the setting from a shell as follows (creating the file if it doesn't exist):

echo 'spark.local.dir /foo/bar' >> ${SPARK_HOME}/conf/spark-defaults.conf
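
You can confirm the line was appended with a quick check; keep in mind that only Spark sessions started after this change will see the new value:

grep spark.local.dir ${SPARK_HOME}/conf/spark-defaults.conf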