I am confused about dealing with executor memory and driver memory in Spark.
My environment settings are as below:
Input data information:
For simple development, I executed my Python code in standalone cluster mode (8 workers, 20 cores, 45.3 G memory) with spark-submit. Now I would like to set executor memory or driver memory for performance tuning.
From the Spark documentation, the definition for executor memory is
Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).
How about driver memory?
Executors are worker nodes' processes in charge of running individual tasks in a given Spark job and The spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.
You can resolve it by setting the partition size: increase the value of spark. sql. shuffle. partitions.
The memory you need to assign to the driver depends on the job.
If the job is based purely on transformations and terminates on some distributed output action like rdd.saveAsTextFile, rdd.saveToCassandra, ... then the memory needs of the driver will be very low. Few 100's of MB will do. The driver is also responsible of delivering files and collecting metrics, but not be involved in data processing.
If the job requires the driver to participate in the computation, like e.g. some ML algo that needs to materialize results and broadcast them on the next iteration, then your job becomes dependent of the amount of data passing through the driver. Operations like .collect,.take and takeSample deliver data to the  driver and hence, the driver needs enough memory to allocate such data.
e.g. If you have an rdd of 3GB in the cluster and call val myresultArray = rdd.collect, then you will need 3GB of memory in the driver to hold that data plus some extra room for the functions mentioned in the first paragraph.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With