
Apache Spark's performance tuning

Tags:

apache-spark

I am working on a project where I have to tune Spark's performance, and I have found what appear to be the four most important parameters for doing so. They are as follows:

  1. spark.memory.fraction
  2. spark.memory.offHeap.size
  3. spark.storage.memoryFraction
  4. spark.shuffle.memoryFraction

I wanted to know whether I am going in the right direction. Please also let me know if I have missed any other important parameters.
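For reference, here is a minimal sketch of how I am wiring these together in a SparkConf (the app name and the values are placeholders, not recommendations):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Placeholder values, for illustration only.
    val conf = new SparkConf()
      .setAppName("tuning-test")                    // hypothetical app name
      .set("spark.memory.fraction", "0.6")
      .set("spark.memory.offHeap.enabled", "true")  // needed for offHeap.size to take effect
      .set("spark.memory.offHeap.size", "2g")
      .set("spark.storage.memoryFraction", "0.6")
      .set("spark.shuffle.memoryFraction", "0.2")

    val spark = SparkSession.builder().config(conf).getOrCreate()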

Thanks in advance.

Srinivas Shekar asked Oct 24 '25

1 Answer

Honestly, this is quite broad to answer. The right path to optimizing performance is mainly described in the official documentation, in the section on Tuning Spark.

Generally speaking, there are lots of factors to consider when optimizing Spark jobs:

  • Data Serialization
  • Memory Tuning
  • Level of Parallelism
  • Memory Usage of Reduce Tasks
  • Broadcasting Large Variables
  • Data Locality

It mainly centers around data serialization, memory tuning, and a trade-off between precision and approximation techniques to get the job done fast.
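As a minimal illustration of the serialization and parallelism points, the sketch below switches to Kryo and registers a hypothetical record class (MyRecord, the app name, and the parallelism value are placeholders, not taken from the question):

    import org.apache.spark.SparkConf

    // Hypothetical record type, standing in for whatever classes the job shuffles.
    case class MyRecord(id: Long, name: String)

    val conf = new SparkConf()
      .setAppName("serialization-demo")  // placeholder name
      // Kryo is faster and more compact than the default Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes lets Kryo avoid writing full class names.
      .registerKryoClasses(Array(classOf[MyRecord]))
      // Level of parallelism: the tuning guide suggests 2-3 tasks per CPU core.
      .set("spark.default.parallelism", "200")  // placeholder value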

EDIT:

Courtesy of @zero323:

I'd point out that all but one of the options mentioned in the question are deprecated and used only in legacy mode.
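To make that concrete: since the unified memory manager arrived in Spark 1.6, the two fraction settings below are read only if legacy mode is explicitly re-enabled. A sketch for illustration only (note that spark.memory.useLegacyMode was removed in Spark 3.0):

    import org.apache.spark.SparkConf

    val legacyConf = new SparkConf()
      .set("spark.memory.useLegacyMode", "true")   // opt back into the pre-1.6 manager
      .set("spark.storage.memoryFraction", "0.6")  // legacy-only setting
      .set("spark.shuffle.memoryFraction", "0.2")  // legacy-only setting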

eliasah answered Oct 26 '25


