How to use long-lived expensive-to-instantiate utility services where executors run?

Tags:

apache-spark

My Spark processing logic depends upon long-lived, expensive-to-instantiate utility objects to perform data-persistence operations. Not only are these objects probably not Serializable, but it is probably impractical to distribute their state in any case, as said state likely includes stateful network connections.

What I would like to do instead is instantiate these objects locally within each executor, or locally within threads spawned by each executor. (Either alternative is acceptable, as long as the instantiation does not take place on each tuple in the RDD.)

Is there a way to write my Spark driver program such that it directs executors to invoke a function to instantiate an object locally (and cache it in the executor's local JVM memory space), rather than instantiating it within the driver program then attempting to serialize and distribute it to the executors?
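One common way to get the per-executor caching described above is a JVM-level lazy singleton: each executor runs in its own JVM, and a Scala `object` is initialized at most once per JVM. The sketch below is illustrative, not an established API: `ExpensiveClient` and its endpoint argument are hypothetical stand-ins for the real utility service.

```scala
// Hypothetical stand-in for a long-lived, non-serializable utility service.
class ExpensiveClient(endpoint: String) {
  def persist(record: String): String = s"stored $record via $endpoint"
}

// A Scala `object` is instantiated at most once per JVM, i.e. once per
// executor. The `lazy val` defers construction until the first task on
// that executor touches it, so nothing is serialized from the driver --
// the shipped closure only references the object by name.
object ExecutorLocal {
  lazy val client: ExpensiveClient = new ExpensiveClient("db-host:5432")
}

// Driver-side usage (sketch): the closure captures no client instance.
// rdd.map(record => ExecutorLocal.client.persist(record))
```

Every task scheduled on the same executor then reuses the one cached instance, rather than instantiating per tuple or per task.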

asked Feb 01 '26 12:02 by sumitsu

1 Answer

It is possible to share objects at the partition level, rather than per tuple or across the whole cluster.

I've used the approach described in "How to make Apache Spark mapPartition work correctly?": instantiate the object inside `mapPartitions`, so one instance serves every element of the partition.

Then repartition so that `numPartitions` is a multiple of the number of executors, which spreads the partitions (and therefore the instantiations) evenly across them.
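A hedged sketch of that partition-level approach, assuming a hypothetical non-serializable `ExpensiveClient` and a local Spark session for illustration. The client is constructed inside `mapPartitions` on the executor, once per partition, and is never serialized from the driver.

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  // Hypothetical utility; constructed on the executor, never shipped.
  class ExpensiveClient {
    def persist(record: Int): String = s"stored $record"
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("partition-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // 4 partitions: with mapPartitions the client is built 4 times total,
    // not once per element. Because RDD iterators are lazy, defer any
    // cleanup until the partition's iterator has been fully consumed.
    val rdd = sc.parallelize(1 to 8, numSlices = 4)
    val results = rdd.mapPartitions { records =>
      val client = new ExpensiveClient()
      records.map(client.persist)
    }

    println(results.collect().mkString(", "))
    spark.stop()
  }
}
```

To align partitions with executors as described above, you could follow this with something like `rdd.repartition(executorCount * k)` for a small integer `k` (both names are placeholders here), so that each executor receives a balanced share of partitions.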

answered Feb 04 '26 00:02 by Paul K.

