My Spark processing logic depends upon long-lived, expensive-to-instantiate utility objects to perform data-persistence operations. Not only are these objects probably not Serializable, but distributing their state is probably impractical in any case, since that state includes stateful network connections.
What I would like to do instead is instantiate these objects locally within each executor, or locally within threads spawned by each executor. (Either alternative is acceptable, as long as the instantiation does not take place on each tuple in the RDD.)
Is there a way to write my Spark driver program such that it directs executors to invoke a function to instantiate an object locally (and cache it in the executor's local JVM memory space), rather than instantiating it within the driver program then attempting to serialize and distribute it to the executors?
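For illustration, here is a minimal sketch of the per-executor pattern the question describes, using a lazily initialized Scala singleton object that each executor JVM constructs on first use, so nothing is serialized from the driver. The names `PersistenceClient` and `ConnectionPool` are hypothetical placeholders, not a real API; this is one common way to get that behaviour, under those assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical expensive, non-serializable utility (placeholder, not a real API).
class PersistenceClient(endpoint: String) {
  def save(record: String): Unit = { /* stateful network write */ }
}

// A Scala `object` is a per-JVM singleton: it is initialized lazily, once,
// the first time a task running on a given executor touches it, and it is
// never serialized from the driver.
object ConnectionPool {
  lazy val client: PersistenceClient = new PersistenceClient("db-host:9042")
}

object Driver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("per-executor-init").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000)

    rdd.foreach { n =>
      // Nothing about the singleton is captured in the closure; the client
      // is constructed at most once per executor JVM, on first access.
      ConnectionPool.client.save(n.toString)
    }

    spark.stop()
  }
}
```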
It is possible to share objects at the partition level: instantiate the object once per partition inside mapPartitions. I've tried the approach from How to make Apache Spark mapPartition work correctly?, repartitioning so that numPartitions is a multiple of the number of executors.
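A minimal sketch of that partition-level approach, assuming the same hypothetical `PersistenceClient` as above: the client is constructed once inside mapPartitions, reused for every tuple in the partition, and closed after the partition has been processed.

```scala
import org.apache.spark.sql.SparkSession

// Minimal stub of the hypothetical client from the sketch above.
class PersistenceClient(endpoint: String) {
  def save(record: String): Unit = { /* stateful network write */ }
  def close(): Unit = { /* tear down the connection */ }
}

object PartitionLevelInit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("per-partition-init").getOrCreate()

    // Repartition so that numPartitions is a multiple of the executor count,
    // which keeps the number of client instantiations predictable.
    val data = spark.sparkContext.parallelize(1 to 10000).repartition(8)

    val written = data.mapPartitions { iter =>
      // One client per partition, reused for every tuple in the partition.
      val client = new PersistenceClient("db-host:9042")
      // Materialize the partition so the client can be closed before the
      // iterator is handed back (Spark consumes the iterator lazily).
      val results = iter.map { n => client.save(n.toString); n }.toList
      client.close()
      results.iterator
    }

    println(written.count())
    spark.stop()
  }
}
```

Note that materializing the partition with `toList` keeps the whole partition in memory; with very large partitions, an alternative is to leave the mapping lazy and tie the client's cleanup to task completion instead.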