As far as I understand, when using the Spark Scala interface we have to be careful not to serialize an entire object unnecessarily when only one or two of its attributes are needed (see http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/).
How does this work when using PySpark? If I have a class as follows:
class C0(object):
  def func0(self, arg):
    ...
  def func1(self, rdd):
    result = rdd.map(lambda x: self.func0(x))
    return result
Does this result in pickling the full C0 instance? If yes, what is the correct way to avoid it?
Thanks.
Yes, this does result in pickling the full C0 instance, as described in the documentation on passing functions to Spark: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark.
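One rough way to see this outside of Spark is to serialize such a lambda with cloudpickle, the serializer PySpark uses to ship closures to executors. This is only a sketch; the payload attribute and the make_mapper helper below are made up for illustration:

import cloudpickle  # the serializer PySpark uses for closures

class C0(object):
  def __init__(self):
    self.payload = bytearray(10 * 1024 * 1024)  # hypothetical ~10 MB of instance state
  def func0(self, arg):
    return arg
  def make_mapper(self):
    return lambda x: self.func0(x)  # the lambda closes over self

# Roughly 10 MB: the whole instance (payload included) rides along with the lambda.
print(len(cloudpickle.dumps(C0().make_mapper())))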
In order to avoid it, do something like:
class C0(object):
  def func0(self, arg):
    ...
  def func1(self, rdd):
    func = self.func0  # bind the method to a local name
    result = rdd.map(lambda x: func(x))  # the lambda references func, not self
    return result
Moral of the story: avoid referencing self anywhere inside a function you pass to map. Spark only needs to serialize what the function's closure actually captures, but any reference to self pulls your entire object into that closure and forces Spark to serialize it.
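If the mapped function only needs a value from the object rather than a whole method, the programming guide linked above recommends copying that field into a local variable first. A minimal sketch of that pattern, with a hypothetical offset attribute:

class C0(object):
  def __init__(self):
    self.offset = 1  # hypothetical field the computation needs
  def func1(self, rdd):
    offset = self.offset  # copy only the needed value into a local
    return rdd.map(lambda x: x + offset)  # the lambda captures the local, not self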