PySpark serializing the 'self' referenced object in map lambdas?

As far as I understand, when using the Spark Scala interface we have to be careful not to serialize a full object unnecessarily when only one or two of its attributes are needed (see http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/).

How does this work when using PySpark? If I have a class as follows:

class C0(object):

  def func0(arg):
    ...

  def func1(rdd):
    result = rdd.map(lambda x: self.func0(x))

Does this result in pickling the full C0 instance? If so, what's the correct way to avoid it?

Thanks.

asked Oct 31 '25 by x89a10
1 Answer

Yes, this does result in pickling the full C0 instance; see the Spark documentation on passing functions to Spark: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark.

In order to avoid it, do something like:

class C0(object):

  def func0(self, arg): # added self
    ...

  def func1(self, rdd): # added self
    func = self.func0
    result = rdd.map(lambda x: func(x))

Moral of the story: avoid referencing self anywhere inside a function you pass to map. Spark can serialize a single function on its own when it can resolve it from a local closure, but any reference to self forces Spark to serialize your entire object.
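
The same concern applies to instance attributes, not only methods. Below is a minimal sketch of the local-variable pattern, assuming a local SparkContext; the class name C1, the attribute factor, and the method names are made up for illustration, but the pattern mirrors the field = self.field example in the Spark programming guide linked above:

from pyspark import SparkContext

class C1(object):

  def __init__(self, factor):
    self.factor = factor  # hypothetical attribute used inside the map

  def scale_bad(self, rdd):
    # Referencing self.factor inside the lambda captures self, so the
    # whole C1 instance is pickled and shipped to the executors.
    return rdd.map(lambda x: x * self.factor)

  def scale_good(self, rdd):
    # Copy the attribute into a local variable first; the closure then
    # captures only the local value, not the enclosing instance.
    factor = self.factor
    return rdd.map(lambda x: x * factor)

if __name__ == "__main__":
  sc = SparkContext("local", "closure-demo")  # assumed local test context
  rdd = sc.parallelize([1, 2, 3])
  print(C1(10).scale_good(rdd).collect())  # [10, 20, 30]
  sc.stop()

This is the same idea as the hygienic-closures post linked in the question: pull whatever the task actually needs out of self before building the closure.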

answered Nov 03 '25 by Matt Messersmith