 

Simple way to parallelize embarrassingly parallelizable generator

I have a generator (or a list of generators). Let's call them gens.

Each generator in gens is a complicated function that returns the next value of a complicated procedure. Fortunately, they are all independent of one another.

I want to call gen.__next__() for each element gen in gens, and return the resulting values in a list. However, multiprocessing is unhappy with pickling generators.

Is there a fast, simple way to do this in Python? I would like gens, of length m, to be mapped onto n cores on my local machine, where n may be larger or smaller than m. Each generator should run on a separate core.

If this is possible, can someone provide a minimal example?

user650261 asked Nov 20 '25 23:11


1 Answer

You can't pickle generators. Read more about it here.
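A quick way to see the failure for yourself is the sketch below. The `countdown` generator is a hypothetical stand-in for one of the generators in gens; the only thing it relies on is that `pickle.dumps` raises `TypeError` when handed a generator object, which is the same error multiprocessing runs into when it tries to send a generator to a worker.

```python
import pickle


def countdown(n):
    """Hypothetical stand-in for one of the 'complicated' generators."""
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)

try:
    pickle.dumps(gen)  # multiprocessing does the equivalent of this internally
    pickled_ok = True
    error_message = ""
except TypeError as exc:
    # The generator's frame (locals, instruction pointer) cannot be serialized
    pickled_ok = False
    error_message = str(exc)

print(pickled_ok)
print(error_message)
```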

There is a blog post that explains it in much more detail. Quoting from it:

Let’s ignore that problem for a moment and look what we would need to do to pickle a generator. Since a generator is essentially a souped-up function, we would need to save its bytecode, which is not guarantee to be backward-compatible between Python’s versions, and its frame, which holds the state of the generator such as local variables, closures and the instruction pointer. And this latter is rather cumbersome to accomplish, since it basically requires to make the whole interpreter picklable. So, any support for pickling generators would require a large number of changes to CPython’s core.

Now if an object unsupported by pickle (e.g., a file handle, a socket, a database connection, etc) occurs in the local variables of a generator, then that generator could not be pickled automatically, regardless of any pickle support for generators we might implement. So in that case, you would still need to provide custom getstate and setstate methods. This problem renders any pickling support for generators rather limited.

The author also suggests a solution: use simple iterators.

the best solution to this problem to the rewrite the generators as simple iterators (i.e., one with a __next__ method). Iterators are easy and efficient space-wise to pickle because their state is explicit. You would still need to handle objects representing some external state explicitly however; you cannot get around this.
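As a minimal sketch of that idea applied to the question (not something taken from the blog post), the code below rewrites a generator as an iterator class with explicit state and advances a list of such iterators with `multiprocessing.Pool`. The `Collatz` class and the `advance` and `step_all` helpers are hypothetical names made up for illustration. Because each worker receives a pickled copy of the iterator, the worker has to send the updated iterator back along with the value.

```python
from multiprocessing import Pool


class Collatz:
    """Hypothetical 'complicated procedure' written as a plain iterator.

    All state lives in instance attributes, so the object pickles cleanly.
    """

    def __init__(self, start):
        self.value = start

    def __iter__(self):
        return self

    def __next__(self):
        current = self.value
        self.value = current // 2 if current % 2 == 0 else 3 * current + 1
        return current


def advance(iterator):
    """Run one step in a worker; return the value and the updated iterator."""
    return next(iterator), iterator


def step_all(iterators, processes=None):
    """Advance every iterator once, spreading the work over a process pool."""
    with Pool(processes) as pool:
        results = pool.map(advance, iterators)
    values = [value for value, _ in results]
    updated = [iterator for _, iterator in results]
    return values, updated


if __name__ == "__main__":
    gens = [Collatz(n) for n in (6, 7, 27)]
    values, gens = step_all(gens, processes=2)
    print(values)  # [6, 7, 27]
    values, gens = step_all(gens, processes=2)
    print(values)  # [3, 22, 82]
```

The pool size is independent of the number of iterators, so m iterators can be mapped onto n processes in either direction. Every step pickles each iterator to the worker and back, so this is likely to pay off only when a single `__next__` call is expensive compared with serializing the state.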

Another suggested solution (which I haven't tried) is this:

  1. Convert the generator to a class in which the generator code is the __iter__ method

  2. Add __getstate__ and __setstate__ methods to the class, to handle pickling. Remember that you can't pickle file objects. So __setstate__ will have to re-open files, as necessary.
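The two steps above might look roughly like the following sketch, which I have only tried on a toy case. `LineReader` is a hypothetical example class: its generator code lives in `__iter__`, its progress is stored on the instance as a file offset, and `__setstate__` re-opens the file and seeks back to that offset. Note that the generator object returned by `iter()` is still not picklable. It is the instance that gets pickled, and you call `iter()` again on the restored copy to resume.

```python
import os
import pickle
import tempfile


class LineReader:
    """Hypothetical example: yields lines of a file and can be pickled mid-way."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # explicit progress marker, saved when pickling
        self._file = open(path, "r")

    def __iter__(self):
        # The former generator body; progress is recorded on self before each yield
        while True:
            line = self._file.readline()
            if not line:
                return
            self.offset = self._file.tell()
            yield line.rstrip("\n")

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_file"]  # file objects cannot be pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._file = open(self.path, "r")  # re-open the file ...
        self._file.seek(self.offset)       # ... and resume where we left off


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("first\nsecond\nthird\n")

    reader = LineReader(tmp.name)
    print(next(iter(reader)))  # first

    restored = pickle.loads(pickle.dumps(reader))
    print(list(restored))  # ['second', 'third']

    reader._file.close()
    restored._file.close()
    os.remove(tmp.name)
```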

Chen A. answered Nov 22 '25 13:11