Python multiprocessing Pool Interaction With Namespace At Creation

We know that a multiprocessing.Pool must be created after the functions that will run on it have been defined. However, I find the behavior of the code below inscrutable:

import os
from multiprocessing import Pool

def func(i): print('first')

pool1 = Pool(2)
pool1.map(func, range(2))         #map-1

def func(i): print('second')
func2 = func

print('------')
pool1.map(func,  range(2))        #map-2
pool1.map(func2,  range(2))       #map-3

pool2 = Pool(2)
print('------')
pool2.map(func,   range(2))       #map-4
pool2.map(func2,  range(2))       #map-5

The output (Python 2.7 and Python 3.4 on Linux) is

first         #map-1
first
------
first         #map-2
first
first         #map-3
first
------
second        #map-4
second
second        #map-5
second

map-2 prints 'first', just as we expected. But how does map-3 find the name func2? After all, pool1 was created before func2 first appears. So func2 = func does seem to take effect, while def func(i): print('second') does not. Why?

And if I define func2 directly by

def func2(i): print('second')

then map-3 won't find the name func2, as mentioned in many posts, e.g. this one. What's the difference between the two cases?

As I understand the arguments are passed to the slave processes by pickling, but how does pool pass the called function to other processes? Or how do sub-processes find the called function?

Syrtis Major asked Sep 06 '25 03:09


1 Answer

tl;dr: map-3 calls the first func, where one would expect the second, because Pool.map() pickles the function by reference. pickle serializes only the function's own name (func.__name__), which is still func even when the object is reached through the func2 reference. That name is sent to the child process, which looks up func in its own namespace, where it is still the first definition.



OK, so I count four different questions, listed below. I'll assume you already know about namespaces and forking processes, so we can get straight to the fun part of your question ☺

① But how does map-3 find the name func2?

② So func2 = func is indeed executed, while def func(i): print('second') is not. Why?

③ Then map-3 won't find name func2 as mentioned by many posts, eg. this one. What's the difference between two cases?

④ As I understand the arguments are passed to the slave processes by pickling, but how does pool pass the called function to other processes? Or how do sub-processes find the called function?

So I've added a bit more code, to show off more of the internals:

from __future__ import print_function  # only needed on Python 2.7, for print(..., end=...)
import os
from multiprocessing import Pool

print(os.getpid(), 'parent')

def func(i):
    print(os.getpid(), 'first', end=" | ")
    if 'func' in globals():
        print(globals()['func'], end=" | ")
    else:
        print("no func in globals", end=" | ")
    if 'func2' in globals():
        print(globals()['func2'])
    else:
        print("no func2 in globals")

print('------ map-1')
pool1 = Pool(2)
pool1.map(func, range(2))         #map-1

def func(i):
    print(os.getpid(), 'second', end=" | ")
    if 'func' in globals():
        print(globals()['func'], end=" | ")
    else:
        print("no func in globals", end=" | ")
    if 'func2' in globals():
        print(globals()['func2'])
    else:
        print("no func2 in globals")
func2 = func

print('------ map-2')
pool1.map(func,  range(2))        #map-2
print('------ map-3')
pool1.map(func2,  range(2))       #map-3

pool2 = Pool(2)
print('------ map-4')
pool2.map(func,   range(2))       #map-4
print('------ map-5')
pool2.map(func2,  range(2))       #map-5

which outputs on my system:

21512 parent
------ map-1
21513 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
21514 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
------ map-2
21513 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
21514 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
------ map-3
21513 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
21514 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
------ map-4
21518 second | <function func at 0x7f62d531bed8> | <function func at 0x7f62d531bed8>
21519 second | <function func at 0x7f62d531bed8> | <function func at 0x7f62d531bed8>
------ map-5
21518 second | <function func at 0x7f62d531bed8> | <function func at 0x7f62d531bed8>
21519 second | <function func at 0x7f62d531bed8> | <function func at 0x7f62d531bed8>

So we can see that for pool1, func2 is never added to the workers' namespace. Something fishy is definitely going on there, and it's getting too late for me to dig through the multiprocessing source and the debugger to understand exactly what.

So if I had to guess an answer to ①: the pickle module somehow finds out that func2 resolves to 0x7f62d531bed8, which is already known under the label func, so it pickles that label, func, which on the child's side resolves to 0x7f62d67f7cf8, i.e.:

func2 → 0x7f62d531bed8 → func → [PICKLE] → globals()['func'] → 0x7f62d67f7cf8
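
One way to check this guess without involving multiprocessing at all is to look at what pickle writes for a function. The sketch below is not part of the original experiment. It builds a throwaway module (the name demo_mod is made up) to stand in for __main__, so it behaves the same however the script is started, and it uses protocol 2 because that protocol's GLOBAL opcode shows the module and the name together.

```python
import pickle
import pickletools
import sys
import types

# Throwaway module standing in for __main__; "demo_mod" is a made-up name.
demo = types.ModuleType("demo_mod")
sys.modules["demo_mod"] = demo

# Define func inside that module, then add a second reference to the same object.
exec("def func(i):\n    return 'second'", demo.__dict__)
demo.func2 = demo.func

# Pickle the function through the func2 reference.
payload = pickle.dumps(demo.func2, protocol=2)

# Collect the argument of every GLOBAL opcode in the pickle stream.
global_refs = [arg for opcode, arg, pos in pickletools.genops(payload)
               if opcode.name == "GLOBAL"]
print(global_refs)
```

This should print ['demo_mod func']: the stream carries a module name and the label func, and neither the label func2 nor the function's code.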

To test my theory, I changed your code a bit by renaming the second func() to func2(), and here is what I got:

------ map-3
Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    task = get()
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
    return recv()
    return recv()
AttributeError: 'module' object has no attribute 'func2'
AttributeError: 'module' object has no attribute 'func2'

and then, after also changing func2 = func into func = func2 (so that func now refers to the function named func2), map-2 fails as well:

------ map-2
Process PoolWorker-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Process PoolWorker-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    task = get()
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
    return recv()
    return recv()
AttributeError: 'module' object has no attribute 'func2'
AttributeError: 'module' object has no attribute 'func2'

So I believe I'm onto something. It also shows where to read the code to understand what's going on in the child processes.

So that gives more clues for answering ② and ③!

To get further, I added a print statement within pool.py line 114:

    job, i, func, args, kwds = task
    print("XXX", os.getpid(), job, i, func, args, kwds)

to show what's going on. And we can see that func is resolved to 0x7f2d0238fcf8, which is the same address as in the parent process:

23432 parent
------ map-1
('XXX', 23433, 0, 0, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (0,)),), {})
23433 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
('XXX', 23434, 0, 1, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (1,)),), {})
23434 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
------ map-2
('XXX', 23433, 1, 0, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (0,)),), {})
23433 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
('XXX', 23434, 1, 1, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (1,)),), {})
23434 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
------ map-3
('XXX', 23433, 2, 0, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (0,)),), {})
23433 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
('XXX', 23434, 2, 1, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (1,)),), {})
23434 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
------ map-4
('XXX', 23438, 3, 0, <function mapstar at 0x7f2d02363230>, ((<function func at 0x1092e60>, (0,)),), {})
23438 second | <function func at 0x1092e60> | <function func at 0x1092e60>
('XXX', 23439, 3, 1, <function mapstar at 0x7f2d02363230>, ((<function func at 0x1092e60>, (1,)),), {})
23439 second | <function func at 0x1092e60> | <function func at 0x1092e60>
------ map-5
('XXX', 23438, 4, 0, <function mapstar at 0x7f2d02363230>, ((<function func at 0x1092e60>, (0,)),), {})
('XXX', 23439, 4, 1, <function mapstar at 0x7f2d02363230>, ((<function func at 0x1092e60>, (1,)),), {})
23438 second | <function func at 0x1092e60> | <function func at 0x1092e60>
23439 second | <function func at 0x1092e60> | <function func at 0x1092e60>

So to answer ④, we'd need to dig further in the multiprocessing sources, and even maybe within the pickle sources.

But I guess my feeling about the resolution is likely to be right… And then the only remaining question is why it resolves labels to addresses and back to labels again before pushing that to the child processes!


edit: I think I know why! As I was going to bed, the reason popped in my head, so I just went back to my keyboard:

When pickling the function, pickle takes the argument containing the function and gets its name from the function object itself. So even if you create a new function object, which gets a different address in memory:

>>> print(func)
<function func at 0x7fc6174e3ed8>

pickle doesn't care, because if the function is not already accessible in the child, it will never be made accessible. So pickle only resolves func.__name__:

>>> print("func.__name__:", func.__name__)
func.__name__: func
>>> print("func2.__name__:", func2.__name__)
func2.__name__: func
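
To make the distinction between a variable and a function's own name concrete, here is a minimal sketch. The helper name first_func is mine and is only there to keep the first definition reachable.

```python
def func(i):
    return "first"

first_func = func   # keep a handle on the first definition

def func(i):        # rebinds the variable func to a brand new function object
    return "second"

func2 = func        # a second variable pointing at that same new object

print(func2 is func)            # True: two labels, one object
print(func2.__name__)           # func: the name recorded by the def statement
print(first_func.__name__)      # func: the old object carries the same name
print((lambda i: i).__name__)   # <lambda>: assignment never renames a function
```

Both function objects answer to the internal name func, so a name alone cannot tell them apart.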

And so, even though you changed the function's body in the parent process and made a new reference to that function, what really gets pickled is the internal name of the function, which is set when the function is defined by its def statement (a lambda is always named <lambda>, whatever variable it is assigned to).

This explains why you get the old func function when you give func2 to the pool1 at the map-3 stage.
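
The whole map-3 round trip can be imitated in a single process. This is only a simulation under stated assumptions, not what multiprocessing literally does: a throwaway module (hypothetically named demo_mod) plays the role of __main__, and the worker's stale namespace is faked by putting the first func back before unpickling.

```python
import pickle
import sys
import types

# "demo_mod" is a hypothetical stand-in for the script's __main__ module.
demo = types.ModuleType("demo_mod")
sys.modules["demo_mod"] = demo

# State when pool1 forked its workers: only the first func exists.
exec("def func(i):\n    return 'first'", demo.__dict__)
worker_view_of_func = demo.func

# The parent then redefines func and adds the func2 reference.
exec("def func(i):\n    return 'second'", demo.__dict__)
demo.func2 = demo.func

# Parent side: pickling func2 records only the module and the name func.
payload = pickle.dumps(demo.func2)

# Simulated worker side: restore the namespace the worker was forked with.
demo.func = worker_view_of_func
del demo.func2

# Unpickling looks the name up again, in the "worker's" namespace.
resolved = pickle.loads(payload)
result = resolved(0)
print(result)
```

The parent pickled the second function via func2, yet the simulated worker ends up calling the first one, because all it received was the name func.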

So, in conclusion: for ①, map-3 does not find the name func2; it finds the name func inside the function object that func2 refers to. That also answers ② and ③: the func that the worker finds and executes is its own, original func, while a function whose own name is func2 cannot be found in a worker that was forked before func2 existed. And the mechanism, answering ④, is that the function's name (func.__name__) is what gets pickled and then resolved again in the other process.


Last update, from you:

In pickle._Pickler.save_global, it gets the name using

if name is None: name = getattr(obj, '__qualname__', None)

then again

if name is None: name = obj.__name__

So if the obj has no __qualname__ then __name__ will be used.

However, it will also check that the object being pickled is the same as the one that name resolves to in the module:

if obj2 is not obj: raise PicklingError(...) 

where obj2, parent = _getattribute(module, name).

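Yup, but note that this check runs in the parent process, at pickling time: it only verifies that looking the name up in the parent's module gives back the same object. What actually gets sent is just the (internal) name of the function, not the function itself, so the child process has no way of finding out whether its func() is the same as the parent's func() in memory.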
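
The identity check quoted above can be triggered on purpose. In this sketch (again using a made-up module name, demo_mod, in place of __main__) the first func is kept alive under the hypothetical name old_func while the module-level name func is rebound to a new object.

```python
import pickle
import sys
import types

# "demo_mod" is a hypothetical stand-in for the script's __main__ module.
demo = types.ModuleType("demo_mod")
sys.modules["demo_mod"] = demo

exec("def func(i):\n    return 'first'", demo.__dict__)
old_func = demo.func
exec("def func(i):\n    return 'second'", demo.__dict__)  # rebinds demo_mod.func

# old_func is still named func, but demo_mod.func is now a different object,
# so the lookup by name no longer gives back the object being pickled.
try:
    pickle.dumps(old_func)
    error = None
except pickle.PicklingError as exc:
    error = exc

print(type(error).__name__)
```

On Python 3 this reports a PicklingError, raised in the pickling process before anything is sent anywhere.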


Edit from @SyrtisMajor:

OK, let's change the first code above:

import os
from multiprocessing import Pool

print(os.getpid(), 'parent')

def func(i):
    print(os.getpid(), 'first', end=" | ")
    if 'func' in globals():
        print(globals()['func'], end=" | ")
    else:
        print("no func in globals", end=" | ")
    if 'func2' in globals():
        print(globals()['func2'])
    else:
        print("no func2 in globals")

print('------ map-1')
pool1 = Pool(2)
pool1.map(func, range(2))         #map-1

def func2(i):
    print(os.getpid(), 'second', end=" | ")
    if 'func' in globals():
        print(globals()['func'], end=" | ")
    else:
        print("no func in globals", end=" | ")
    if 'func2' in globals():
        print(globals()['func2'])
    else:
        print("no func2 in globals")

func2.__qualname__ = func.__qualname__   

func = func2

print('------ map-2')
pool1.map(func,  range(2))        #map-2
print('------ map-3')
pool1.map(func2,  range(2))       #map-3

pool2 = Pool(2)
print('------ map-4')
pool2.map(func,   range(2))       #map-4
print('------ map-5')
pool2.map(func2,  range(2))       #map-5

The output is as follows:

38130 parent
------ map-1
38131 first | <function func at 0x101856f28> | no func2 in globals
38132 first | <function func at 0x101856f28> | no func2 in globals
------ map-2
38131 first | <function func at 0x101856f28> | no func2 in globals
38132 first | <function func at 0x101856f28> | no func2 in globals
------ map-3
38131 first | <function func at 0x101856f28> | no func2 in globals
38132 first | <function func at 0x101856f28> | no func2 in globals
------ map-4
38133 second | <function func at 0x10339b510> | <function func at 0x10339b510>
38134 second | <function func at 0x10339b510> | <function func at 0x10339b510>
------ map-5
38133 second | <function func at 0x10339b510> | <function func at 0x10339b510>
38134 second | <function func at 0x10339b510> | <function func at 0x10339b510>

It's exactly the same as our first output. Note that the func = func2 after the definition of func2 is key, because pickle checks whether func2 (whose qualified name is now func) is the same object as __main__.func. If it is not, pickling fails.
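
The same trick can be inspected without a pool. The sketch below assumes Python 3 (where __qualname__ exists) and uses a made-up module, demo_mod, in place of __main__. It shows that the pickle stream names func even though the function's __name__ is func2.

```python
import pickle
import pickletools
import sys
import types

# "demo_mod" is a hypothetical stand-in for the script's __main__ module.
demo = types.ModuleType("demo_mod")
sys.modules["demo_mod"] = demo

exec("def func(i):\n    return 'first'", demo.__dict__)
exec("def func2(i):\n    return 'second'", demo.__dict__)

demo.func2.__qualname__ = "func"   # same trick as in the code above
demo.func = demo.func2             # needed so the identity check passes

payload = pickle.dumps(demo.func2, protocol=2)

# Collect the argument of every GLOBAL opcode in the pickle stream.
global_refs = [arg for opcode, arg, pos in pickletools.genops(payload)
               if opcode.name == "GLOBAL"]
print(demo.func2.__name__, global_refs)
```

This should print func2 ['demo_mod func'], which is consistent with __qualname__ taking precedence over __name__ when pickle picks the name to send.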

zmo answered Sep 07 '25 17:09