We know that multiprocessing.Pool must be initialized after the definitions of the functions that will run on it. However, I find the behavior of the code below puzzling:
import os
from multiprocessing import Pool
def func(i): print('first')
pool1 = Pool(2)
pool1.map(func, range(2)) #map-1
def func(i): print('second')
func2 = func
print('------')
pool1.map(func, range(2)) #map-2
pool1.map(func2, range(2)) #map-3
pool2 = Pool(2)
print('------')
pool2.map(func, range(2)) #map-4
pool2.map(func2, range(2)) #map-5
The output (python2.7 and python3.4 on linux) is
first #map-1
first
------
first #map-2
first
first #map-3
first
------
second #map-4
second
second #map-5
second
map-2 prints 'first', just as we expected.
But how does map-3 find the name func2? I mean, pool1 is initialized before func2's first occurrence. So func2 = func is indeed executed, while def func(i): print('second') is not. Why?
And if I define func2 directly by
def func2(i): print('second')
then map-3 won't find the name func2, as mentioned in many posts, e.g. this one. What's the difference between the two cases?
As I understand it, the arguments are passed to the worker processes by pickling, but how does the pool pass the called function to the other processes? Or how do the subprocesses find the called function?
tl;dr: map-3 calls the first func, when one would expect the second, because Pool.map() pickles the function by name. pickle stores func.__name__, which is still func even though the function is passed through the func2 reference. That name is sent to the child process, which looks up func in its own namespace.
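As a minimal sketch of that claim (assuming Python 3 and that the snippet is run as a script), you can inspect what pickle actually produces for the aliased function. The function body here is simplified for illustration:

```python
import pickle

def func(i):
    return 'first'

func2 = func  # a second name bound to the same function object

payload = pickle.dumps(func2)

# The payload holds the module name and the function's own name, not its code.
print(func2.__name__)         # func
print(b'func' in payload)     # True
print(b'func2' in payload)    # False
```

The alias func2 never appears in the payload, so the receiving process has nothing to look up except func.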
OK, so I count four different questions, listed below. I'll assume you already know about namespaces and forking processes, so we can get straight to the fun part of your question ☺
① But how does map-3 find the name func2?
② So func2 = func is indeed executed, while def func(i): print('second') is not. Why?
③ Then map-3 won't find name func2 as mentioned by many posts, eg. this one. What's the difference between two cases?
④ As I understand the arguments are passed to the slave processes by pickling, but how does pool pass the called function to other processes? Or how do sub-processes find the called function?
So I've added a bit more code, to show off more of the internals:
import os
from multiprocessing import Pool
print(os.getpid(), 'parent')
def func(i):
    print(os.getpid(), 'first', end=" | ")
    if 'func' in globals():
        print(globals()['func'], end=" | ")
    else:
        print("no func in globals", end=" | ")
    if 'func2' in globals():
        print(globals()['func2'])
    else:
        print("no func2 in globals")
print('------ map-1')
pool1 = Pool(2)
pool1.map(func, range(2)) #map-1
def func(i):
    print(os.getpid(), 'second', end=" | ")
    if 'func' in globals():
        print(globals()['func'], end=" | ")
    else:
        print("no func in globals", end=" | ")
    if 'func2' in globals():
        print(globals()['func2'])
    else:
        print("no func2 in globals")
func2 = func
print('------ map-2')
pool1.map(func, range(2)) #map-2
print('------ map-3')
pool1.map(func2, range(2)) #map-3
pool2 = Pool(2)
print('------ map-4')
pool2.map(func, range(2)) #map-4
print('------ map-5')
pool2.map(func2, range(2)) #map-5
which outputs on my system:
21512 parent
------ map-1
21513 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
21514 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
------ map-2
21513 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
21514 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
------ map-3
21513 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
21514 first | <function func at 0x7f62d67f7cf8> | no func2 in globals
------ map-4
21518 second | <function func at 0x7f62d531bed8> | <function func at 0x7f62d531bed8>
21519 second | <function func at 0x7f62d531bed8> | <function func at 0x7f62d531bed8>
------ map-5
21518 second | <function func at 0x7f62d531bed8> | <function func at 0x7f62d531bed8>
21519 second | <function func at 0x7f62d531bed8> | <function func at 0x7f62d531bed8>
So we can see that for pool1, func2 is never added to the namespace. Something fishy is definitely going on there, and it's too late in the evening for me to dig thoroughly through the multiprocessing source and the debugger to understand what's going on.
So if I had to guess an answer to ①: the pickle module somehow finds out that func2 resolves to 0x7f62d531bed8, which already exists under the label func, so it pickles that already known label func. On the child side, the label resolves to 0x7f62d67f7cf8, i.e.:
func2 → 0x7f62d531bed8 → func → [PICKLE] → globals()['func'] → 0x7f62d67f7cf8
To test my theory, I changed your code a bit, renaming the second func() to func2(), and here is what I got:
------ map-3
Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
self._target(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
task = get()
task = get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
return recv()
return recv()
AttributeError: 'module' object has no attribute 'func2'
AttributeError: 'module' object has no attribute 'func2'
and then also changing func = func2 into func2 = func:
------ map-2
Process PoolWorker-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Process PoolWorker-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
self._target(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
task = get()
task = get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
return recv()
return recv()
AttributeError: 'module' object has no attribute 'func2'
AttributeError: 'module' object has no attribute 'func2'
So I believe I'm onto something. It also shows where to read the code to understand what's going on on the child-process side.
So that gives more clues to answer ② and ③!
To get further, I added a print statement in pool.py at line 114:
job, i, func, args, kwds = task
print("XXX", os.getpid(), job, i, func, args, kwds)
to show what's going on. We can see that func is resolved to 0x7f2d0238fcf8, which is the same address as in the parent process:
23432 parent
------ map-1
('XXX', 23433, 0, 0, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (0,)),), {})
23433 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
('XXX', 23434, 0, 1, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (1,)),), {})
23434 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
------ map-2
('XXX', 23433, 1, 0, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (0,)),), {})
23433 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
('XXX', 23434, 1, 1, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (1,)),), {})
23434 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
------ map-3
('XXX', 23433, 2, 0, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (0,)),), {})
23433 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
('XXX', 23434, 2, 1, <function mapstar at 0x7f2d02363230>, ((<function func at 0x7f2d0238fcf8>, (1,)),), {})
23434 first | <function func at 0x7f2d0238fcf8> | no func2 in globals
------ map-4
('XXX', 23438, 3, 0, <function mapstar at 0x7f2d02363230>, ((<function func at 0x1092e60>, (0,)),), {})
23438 second | <function func at 0x1092e60> | <function func at 0x1092e60>
('XXX', 23439, 3, 1, <function mapstar at 0x7f2d02363230>, ((<function func at 0x1092e60>, (1,)),), {})
23439 second | <function func at 0x1092e60> | <function func at 0x1092e60>
------ map-5
('XXX', 23438, 4, 0, <function mapstar at 0x7f2d02363230>, ((<function func at 0x1092e60>, (0,)),), {})
('XXX', 23439, 4, 1, <function mapstar at 0x7f2d02363230>, ((<function func at 0x1092e60>, (1,)),), {})
23438 second | <function func at 0x1092e60> | <function func at 0x1092e60>
23439 second | <function func at 0x1092e60> | <function func at 0x1092e60>
So to answer ④, we'd need to dig further into the multiprocessing sources, and maybe even into the pickle sources.
But I guess my feeling about the resolution is likely to be right… Then the only remaining question is why it resolves labels to addresses and back to labels again before pushing that to the child processes!
edit: I think I know why! As I was going to bed, the reason popped into my head, so I went back to my keyboard:
When pickling the function, pickle takes the argument containing the function and gets its name from the function object itself.
So even if you create a new function object, which gets a different address in memory:
>>> print(func)
<function func at 0x7fc6174e3ed8>
pickle doesn't care, because if the function is not already accessible by the child, it will never be made accessible. So pickle only resolves func.__name__:
>>> print("func.__name__:", func.__name__)
func.__name__: func
>>> print("func2.__name__:", func2.__name__)
func2.__name__: func
and then, even though you changed the function's body in the parent process and made a new reference to that function, what really gets pickled is the internal name of the function, which is set by the def statement when the function is defined (a lambda always gets the name <lambda>, whatever it is assigned to).
This explains why you get the old func function when you give func2 to pool1 at the map-3 stage.
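A quick check of where the internal name comes from, as a small sketch assuming Python 3. The names alias and anon are invented for this example:

```python
def func(i):
    return i

alias = func                 # a new reference, not a new function
anon = lambda i: i           # assigned to a name, but defined without one

print(alias.__name__)        # func
print(anon.__name__)         # <lambda>
```

Binding a function to another variable never changes its __name__, which is why the reference func2 is invisible to pickle.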
So, in conclusion, for ①: map-3 does not find the name func2; it finds the name func inside the function referred to by func2. That also answers ② and ③, because the func found in the child is the original func function. And the mechanism is that func.__name__ is used to pickle the function and resolve its name between the two processes, which answers ④.
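The whole round trip can be imitated in a single process, without any pool. This is only a sketch under stated assumptions: it uses Python 3, a throwaway module named demo_main (an invented name) stands in for the script's namespace, and the "worker" is simulated by restoring the first func before unpickling.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the script's module; "demo_main" is an invented name.
mod = types.ModuleType("demo_main")
sys.modules["demo_main"] = mod

def define(source):
    """Execute source code in the stand-in module's namespace."""
    exec(source, mod.__dict__)

define("def func(i):\n    return 'first'\n")
first_func = mod.func              # what a pool1 worker inherited at fork time

define("def func(i):\n    return 'second'\n")
mod.func2 = mod.func               # parent-side alias, like func2 = func

payload = pickle.dumps(mod.func2)  # stores the names "demo_main" and "func"

# Pretend to be a pool1 worker: its namespace still holds the first func
# and has no func2 at all.
mod.func = first_func
del mod.func2

resolved = pickle.loads(payload)   # looks up "func" in the stale namespace
print(resolved(0))                 # first
```

Unpickling succeeds even though func2 no longer exists, and it returns the first function, which matches what map-3 printed.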
Last update, from you:
In pickle._Pickler.save_global, it gets the name using
if name is None: name = getattr(obj, '__qualname__', None)
then again
if name is None: name = obj.__name__.
So if the obj has no __qualname__, then __name__ will be used.
However, it will check whether the object passed is the same as the one in the subprocess:
if obj2 is not obj: raise PicklingError(...)
where obj2, parent = _getattribute(module, name).
Yup, but remember that the object passed is just the (internal) name of the function, not the function itself. The child process has no way of finding out whether its func() is the same as the parent's func() in memory.
Edit from @SyrtisMajor:
OK, let's change the first code above:
import os
from multiprocessing import Pool
print(os.getpid(), 'parent')
def func(i):
    print(os.getpid(), 'first', end=" | ")
    if 'func' in globals():
        print(globals()['func'], end=" | ")
    else:
        print("no func in globals", end=" | ")
    if 'func2' in globals():
        print(globals()['func2'])
    else:
        print("no func2 in globals")
print('------ map-1')
pool1 = Pool(2)
pool1.map(func, range(2)) #map-1
def func2(i):
    print(os.getpid(), 'second', end=" | ")
    if 'func' in globals():
        print(globals()['func'], end=" | ")
    else:
        print("no func in globals", end=" | ")
    if 'func2' in globals():
        print(globals()['func2'])
    else:
        print("no func2 in globals")
func2.__qualname__ = func.__qualname__
func = func2
print('------ map-2')
pool1.map(func, range(2)) #map-2
print('------ map-3')
pool1.map(func2, range(2)) #map-3
pool2 = Pool(2)
print('------ map-4')
pool2.map(func, range(2)) #map-4
print('------ map-5')
pool2.map(func2, range(2)) #map-5
The output is as follows:
38130 parent
------ map-1
38131 first | <function func at 0x101856f28> | no func2 in globals
38132 first | <function func at 0x101856f28> | no func2 in globals
------ map-2
38131 first | <function func at 0x101856f28> | no func2 in globals
38132 first | <function func at 0x101856f28> | no func2 in globals
------ map-3
38131 first | <function func at 0x101856f28> | no func2 in globals
38132 first | <function func at 0x101856f28> | no func2 in globals
------ map-4
38133 second | <function func at 0x10339b510> | <function func at 0x10339b510>
38134 second | <function func at 0x10339b510> | <function func at 0x10339b510>
------ map-5
38133 second | <function func at 0x10339b510> | <function func at 0x10339b510>
38134 second | <function func at 0x10339b510> | <function func at 0x10339b510>
It's exactly the same as our first output. Note that func = func2 after the definition of func2 is key, as pickle will check whether func2 (with name func) is the same object as __main__.func. If not, pickling will fail.
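The failure case can be reproduced without a pool. A minimal sketch, assuming Python 3 and simplified function bodies; the variable rejected is introduced only for this example:

```python
import pickle

def func(i):
    return 'first'

def func2(i):
    return 'second'

# Make func2 claim to be "func", as in the edited script above.
func2.__qualname__ = func.__qualname__

try:
    pickle.dumps(func2)
    rejected = False
except pickle.PicklingError:
    # The name "func" in this module still points at the first function.
    rejected = True

func = func2                      # now the name and the object agree
payload = pickle.dumps(func2)     # succeeds, stored under the name "func"
print(rejected, b'func2' in payload)
```

Before the rebinding, pickle refuses the function because the name it would store resolves to a different object. After func = func2 the check passes, and only the name func is stored.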