It looks like UMAP training completely hangs when it is run inside a multiprocessing.Process. Minimal example on Python 3.8.5:
import umap
import multiprocessing
import numpy as np
import sys
import time

def train_model(q=None):
    embeddings = np.random.rand(100, 512)
    reducer = umap.UMAP()
    print("Got reducer, about to start training")
    sys.stdout.flush()
    if not q:
        # No queue given: run in-process and return the result directly.
        return reducer.fit_transform(embeddings)
    print("outputting to q")
    q.put(reducer.fit_transform(embeddings))
    print("output to q")

# Run in the main process: works fine.
start = time.time()
model_output = train_model()
print('normal took: ', time.time() - start)
print('got: ', model_output)

# Run in a child process: hangs inside fit_transform.
start = time.time()
q = multiprocessing.Queue()
p = multiprocessing.Process(target=train_model, args=(q,), daemon=True)
p.start()
model_output = q.get()
print('multi took: ', time.time() - start)
print('got: ', model_output)
This results in the following output:
(env) amol@amol-small:~/code/soot/api-server/src$ python umap_multiprocessing_test.py
2021-06-24 16:09:46.233186: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-24 16:09:46.233212: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Got reducer, about to start training
normal took: 7.140857934951782
got: [[ 5.585276 10.613853 ]
[ 3.6862304 8.075892 ]
[ 4.7457848 8.287621 ]
[ 3.1373663 9.443794 ]
[ 3.3923576 8.651798 ]
[ 5.8636594 10.131909 ]
[ 3.6680114 11.535476 ]
[ 1.924135 9.987121 ]
[ 4.9095764 8.643579 ]
...
[ 4.6614685 9.943193 ]
[ 3.5867712 10.872507 ]
[ 4.8476524 10.628259 ]]
Got reducer, about to start training
outputting to q
after which I have to Ctrl-C because nothing happens.
I posted an issue on the umap GitHub repo, but unfortunately the maintainer didn't have an answer: https://github.com/lmcinnes/umap/issues/707#issuecomment-867964379
After digging deep into the umap library, I found that the hang was happening inside a numba prange call. On a hunch, I started changing the numba threading layer. The default workqueue layer was the one that hung, so I switched to 'omp':
numba.config.THREADING_LAYER = 'omp'
That didn't work either, but it DID throw an actual error:
Terminating: fork() called from a process already using GNU OpenMP, this is unsafe.
Then I tried tbb (which was annoying to install, see: https://github.com/numba/numba/issues/7148), and that made everything work.
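For reference, here is roughly what the working setup looks like. This is only a minimal sketch, assuming the tbb package is installed and visible to numba; the important detail is that THREADING_LAYER has to be set before the first numba-parallel function actually runs (i.e. before the first fit_transform call):

import numba
# Select the tbb threading layer before any numba-parallel code executes.
numba.config.THREADING_LAYER = 'tbb'

import multiprocessing
import numpy as np
import umap

def train_model(q):
    embeddings = np.random.rand(100, 512)
    q.put(umap.UMAP().fit_transform(embeddings))

if __name__ == '__main__':
    # Run once in the parent, then again in a forked child -- with tbb
    # both runs complete instead of the child hanging.
    print(umap.UMAP().fit_transform(np.random.rand(100, 512)).shape)

    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=train_model, args=(q,), daemon=True)
    p.start()
    print(q.get().shape)
    p.join()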
This is... a bit unsatisfying. I still have no idea why it failed, and the whole thing seems kind of brittle. But it solves the problem I have, so I suppose it is sufficient for now.
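The fork() error does hint at another possible workaround that I haven't explored as far: using the 'spawn' start method, so the child gets a fresh interpreter instead of inheriting the parent's OpenMP/numba threading state through fork. A rough, untested sketch of that approach (note that with 'spawn' the target function and its arguments must be importable/picklable):

import multiprocessing
import numpy as np
import umap

def train_model(q):
    embeddings = np.random.rand(100, 512)
    q.put(umap.UMAP().fit_transform(embeddings))

if __name__ == '__main__':
    # 'spawn' starts a brand-new Python process rather than fork()ing the
    # current one, so no threading state is inherited from the parent.
    ctx = multiprocessing.get_context('spawn')
    q = ctx.Queue()
    p = ctx.Process(target=train_model, args=(q,), daemon=True)
    p.start()
    print(q.get().shape)
    p.join()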