Training a Python UMAP model hangs in a multiprocessing.Process

It looks like UMAP training hangs completely when it is run inside a multiprocessing.Process. Minimal example on Python 3.8.5:

import umap
import multiprocessing
import numpy as np
import sys
import time


def train_model(q=None):
    embeddings = np.random.rand(100, 512)
    reducer = umap.UMAP()
    print("Got reducer, about to start training")
    sys.stdout.flush()
    if not q:
        return reducer.fit_transform(embeddings)
    print("outputting to q")
    q.put(reducer.fit_transform(embeddings))
    print("output to q")


# First run: train in the main process. This works fine.
start = time.time()
model_output = train_model()
print('normal took: ', time.time() - start)
print('got: ', model_output)

# Second run: train in a child process. This hangs inside fit_transform.
start = time.time()
q = multiprocessing.Queue()
p = multiprocessing.Process(target=train_model, args=(q,), daemon=True)
p.start()
model_output = q.get()
print('multi took: ', time.time() - start)
print('got: ', model_output)

This results in the following output:

(env) amol@amol-small:~/code/soot/api-server/src$ python umap_multiprocessing_test.py
2021-06-24 16:09:46.233186: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-24 16:09:46.233212: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Got reducer, about to start training
normal took:  7.140857934951782
got:  [[ 5.585276  10.613853 ]
 [ 3.6862304  8.075892 ]
 [ 4.7457848  8.287621 ]
 [ 3.1373663  9.443794 ]
 [ 3.3923576  8.651798 ]
 [ 5.8636594 10.131909 ]
 [ 3.6680114 11.535476 ]
 [ 1.924135   9.987121 ]
 [ 4.9095764  8.643579 ]
 ...
 [ 4.6614685  9.943193 ]
 [ 3.5867712 10.872507 ]
 [ 4.8476524 10.628259 ]]
Got reducer, about to start training
outputting to q

after which I have to Ctrl-C because nothing ever happens.

I posted an issue on the umap GitHub, but unfortunately the maintainer didn't have anything to suggest: https://github.com/lmcinnes/umap/issues/707#issuecomment-867964379

asked Jan 18 '26 by theahura

1 Answer

After digging deep into the umap library, I found that the hang was being caused by a numba prange call. On a hunch, I started changing the numba threading layer. I was using the default workqueue layer to start, which just hung. I switched to 'omp':

import numba
numba.config.THREADING_LAYER = 'omp'

That didn't work either, but it DID throw an actual error:

Terminating: fork() called from a process already using GNU OpenMP, this is unsafe.

Then I tried tbb (which was annoying to install, see: https://github.com/numba/numba/issues/7148), and that made everything work.
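
For completeness, here is roughly what that fix looks like dropped into the repro script (a sketch on my part, not the exact script I ran). As far as I can tell, numba picks the threading layer the first time a parallel function runs, so the config line has to go before any UMAP training happens:

# Sketch of the working setup; assumes the tbb package is installed.
import numba
numba.config.THREADING_LAYER = 'tbb'  # select the layer before anything parallel runs

import umap
import multiprocessing
import numpy as np


def train_model(q):
    embeddings = np.random.rand(100, 512)
    reducer = umap.UMAP()
    q.put(reducer.fit_transform(embeddings))


if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=train_model, args=(q,), daemon=True)
    p.start()
    print(q.get())  # no longer hangs once the tbb layer is active
    p.join()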

This is... a bit unsatisfactory. I still have no idea why it fails, and the whole thing seems kind of brittle. But it solves the problem I have, so I suppose it's sufficient for now.
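
For what it's worth, the OpenMP error above does hint at what is going on: by the time the child process is forked, the parent has already trained a model, so numba's threading layer has already spun up worker threads, and fork()ing a process that has live threads is a classic recipe for deadlocks. Based purely on that reasoning (I haven't gone back and verified this route), the 'spawn' start method might also sidestep the problem, since the child gets a fresh interpreter instead of a fork of the thread-carrying parent:

# Untested sketch: with 'spawn', the child re-imports the main module, so the
# driver code must live under the __name__ guard or it will run again on import.
import multiprocessing

if __name__ == '__main__':
    ctx = multiprocessing.get_context('spawn')
    q = ctx.Queue()
    p = ctx.Process(target=train_model, args=(q,), daemon=True)
    p.start()
    model_output = q.get()
    p.join()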

answered Jan 20 '26 by theahura

