I have many small tasks to do in a for loop, and I want to use concurrency to speed it up. I chose joblib because it is easy to integrate. However, I found that using joblib makes my program run much slower than a plain for loop. Here is the demo code:
import time
import random
from os import path
import tempfile
import numpy as np
import gc
from joblib import Parallel, delayed, load, dump
def func(a, i):
    '''a simple task for demonstration'''
    a[i] = random.random()
def memmap(a):
    '''use memory mapping to prevent memory allocation for each worker'''
    tmp_dir = tempfile.mkdtemp()
    mmap_fn = path.join(tmp_dir, 'a.mmap')
    print 'mmap file:', mmap_fn
    _ = dump(a, mmap_fn)        # dump
    a_mmap = load(mmap_fn, 'r+') # load
    del a
    gc.collect()
    return a_mmap
if __name__ == '__main__':
    N = 10000
    a = np.zeros(N)
    # memory mapping
    a = memmap(a)
    # parfor
    t0 = time.time()
    Parallel(n_jobs=4)(delayed(func)(a, i) for i in xrange(N))
    t1 = time.time()-t0
    # for 
    t0 = time.time()
    [func(a, i) for i in xrange(N)]
    t2 = time.time()-t0  
    # joblib time vs for time
    print t1, t2
On my laptop (i5-2520M CPU, 4 cores, Win7 64-bit), the running time is 6.464 s with joblib and 0.004 s with the plain for loop.
I passed the array as a memory map to avoid the overhead of copying it to each worker. I've also read this related post, but it didn't solve my problem. Why does this happen? Did I miss something about how to use joblib correctly?
"Many small tasks" are not a good fit for joblib. The coarser the task granularity, the less overhead joblib causes and the more benefit you will have from it. With tiny tasks, the cost of setting up worker processes and communicating data to them will outweigh any any benefit from parallelization.