Multiprocessing Pool gets successively slower after multiple calls

I want to iteratively train 1000 random forests on a dataset. To speed things up, I'm trying to utilize multiple cores during the iterative training loop. A working example is below:

from sklearn.ensemble import RandomForestClassifier
from multiprocessing import Pool,cpu_count
import numpy as np
import pandas as pd
from time import time

n = 2000
ndims = 5000

X = pd.DataFrame(np.random.normal(0,1,n*ndims).reshape((n,ndims)))
y = pd.Series(np.random.choice([0, 1], size=(n)))


def draw_batches(n, size=100):
    # Yield batch sizes (each equal to `size`, plus a smaller remainder
    # batch when `size` does not divide `n` evenly) that sum to n.
    steps = np.arange(0, n, size)
    if not n % size == 0:
        steps = np.append(steps, n % size)[1:]
    for step in steps:
        if not step % size == 0:
            yield step
        else:
            yield size


def pool(method,iters):
    output = []
    p = Pool(4)
    try:
        output = p.map(method,iters)
    except Exception as e:
        print(e)
        pass
    finally:
        p.close()
        p.join()
        del p
    return output


def importances(args):
    model, i = args
    y_ = y.copy()
    model.fit(X,y_)
    return model.feature_importances_

n_iters = 100
model_cls = RandomForestClassifier

for batch in draw_batches(n_iters,4):
    print(batch)
    t = time()
    train_args = [(model_cls(n_estimators=50),i) for i in np.arange(batch)]
    imps = pool(importances,train_args)
    print((time()-t)/batch)

Though not as pronounced as in the code I'm actually working on, the above shows that the processing time per model gradually increases the more batches you run. I wouldn't expect this, since the pool processing is all self-contained and everything is deleted at the end of each run.

What is causing the slowdown?

asked Nov 25 '25 by Jed

1 Answer

I cannot duplicate a prolonged and continuous increase with the code provided. Nevertheless, I thought I would present my findings in case they helped.

Overall time

Piping your script into feedgnuplot gives me this runtime (I've removed the division by batch, since the batches are constant in this example, and I wanted to compare time spent in other parts of the code).
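For reference, this is roughly the timing loop I piped in (a sketch, assuming the rest of your script is unchanged): it just prints the raw time per batch, one number per line, which is all feedgnuplot needs.

# Print one raw batch time per line (no division by batch size), so stdout
# can be piped straight into feedgnuplot, e.g.
#   python script.py | feedgnuplot --lines --points
for batch in draw_batches(n_iters, 4):
    t = time()
    train_args = [(model_cls(n_estimators=50), i) for i in np.arange(batch)]
    imps = pool(importances, train_args)
    print(time() - t)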

[figure: overall time]

I apologise for the lack of labelling. Time in seconds is on the y-axis, and the x-axis is just batches (hence going to 100/4 = 25). As you can see, there is a lot of variation, but no sustained increase. (I tried this many times with different parameters, but I can't reproduce the sustained increase, so I'm giving up.) At no point did I run out of RAM or CPU (I have 6 cores, so load was fine).

Breaking it down

Well: perhaps if I can't duplicate the problem I can still figure out what might be going on? After all, that's a nice juicy bounty... Here is a horribly unlabelled graph (sorry!):

[figure: stages]

I was too lazy to feed line labels in, and I don't know how to add them once I've run gnuplot. This was produced by modifying your pool function like this:

def pool(method, iters):
    start = time()
    output = []
    p = Pool(4)
    pooled = time()                  # pool created
    try:
        output = p.map(method, iters)
        mapped = time()              # map finished
    except Exception as e:
        print(e)
        mapped = time()              # keep the return below valid if map raised
    finally:
        p.close()
        p.join()
        del p
        ended = time()               # pool closed and joined
    return output, start, pooled, mapped, ended

and then outputting the total processing time, the time spent starting the pool, the time spent running the map, and the time spent closing the pool:

imps, start, pooled, mapped, ended = pool(importances, train_args)
processing = time() - t
print(processing, pooled - start, mapped - pooled, ended - mapped)

Now (this is why we label things!): the blue line is total time, plotted against the left-hand y-axis. The orange line is pool execution time, against the same axis. The purple line is the time spent starting the pool, against the right-hand y-axis, and the green line is the time spent closing the pool, also against the right-hand y-axis.

What we can see from this is that whilst time running the pool dominates, time closing the pool is proportional to time running the pool. I do not know why this would be: my first guess is garbage collection.
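One way to probe that guess (my own sketch, not something I ran for the plots above) is to time an explicit collection inside the teardown, so garbage collection shows up as its own number:

import gc

def pool(method, iters):
    output = []
    p = Pool(4)
    try:
        output = p.map(method, iters)
    except Exception as e:
        print(e)
    finally:
        p.close()
        p.join()
        del p
        gc_start = time()
        unreachable = gc.collect()  # force a full collection so its cost is measured separately
        print("gc: {} unreachable objects, {:.3f}s".format(unreachable, time() - gc_start))
    return output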

Another test, to run on your end:

What happens if we use one pool? Does execution slow down? If I do this, I get:

[figure: pipeline]

I am uncertain exactly what causes the dropoff at the end, but it is reproducible. This was produced with:

def importances(args):
    start = time()
    model, i = args
    y_ = y.copy()
    model.fit(X, y_)
    return model.feature_importances_, time() - start

n_iters = 100
model_cls = RandomForestClassifier
p = Pool(4)

args = [(model_cls(n_estimators=50), i) for i in np.arange(n_iters)]
outputs = p.map(importances, args)
for _, runtime in outputs:
    print(runtime)

This run showed a slight reduction in execution time per result, but that is purely random; the increase in runtime per function I attribute to the use of time().
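If you want to sanity-check how much the extra time() calls themselves cost, a quick sketch (my own, not part of the runs above) using timeit:

from timeit import timeit

# Average cost of a single time() call; compare it against the seconds
# each forest fit takes.
per_call = timeit("time()", setup="from time import time", number=1000000) / 1000000
print(per_call)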

Overall, you are better off not batching, since the pool can start a new run as soon as one is finished (but you likely knew that); a sketch of a streaming, unbatched variant follows below. I would be interested to know whether this also slows down in your environment: if so, it would disprove the hypothesis that closing the pool is causing the problem.
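A sketch of what I mean (assuming the original importances from the question, which returns only the feature importances, and the X, y and model_cls defined earlier): one persistent pool, no batching, with imap_unordered so each result comes back as soon as its fit finishes.

p = Pool(4)
train_args = [(model_cls(n_estimators=50), i) for i in range(n_iters)]

t = time()
imps = []
for imp in p.imap_unordered(importances, train_args, chunksize=1):
    imps.append(imp)
    print((time() - t) / len(imps))  # running average time per model

p.close()
p.join()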

Conclusion

I do not consider myself to have answered this question :( But perhaps we are a little closer to an answer: I speculate that on a machine and under conditions where the time running the pool starts to blow up, the time closing the pool will blow up even faster. Presumably with larger numbers you will start to fragment RAM, or even swap, and something like this is causing the increase. (Just deleting the final object does not always prevent fragmentation, particularly if the intermediate steps have used a lot of RAM.)

But since I cannot duplicate, I cannot prove anything.
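If you want to test the memory theory on your end, here is a small sketch (assuming psutil is installed; it is an extra dependency not used in the code above) that reports the parent process's resident memory after each batch:

import os
import psutil

proc = psutil.Process(os.getpid())

for batch in draw_batches(n_iters, 4):
    t = time()
    train_args = [(model_cls(n_estimators=50), i) for i in np.arange(batch)]
    imps = pool(importances, train_args)
    # If RSS keeps climbing across batches, something is surviving the
    # close/join/del of each pool (or the heap is fragmenting).
    print((time() - t) / batch, proc.memory_info().rss / 1e6, "MB")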

Assumptions

  • time() is atomic (it isn't)
  • alright, then time() takes the same amount of time every time
answered Nov 27 '25 by 2e0byo


