Python's multiprocessing.Pool.imap is very convenient for processing large files line by line:
import multiprocessing

def process(line):
    processor = Processor('some-big.model')  # this takes time to load...
    return processor.process(line)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    with open('lines.txt') as infile, open('processed-lines.txt', 'w') as outfile:
        for processed_line in pool.imap(process, infile):
            outfile.write(processed_line)
How can I make sure that helpers such as Processor in the example above are loaded only once? Is this possible at all without resorting to a more complicated/verbose structure involving queues?
multiprocessing.Pool allows for resource initialisation via its initializer and initargs parameters. I was surprised to learn that the idea is to make use of global variables, as illustrated below:
import multiprocessing as mp

def init_process(model):
    global processor
    processor = Processor(model)  # this takes time to load...

def process(line):
    return processor.process(line)  # via the global `processor` defined in `init_process`

if __name__ == '__main__':
    pool = mp.Pool(4, initializer=init_process, initargs=('some-big.model',))
    with open('lines.txt') as infile, open('processed-lines.txt', 'w') as outfile:
        for processed_line in pool.imap(process, infile):
            outfile.write(processed_line)
Note that the global is per worker process: each of the four workers calls init_process exactly once when it starts, so the model is loaded four times in total, once per worker rather than once per line. The concept isn't described in much detail in multiprocessing.Pool's documentation, so I hope this example will be helpful to others.
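For anyone who wants to verify the behaviour end to end, here is a minimal self-contained sketch of the same pattern; DummyProcessor is a made-up stand-in for the expensive Processor above, and the printed PIDs simply make the once-per-worker loading visible:

import multiprocessing as mp
import os
import time

class DummyProcessor:
    """Stand-in for the expensive-to-load Processor."""
    def __init__(self, model):
        print(f'loading {model} in worker {os.getpid()}')
        time.sleep(1)  # simulate a slow model load

    def process(self, line):
        return line.upper()

def init_process(model):
    global processor
    processor = DummyProcessor(model)  # runs once per worker, at startup

def process(line):
    return processor.process(line)  # reuses the worker's already-loaded instance

if __name__ == '__main__':
    with mp.Pool(4, initializer=init_process, initargs=('some-big.model',)) as pool:
        for result in pool.imap(process, ['a\n', 'b\n', 'c\n']):
            print(result, end='')

Running it prints one 'loading ...' line per worker (four in total), not one per input line. The same initializer/initargs pattern also works with concurrent.futures.ProcessPoolExecutor from Python 3.7 onwards.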