
How to fix slow performance on large datasets with spaCy (nlp.pipe) for preprocessing

I'm having an issue with spaCy 2.1 where it takes a very long time to preprocess texts in English and German so that I can use them in a machine translation-related project. After a brief cleaning pass using regex, I use spaCy's nlp.pipe() function to carry out a few processes (lemmatization, tagging each word with its part of speech, and splitting German compound words [using a component I developed myself]), but it is taking a long time, and I'm wondering whether there's a better approach that would speed things up.

The dataset I am using is very large: it consists of Project Gutenberg ebooks in English and German, a selection of news articles in both languages, and the entire Wikipedia database for both languages. I am running this code on my college's HPC grid, where I can allocate up to 40 CPU cores and 250GB RAM to each job, or a selection of NVIDIA GPUs up to an RTX 2080 Ti. No matter which combination I try, it seems to take multiple days to get past even the lemmatization stage for each text.

I've tried using joblib to speed things up by making better use of multiple cores, and multiprocessing to do much the same thing. Neither seems to have much of an effect. I have also tried tweaking the batch size, to no avail.


    # NLP_EN / NLP_DE are the loaded spaCy models; cleaning(), tag_pos() and
    # split_compound_pipe() are my custom per-Doc processing functions.
    clean_en_texts_step1 = [cleaning(doc) for doc in NLP_EN.pipe(en_texts, batch_size=100)]
    clean_en_texts = [tag_pos(doc) for doc in NLP_EN.pipe(clean_en_texts_step1, batch_size=100)]
    clean_de_texts_step1 = [cleaning(doc) for doc in NLP_DE.pipe(de_texts, batch_size=100)]
    compound_split = [split_compound_pipe(doc) for doc in NLP_DE.pipe(clean_de_texts_step1, batch_size=100)]
    clean_de_texts = [tag_pos(doc) for doc in NLP_DE.pipe(compound_split, batch_size=100)]

I would expect the pipes to be much quicker than they currently are (instead of taking multiple days just to complete the first step).

asked Dec 05 '25 by jackyboy633


1 Answer

I advise using multiprocessing on top of nlp.pipe, as presented here: https://spacy.io/usage/examples#multi-processing. Unfortunately, due to Python's GIL and its multithreading problems, the n_threads argument of nlp.pipe(n_threads=xxx) is deprecated (https://spacy.io/api/language#pipe).
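
Below is a minimal sketch of that joblib pattern from the linked spaCy example, adapted to the first step of the English pipeline in the question. It assumes cleaning() is the asker's own module-level helper and that it returns something picklable; the model name, n_jobs=16, and the chunk size of 1000 are illustrative values, not recommendations.

    from functools import partial
    from joblib import Parallel, delayed
    import spacy
    from spacy.util import minibatch

    def process_chunk(nlp, batch):
        # Each worker process runs nlp.pipe on its own chunk of texts and
        # applies the custom cleaning() step to every resulting Doc.
        return [cleaning(doc) for doc in nlp.pipe(batch, batch_size=100)]

    NLP_EN = spacy.load("en_core_web_md")  # whichever English model you already use

    # Split the corpus into chunks so each worker process gets its own share.
    partitions = minibatch(en_texts, size=1000)
    executor = Parallel(n_jobs=16, backend="multiprocessing", prefer="processes")
    do = delayed(partial(process_chunk, NLP_EN))
    results = executor(do(batch) for batch in partitions)

    # Flatten the per-chunk results back into a single list.
    clean_en_texts_step1 = [item for chunk in results for item in chunk]

The same pattern can be repeated for the German model and for the later tag_pos / split_compound_pipe steps.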

answered Dec 07 '25 by l.augustyniak


