 

Speed up spaCy tokenizer

I am tokenizing tens of thousands of documents using spaCy. On average it takes about 5 seconds per document. Any suggestions on how to speed up the tokenizer?

Some additional information:

  • Input files are plain text files containing newline characters
  • Average file size is about 400 KB
  • Each input file's tokens are written to a new line in an output file (though I can change this if it helps increase speed)
  • There are 1655 stopwords
  • The output file is fed to fastText

The following is my code:

from pathlib import Path, PurePath
from time import time

import spacy
import en_core_web_sm

st = time()
nlp = en_core_web_sm.load(disable=['ner', 'tagger', 'parser', 'textcat'])
p = Path('input_text/').glob('*.txt')
files = ['input_text/' + x.name for x in p if x.is_file()]

#nlp = spacy.load('en_core_web_sm')

stopwords_file = 'stopwords.txt'

def getStopWords():
    f = open(stopwords_file, 'r')
    stopWordsSet = f.read()
    return stopWordsSet

stopWordsSet = getStopWords()
out_file = 'token_results.txt'
for file in files:
    #print (out_file)
    with open(file, encoding="utf8") as f:
        st_doc = time()
        for line in f:

            doc = nlp(line)

            for token in doc:
                if (not token.text.lower() in stopWordsSet
                    and not token.is_punct and not token.is_space and not token.like_num
                    and len(token.shape_)>1):                    

                    tup = (token.text, '|', token.lemma_)

                    appendFile = open(out_file, 'a', encoding="utf-8")
                    appendFile.write(" " + tup[0])
        print((time() - st_doc), 'seconds elapsed for', file)
        appendFile.write('\n')
        appendFile.close()
print((time() - st)/60, 'minutes elapsed')

1 Answer

  1. The main problem: open your output file once and leave it open until the end of your script. Repeatedly closing, reopening, and seeking to the end of an ever-larger text file is going to be extremely slow. (A combined sketch applying all three points follows the snippet at the end of this answer.)

  2. Read the stopwords into an actual set(). Otherwise you're searching for each token in one long string containing the whole stopwords file, which accidentally matches partial words and is much, much slower than checking set membership.

  3. Use nlp.pipe(), or for tokenization alone nlp.tokenizer.pipe(), to speed up the spaCy part a bit. With a bunch of short one-sentence documents this doesn't make a huge difference. It is much faster to tokenize one large document than to treat each line as an individual document, but whether you want to do that depends on how your data is structured. If you're only tokenizing, you can increase the maximum document length (nlp.max_length) if you need to.

texts = f.readlines()
docs = nlp.tokenizer.pipe(texts)

for doc in docs:
    for token in doc:
        ...
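
Putting the three points together, a minimal sketch of a rewritten script might look like the following. It is only an illustration under assumptions carried over from the question (files under input_text/, a stopwords.txt with one word per line, and one line of space-separated tokens written per input file); adjust the filtering and output format to your needs.

from pathlib import Path
from time import time

import spacy

# Tokenizer-only pipeline; assumes en_core_web_sm is installed.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'tagger', 'parser', 'textcat'])

# Point 2: load the stopwords into a real set for exact, constant-time lookups.
with open('stopwords.txt', encoding='utf8') as f:
    stop_words = {line.strip().lower() for line in f if line.strip()}

files = sorted(Path('input_text/').glob('*.txt'))

st = time()
# Point 1: open the output file once and keep it open for the whole run.
with open('token_results.txt', 'w', encoding='utf-8') as out:
    for path in files:
        with open(path, encoding='utf8') as f:
            lines = f.readlines()
        kept = []
        # Point 3: tokenize all lines of a file as one batch.
        for doc in nlp.tokenizer.pipe(lines):
            for token in doc:
                if (token.text.lower() not in stop_words
                        and not token.is_punct and not token.is_space
                        and not token.like_num and len(token.shape_) > 1):
                    kept.append(token.text)
        # One line of space-separated tokens per input file, as in the question.
        out.write(' '.join(kept) + '\n')
print((time() - st) / 60, 'minutes elapsed')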


