
How to efficiently count bigrams over multiple documents in python

Tags: python, nlp, nltk

I have a set of text documents and want to count the number of bigrams over all text documents.

First, I create a list where each element is again a list representing the words in one specific document:

print(doc_clean)
# [['This', 'is', 'the', 'first', 'doc'], ['And', 'this', 'is', 'the', 'second'], ..]

Then, I extract the bigrams document-wise and store them in a list:

bigrams = []
for doc in doc_clean:
    bigrams.extend([(doc[i-1], doc[i]) 
                   for i in range(1, len(doc))])
print(bigrams)
# [('This', 'is'), ('is', 'the'), ..]

Now, I want to count the frequency of each unique bigram:

bigrams_freq = [(b, bigrams.count(b)) 
                for b in set(bigrams)]

Generally, this approach works, but it is far too slow. The list of bigrams is quite big, with ~5 million entries in total and ~300k unique bigrams. On my laptop, the current approach takes too long for the analysis.

Thanks for helping me!

ash bounty asked Feb 25 '26 19:02


1 Answer

You could try the following:

from collections import Counter
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Requires the NLTK data packages 'punkt', 'stopwords' and 'wordnet'
# (nltk.download('punkt') etc. if they are not installed yet)
stop_words = set(stopwords.words('english'))

doc_1 = 'Convolutional Neural Networks are very similar to ordinary Neural Networks from the previous chapter'
doc_2 = 'Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way.'
doc_3 = 'In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth.'
docs = [doc_1, doc_2, doc_3]

# Join all documents into one lower-cased string
docs = ' '.join(filter(None, docs)).lower()

# Tokenize, drop stop words, keep only alphabetic tokens, and lemmatize
tokens = word_tokenize(docs)
tokens = [t for t in tokens if t not in stop_words]
word_l = WordNetLemmatizer()
tokens = [word_l.lemmatize(t) for t in tokens if t.isalpha()]

# Extract bigrams and count them in a single pass with Counter
bi_grams = list(ngrams(tokens, 2))
counter = Counter(bi_grams)
counter.most_common(5)

Out[82]: 
[(('neural', 'network'), 4),
 (('convolutional', 'neural'), 2),
 (('network', 'similar'), 1),
 (('similar', 'ordinary'), 1),
 (('ordinary', 'neural'), 1)]
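
If you want to keep your own preprocessing exactly as it is, the expensive part is only the counting step: calling `bigrams.count(b)` once per unique bigram rescans the full 5-million-entry list ~300k times. Replacing it with a single `Counter` pass fixes that on its own. A minimal sketch against the `doc_clean` structure from the question, using only the standard library:

```python
from collections import Counter

# Same structure as doc_clean in the question
doc_clean = [['This', 'is', 'the', 'first', 'doc'],
             ['And', 'this', 'is', 'the', 'second']]

# Count bigrams document by document in one O(n) pass.
# zip(doc, doc[1:]) yields the same pairs as the index loop
# in the question, and Counter.update avoids building the
# full multi-million-entry list in memory.
counter = Counter()
for doc in doc_clean:
    counter.update(zip(doc, doc[1:]))

bigrams_freq = counter.most_common()
```

`counter.most_common()` returns the same `(bigram, count)` pairs as the `bigrams_freq` list comprehension in the question, but sorted by frequency and computed in a single pass instead of one full scan per unique bigram.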
KRKirov answered Feb 28 '26 10:02