I'm trying to find k most common n-grams from a large corpus. I've seen lots of places suggesting the naïve approach - simply scanning through the entire corpus and keeping a dictionary of the count of all n-grams. Is there a better way to do this?
Based on the results, the model performs at its best with the n-gram range of (1,5). This means that training the model with n-grams ranging from unigrams to 5-grams help achieve optimal results, but larger n-grams only result in more sparse input features, which hampers model performance.
Well, in Natural Language Processing, or NLP for short, n-grams are used for a variety of things. Some examples include auto completion of sentences (such as the one we see in Gmail these days), auto spell check (yes, we can do that as well), and to a certain extent, we can check for grammar in a given sentence.
An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of a (n − 1)–order Markov model.
N-grams are the basic features commonly used in sequence-based malicious code detection methods in computer virology research.
In Python, using NLTK:
$ wget http://norvig.com/big.txt
$ python
>>> from collections import Counter
>>> from nltk import ngrams
>>> bigtxt = open('big.txt').read()
>>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]
In Python, native (see Fast/Optimize N-gram implementations in python):
>>> import collections
>>> def ngrams(text, n=2):
...     return zip(*[text[i:] for i in range(n)])
>>> ngram_counts = collections.Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
    [(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]
In Julia, see Generate ngrams with Julia
import StatsBase: countmap
import Iterators: partition
bigtxt = readstring(open("big.txt"))
ngram_counts = countmap(collect(partition(split(bigtxt), 2, 1)))
Rough timing:
$ time python ngram-test.py # With NLTK.
real    0m3.166s
user    0m2.274s
sys 0m0.528s
$ time python ngram-native-test.py 
real    0m1.521s
user    0m1.317s
sys 0m0.145s
$ time julia ngram-test.jl 
real    0m3.573s
user    0m3.188s
sys 0m0.306s
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With