I am implementing a language model as a personal challenge, as part of a simple web application. So far I have avoided NLTK, but I ran into a MemoryError with a reasonably large corpus (a vocabulary of about 50,000 words and about 440,000 trigrams). I first used a standard Python dictionary and then tried a numpy array to store all word-n-gram probabilities as a matrix. So it seems the solution is a more efficient data structure, something like what was mentioned in "train a language model using Google Ngrams", or storing the model on disk. In general, what approach would you recommend for storing an n-gram model (in memory or on disk) and then using it as part of a web app?
You are not the first to try an N-gram language model and you won't be the last.
As the other answer notes, you can store a large dictionary in a tree-like structure so that all n-grams sharing their first k words (k < n) also share nodes. It is easy to implement something akin to the VocabTreeNode I use in the following post (a minimal sketch follows the link):
Building a Vocab Tree
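Here is a minimal sketch of that idea. It is illustrative only: the class name NgramNode and its methods are mine, not the actual VocabTreeNode from the post. Each node holds one word, so n-grams that share a prefix share a path of nodes, and counts kept on every node let you read conditional probabilities off directly.

```python
# Illustrative sketch only -- not the actual VocabTreeNode from the linked post.
# Each node represents one word; a trigram ("the", "cat", "sat") is a path of
# three nodes, so trigrams sharing a prefix share nodes and memory.
class NgramNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}   # next word -> NgramNode
        self.count = 0       # how many times the n-gram ending here was seen

    def add(self, ngram):
        node = self
        for word in ngram:
            node = node.children.setdefault(word, NgramNode())
            node.count += 1
        return node

    def count_of(self, ngram):
        node = self
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count

    def probability(self, ngram):
        """Maximum-likelihood P(last word | preceding words)."""
        context = self.count_of(ngram[:-1])
        return self.count_of(ngram) / context if context else 0.0


root = NgramNode()
root.add(("the", "cat", "sat"))
root.add(("the", "cat", "ran"))
print(root.probability(("the", "cat", "sat")))  # 0.5
```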
Once you have stored the n-grams in this format, it is often prudent to prune the tree: if you keep a count on each node, you can remove elements with low frequency. If your data is messy, low-frequency n-grams often correspond to misspelled words or real-word errors. A sketch of such a pruning pass follows.
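For example, a pruning pass over the NgramNode trie sketched above could look like this (the min_count threshold is an arbitrary choice to tune against your corpus):

```python
# Sketch: prune low-frequency n-grams from the NgramNode trie above.
# min_count is an arbitrary threshold; tune it to your corpus.
def prune(node, min_count=2):
    for word, child in list(node.children.items()):
        if child.count < min_count:
            del node.children[word]      # drops the whole subtree
        else:
            prune(child, min_count)

prune(root, min_count=2)
```

Deleting a child drops the whole subtree below it, which is usually what you want: with this counting scheme an n-gram can only be frequent if its prefixes are.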
Even with the efficiency of the tree structure, it is difficult to extend this approach beyond n = 3 unless you have a small vocabulary, such as characters. For moderately large n, say n = 6, the popular solution is (modified) Kneser-Ney smoothing. A library like
https://github.com/kpu/kenlm
is efficient and has a Python interface. For large corpora, pruning is still recommended when building your own model, as is trie-like compression to turn the ARPA model into a binary. The model returns log-probabilities as scores. My first 6-gram model was 11 GB from a 7 GB corpus; after pruning and compressing I got a reasonable 400 MB model, which was very fast once loaded. By judiciously focusing your corpus, it should be easy to get a model that works for a web application. A rough sketch of the workflow is below.
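The workflow looks roughly like the following; the lmplz/build_binary flags are from memory, so check the KenLM documentation before relying on them:

```python
# Rough sketch of the KenLM workflow; verify flags against the KenLM docs.
#
# Build an ARPA model, pruning low-count higher-order n-grams, then compress
# it into a trie-based binary (shell commands, run outside Python):
#   lmplz -o 6 --prune 0 1 1 1 1 1 < corpus.txt > model.arpa
#   build_binary trie model.arpa model.binary
#
# Query it from Python after installing the module (e.g. pip install kenlm):
import kenlm

model = kenlm.Model("model.binary")
# score() returns a log10 probability for the whole sentence
print(model.score("this is a test sentence", bos=True, eos=True))
# full_scores() yields per-word (log10 prob, ngram length, is_oov) triples
for logprob, ngram_length, oov in model.full_scores("this is a test sentence"):
    print(logprob, ngram_length, oov)
```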
Similar, purely Python-based implementations of Kneser-Ney interpolation are available via pip. The canonical paper to follow if you implement a model yourself is:
An empirical study of smoothing techniques for language modeling
For very large n, where the above methods struggle, you can use your corpus to train an embedding and a recurrent neural network, such as an LSTM- or GRU-based network, to output either the log-probabilities or the probabilities themselves. This is not that hard if you have a handle on Keras or PyTorch; something like a 2-layer LSTM with attention should work well. There is a wealth of things one could do in the neural-network direction to encode n-grams.
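A bare-bones sketch of that direction in PyTorch (attention omitted for brevity; all sizes are placeholder values, not tuned recommendations):

```python
# Minimal sketch of a 2-layer LSTM language model in PyTorch.
# vocab_size, embed_dim and hidden_dim are placeholder values.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        hidden_states, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden_states)           # unnormalised scores per position

model = LSTMLanguageModel(vocab_size=50000)
tokens = torch.randint(0, 50000, (1, 10))        # a fake batch of 10 word ids
log_probs = torch.log_softmax(model(tokens), dim=-1)
# log_probs[:, t] gives log P(next word | words up to position t)
```

Training it is the usual cross-entropy setup: feed in word ids and compare the output at each position against the next word in the sequence.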
I'll split my answer into two parts: first, why storing the model in dictionaries is a bad idea, and second, the right data structure for storing n-grams.
Consider storing the words "Bond", "Boat" and "Build" in a dictionary. The size of a dictionary with these keys hashed to integers is roughly proportional to the number of words plus the number of their characters, so we are spending extra space on letters that repeat across keys. The problem becomes apparent: we waste a lot of memory storing parts of strings that we should not need to store again.
The question remains: what is the ideal data structure here? It needs to be space-efficient, sharing common prefixes instead of storing them repeatedly, while still allowing fast lookups and insertions.
One data structure that fits these requirements immediately comes to mind: a trie, or more precisely a prefix trie. Its inherent structure saves space on characters we would otherwise store several times over. With a small set of words like the example above the problem is not severe, but as the set of words grows, a hash table / dictionary will soon run out of space. A toy sketch is below.
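Here is a toy sketch of the idea (plain dicts as nodes, purely illustrative):

```python
# Sketch: a character-level prefix trie. "Bond", "Boat" and "Build" all share
# the node for "B", and "Bond"/"Boat" also share "o", so repeated prefixes are
# stored once instead of once per key.
def make_node():
    return {"children": {}, "is_word": False}

def insert(root, word):
    node = root
    for ch in word:
        node = node["children"].setdefault(ch, make_node())
    node["is_word"] = True

def contains(root, word):
    node = root
    for ch in word:
        node = node["children"].get(ch)
        if node is None:
            return False
    return node["is_word"]

trie = make_node()
for w in ("Bond", "Boat", "Build"):
    insert(trie, w)
print(contains(trie, "Boat"))   # True
print(contains(trie, "Bo"))     # False -- a prefix, not a stored word
```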
Hope this helped.