I'm looking for faster alternatives to NLTK for analyzing large corpora and doing basic things like calculating frequencies, PoS tagging, etc. spaCy seems great and easy to use in many ways, but I can't find a built-in function to count the frequency of a specific word, for example. I've looked at the spaCy documentation, but I can't find a straightforward way to do it. Am I missing something?
What I would like is the spaCy equivalent of this NLTK-style code:
tokens.count("word")  # tokens is the tokenized text in which the word is to be counted
In NLTK, the above code would tell me that the word "word" appears X times in my text.
Note that I've come across the count_by function, but it doesn't seem to do what I'm looking for.
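For what it's worth, `Doc.count_by` returns a dictionary keyed by integer attribute hashes rather than strings, which may be why it looks unhelpful at first. A minimal sketch of how a single word could be looked up follows; the sample sentence and the choice of the `LOWER` attribute are assumptions made for illustration, and a blank English pipeline is used so no trained model is needed.

```python
import spacy
from spacy.attrs import LOWER

# Blank pipeline: tokenizer only, no trained model required (an assumption
# made to keep the sketch self-contained).
nlp = spacy.blank("en")
doc = nlp("The word is a word, and another word follows.")

# count_by returns {attribute hash: frequency}, not {string: frequency}.
counts = doc.count_by(LOWER)

# Translate the string of interest into its hash via the string store.
word_id = nlp.vocab.strings["word"]
word_count = counts.get(word_id, 0)
print(word_count)
```

Going the other way, `nlp.vocab.strings[some_id]` turns a hash back into its string, so the whole dictionary can be made readable if needed.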
I use spaCy for frequency counts in corpora quite often. This is what I usually do:
import spacy
nlp = spacy.load("en_core_web_sm")
list_of_words = ['run', 'jump', 'catch']
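def word_count(string):
    """Count the tokens in string whose lemma is in list_of_words."""
    words_counted = 0
    my_string = nlp(string)
    for token in my_string:
        # token.text is the surface form and token.pos_ the part of speech,
        # if you need them; here only the lemma is compared, so inflected
        # forms such as "ran" match "run"
        lemma_word = token.lemma_
        if lemma_word in list_of_words:
            words_counted += 1
            print(lemma_word)
    return words_counted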
sentence = "I ran, jumped, and caught the ball."
words_counted = word_count(sentence)
print(words_counted)
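If you only need the direct equivalent of `tokens.count("word")`, one possible approach is to feed the tokens into a `collections.Counter` from the standard library. The sketch below is not a spaCy built-in; the sample sentence, the lowercasing, and the punctuation filter are assumptions made for illustration. It uses a blank English pipeline, which only tokenizes, so swap in a trained model if you want to count lemmas instead.

```python
from collections import Counter

import spacy

# Blank pipeline: tokenizer only, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("The word is a word, and another word follows.")

# One pass over the document builds frequencies for every token.
# Lowercasing and dropping punctuation are illustrative choices.
freqs = Counter(token.lower_ for token in doc if not token.is_punct)

# Equivalent of tokens.count("word")
print(freqs["word"])

# The same Counter also gives the most frequent tokens for free.
print(freqs.most_common(3))
```

Building the `Counter` once is cheaper than calling `count` repeatedly on a large corpus, because each `count` call rescans the whole token list.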