I'm trying to compute a simple word frequency using scikit-learn's CountVectorizer.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird","bird"]
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)
print cv.vocabulary_
{u'bird': 0, u'cat': 1, u'dog': 2, u'fish': 3}
I was expecting it to return {u'bird': 2, u'cat': 3, u'dog': 2, u'fish': 2}.
Word Counts with CountVectorizer You can use it as follows: Create an instance of the CountVectorizer class. Call the fit() function in order to learn a vocabulary from one or more documents. Call the transform() function on one or more documents as needed to encode each as a vector.
Count vectorizer creates a matrix with documents and token counts (bag of terms/tokens) therefore it is also known as document term matrix (dtm).
CountVectorizer will tokenize the data and split it into chunks called n-grams, of which we can define the length by passing a tuple to the ngram_range argument. For example, 1,1 would give us unigrams or 1-grams such as “whey” and “protein”, while 2,2 would give us bigrams or 2-grams, such as “whey protein”.
cv.vocabulary_ in this instance is a dict, where the keys are the words (features) that you've found and the values are indices, which is why they're 0, 1, 2, 3. It's just bad luck that it looked similar to your counts :)
You need to work with the cv_fit object to get the counts
from sklearn.feature_extraction.text import CountVectorizer
texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
print(cv.get_feature_names())
print(cv_fit.toarray())
# ["bird", "cat", "dog", "fish"]
# [[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]
Each row in the array is one of your original documents (strings), each column is a feature (word), and the element is the count for that particular word and document. You can see that if you sum each column you'll get the correct number
print(cv_fit.toarray().sum(axis=0))
# [2 3 2 2]
Honestly though, I'd suggest using collections.Counter or something from NLTK, unless you have some specific reason to use scikit-learn, as it'll be simpler.
cv_fit.toarray().sum(axis=0) definitely gives the correct result, but it will be much faster to perform the sum on the sparse matrix and then transform it to an array:
np.asarray(cv_fit.sum(axis=0))
We are going to use the zip method to make dict from a list of words and list of their counts
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
word_list = cv.get_feature_names()
count_list = cv_fit.toarray().sum(axis=0)
The outputs are following:
>> print word_list
['bird', 'cat', 'dog', 'fish']
>> print count_list
[2 3 2 2]
>> print dict(zip(word_list,count_list))
{'fish': 2, 'dog': 2, 'bird': 2, 'cat': 3}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With