python - Picking most relevant words for a tag cloud from text using nltk and scikit-learn

I want to get the most relevant words from a text in order to prepare a tag cloud.

I used CountVectorizer from the scikit-learn package:

cv = CountVectorizer(min_df=1, decode_error="ignore",  # charset_error was renamed decode_error
    stop_words="english", max_features=200)

This is nice, because it gives me words and frequencies:

counts = cv.fit_transform([text]).toarray().ravel()
words = np.array(cv.get_feature_names_out())  # get_feature_names() in older versions

I can filter out infrequent words:

words = words[counts > 1]
counts = counts[counts > 1]

as well as words that are numbers:

mask = np.array([w.isalpha() for w in words])  # compute the mask once, before filtering
words = words[mask]
counts = counts[mask]

But it's still not perfect...

My questions are:

  1. How to filter out verbs?
  2. How to perform stemming to get rid of different forms of the same word?
  3. How to call CountVectorizer to filter out two-letter words?

Please also note:

  1. I'm fine with nltk, but an answer like "you should try nltk" is not an answer; please give me code.
  2. I don't want to use a Bayesian classifier or other techniques that require training a model. I don't have time for that, and I don't have examples to train a classifier on.
  3. The language is English.
asked Dec 13 '25 by mnowotka

1 Answer

1- How to filter out verbs?

Depends on the language(s) you want to support. You will need a good sentence + word tokenizer pair and a part-of-speech tagger. All three components are commonly implemented using machine learning models (although you can get good results with rule-based sentence and word tokenizers). If you want to support English only, you can find pre-trained models in nltk, but I am no expert and you will have to read the documentation and tutorials :)

Once you know how to split a text into sentences and words, and how to identify and remove the verbs, you can wrap that as a Python function and pass it to the CountVectorizer constructor, see below.

2- How to perform stemming to get rid of different forms of the same word?

You will have to pass a custom tokenizer Python callable to the CountVectorizer constructor to handle token extraction, stemming, and optional filtering at the same time. This is explained in the documentation.

For the stemming itself, it depends on the language you want to support but you can start with http://nltk.org/api/nltk.stem.html

There is a pull request to make it more natural to plug a stemmer:

https://github.com/scikit-learn/scikit-learn/pull/1537
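A minimal sketch of the custom-tokenizer route, assuming nltk's SnowballStemmer for English (the helper name is mine):

```python
import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")
# Reuse CountVectorizer's default token pattern for the extraction step.
token_re = re.compile(r"(?u)\b\w\w+\b")

def stemming_tokenizer(text):
    """Extract tokens with the default pattern, then stem each one."""
    return [stemmer.stem(tok) for tok in token_re.findall(text.lower())]

cv = CountVectorizer(tokenizer=stemming_tokenizer, min_df=1)
```

With this in place, "running" and "runs" both count toward the single feature "run".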

3- How to call CountVectorizer to filter out two letter words?

You can change the default regular expression used for tokenization:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> CountVectorizer().token_pattern
u'(?u)\\b\\w\\w+\\b'
>>> CountVectorizer(token_pattern=u'(?u)\\b\\w{3,}\\b').build_tokenizer()(
...    'a ab abc abcd abcde xxxxxxxxxxxxxxxxxxxxxx')
['abc', 'abcd', 'abcde', 'xxxxxxxxxxxxxxxxxxxxxx']
>>> CountVectorizer(token_pattern=u'(?u)\\b\\w{3,9}\\b').build_tokenizer()(
...     'a ab abc abcd abcde xxxxxxxxxxxxxxxxxxxxxx')
['abc', 'abcd', 'abcde']

But in your case you might want to replace the tokenizer as a whole. You can still have a look at the source of the default implementation for reference.

One remark though: to build a tag cloud, it's probably much easier to use nltk directly along with the collections.Counter class from the Python standard library; sklearn does not give you much for this task.
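That route can be sketched in a few lines (the stopword set here is a tiny placeholder; in practice you would use nltk's English stopword list):

```python
import re
from collections import Counter

# Placeholder stopword set; substitute nltk.corpus.stopwords.words("english").
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on", "near"}

def tag_cloud_counts(text, min_length=3, top_n=200):
    """Return (word, count) pairs for a tag cloud: lowercase alphabetic
    tokens, minus stopwords and short words, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    words = [w for w in words if len(w) >= min_length and w not in STOPWORDS]
    return Counter(words).most_common(top_n)
```

This also answers the earlier filtering questions in passing: the regexp keeps only alphabetic tokens, and the length check drops two-letter words.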

answered Dec 14 '25 by ogrisel