
How to deactivate the default stop words feature for sklearn TfidfVectorizer

I am trying to get the tf-idf values for Japanese words. The problem is that sklearn's TfidfVectorizer drops some Japanese characters that I want to keep, apparently treating them as stop words.

The following is the example:

from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(stop_words=None)

words_list = ["歯", "が", "痛い"]
tfidf_matrix = tf.fit_transform(words_list)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
feature_names = tf.get_feature_names_out().tolist()
print(feature_names)

The output is: ['痛い']

However, I want to keep all three tokens in the list. I believe TfidfVectorizer removes tokens of length 1 as stop words. How can I deactivate this default behavior and keep every character?

Daiki Akiyoshi asked Nov 02 '25 23:11



1 Answer

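The single-character tokens are dropped by the tokenizer, not by a stop-word list (stop_words=None is already the default). You can change the token_pattern parameter from (?u)\b\w\w+\b (the default) to (?u)\b\w\w*\b; in a regular, non-raw Python string each backslash is written twice, as in the code below. The default only matches tokens that have two or more word characters. In case you are not familiar with regex, + means one or more, so \w\w+ matches a word with two or more word characters; * means zero or more, so \w\w* matches a word with one or more word characters: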

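from sklearn.feature_extraction.text import TfidfVectorizer

# Allow single-character tokens: \w\w* matches one or more word characters.
tf = TfidfVectorizer(stop_words=None, token_pattern='(?u)\\b\\w\\w*\\b')

words_list = ["歯", "が", "痛い"]
tfidf_matrix = tf.fit_transform(words_list)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
feature_names = tf.get_feature_names_out().tolist()
print(feature_names)
# ['が', '歯', '痛い']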
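
To see the difference between the two patterns without involving the vectorizer, here is a minimal sketch that applies both regexes directly with Python's re module. The helper tokenize and the constants DEFAULT_PATTERN and RELAXED_PATTERN are names made up for this illustration and are not part of scikit-learn; the sketch only mimics the regex-matching step and ignores the vectorizer's other preprocessing, such as lowercasing.

```python
import re

# Default token_pattern of TfidfVectorizer: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern from the answer: one or more word characters.
RELAXED_PATTERN = r"(?u)\b\w\w*\b"

words_list = ["歯", "が", "痛い"]


def tokenize(pattern, docs):
    """Hypothetical helper: return the regex matches found in each document."""
    regex = re.compile(pattern)
    return [regex.findall(doc) for doc in docs]


default_tokens = tokenize(DEFAULT_PATTERN, words_list)
relaxed_tokens = tokenize(RELAXED_PATTERN, words_list)

print(default_tokens)  # [[], [], ['痛い']]
print(relaxed_tokens)  # [['歯'], ['が'], ['痛い']]
```

With the default pattern the one-character documents yield no tokens at all, which is why only '痛い' ends up in the vocabulary.
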
Psidom answered Nov 04 '25 12:11



