I am trying to compute tf-idf values for Japanese words. The problem I am having is that sklearn's TfidfVectorizer removes some Japanese characters that I want to keep, as if they were stop words.
The following is a minimal example:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words = None)
words_list = ["歯","が","痛い"]
tfidf_matrix = tf.fit_transform(words_list)
feature_names = tf.get_feature_names()
print (feature_names)
The output is: ['痛い']
However, I want to keep all three characters in the list. I believe TfidfVectorizer removes tokens of length 1 as stop words. How can I deactivate this default behavior and keep all the characters?
You can change the token_pattern parameter from its default, (?u)\b\w\w+\b, to (?u)\b\w\w*\b. The default pattern only matches tokens with two or more word characters (in case you are not familiar with regex, + means "one or more", so \w\w+ matches words with at least two word characters; * means "zero or more", so \w\w* matches words with one or more word characters). It is this pattern, not the stop-word list, that drops your single-character tokens:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words=None, token_pattern=r'(?u)\b\w\w*\b')
words_list = ["歯","が","痛い"]
tfidf_matrix = tf.fit_transform(words_list)
feature_names = tf.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
print(feature_names)
# ['が', '歯', '痛い']
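To see what the pattern change does at the tokenization level, independent of scikit-learn, you can compare the two regexes directly with the standard library's re.findall. This is a minimal sketch using the same default and modified patterns on a whitespace-joined version of your tokens:

```python
import re

text = "歯 が 痛い"

# Default TfidfVectorizer pattern: requires two or more word characters.
default_pattern = r"(?u)\b\w\w+\b"
# Modified pattern: one or more word characters, so single chars survive.
modified_pattern = r"(?u)\b\w\w*\b"

print(re.findall(default_pattern, text))   # ['痛い']
print(re.findall(modified_pattern, text))  # ['歯', 'が', '痛い']
```

Note that for real Japanese text (which has no spaces between words) you would still need to segment the text first, e.g. with a morphological analyzer, before the vectorizer's token pattern is applied.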