I would like to compute the similarity between lists using TfidfVectorizer and CountVectorizer.
I have lists like the ones below:
list1 = [['i','love','machine','learning','its','awesome'],
['i', 'love', 'coding', 'in', 'python'],
['i', 'love', 'building', 'chatbots']]
list2 = ['i', 'love', 'chatbots']
I would like to see the similarity between list1[0] and list2, list1[1] and list2, and list1[2] and list2.
The expected output should look like [0.99, 0.67, 0.54].
From the docs, TfidfVectorizer is:
"Equivalent to CountVectorizer followed by TfidfTransformer."
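That equivalence can be checked directly. The following is a minimal sketch, assuming default parameters for all three classes and using the same four sentences as the question; the variable names `two_step`, `one_step`, and `same` are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "i love machine learning its awesome",
    "i love coding in python",
    "i love building chatbots",
    "i love chatbots",
]

# Route 1: raw term counts first, then TF-IDF weighting as a second step.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts).toarray()

# Route 2: TfidfVectorizer performs both steps in a single object.
one_step = TfidfVectorizer().fit_transform(corpus).toarray()

# Both routes should produce the same matrix (up to floating-point error).
same = np.allclose(two_step, one_step)
print(same)
```

So whichever class you start from, the TF-IDF weights are the same as long as the parameters match.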
Here is the code
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"i love machine learning its awesome",
"i love coding in python",
"i love building chatbots",
"i love chatbots"
]
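# Note: the default token pattern keeps only tokens of two or more
# characters, so "i" is dropped from the vocabulary.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# print(vectorizer.get_feature_names_out())
arr = X.toarray()  # dense array, one L2-normalized row per sentence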
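And here are the results using cosine similarity. TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows is their cosine similarity:
import numpy as np
# similarity of your `list1[0]` and `list2`
np.dot(arr[0], arr[3]) # gives ~0.139
# similarity of your `list1[1]` and `list2`
np.dot(arr[1], arr[3]) # gives ~0.159
# similarity of your `list1[2]` and `list2`
np.dot(arr[2], arr[3]) # gives ~0.687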
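Since the question starts from token lists rather than sentences, here is a sketch that works on the lists directly and returns all three scores at once. It assumes an identity function is acceptable as the `analyzer`, so the one-letter token 'i' is kept in the vocabulary (the default token pattern drops one-character tokens); because of that, the scores will differ somewhat from the values for the joined sentences.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

list1 = [['i', 'love', 'machine', 'learning', 'its', 'awesome'],
         ['i', 'love', 'coding', 'in', 'python'],
         ['i', 'love', 'building', 'chatbots']]
list2 = ['i', 'love', 'chatbots']

# The input is already tokenized, so the analyzer just returns each list as-is.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)

# Fit on all four token lists together; list2 becomes the last row.
X = vectorizer.fit_transform(list1 + [list2])

# Compare the last row (list2) against every row that came from list1.
scores = cosine_similarity(X[-1:], X[:-1]).ravel().tolist()
print(scores)
```

The result is a plain Python list with one score per entry of list1, in the same order.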
Or you can use Jaccard similarity with CountVectorizer, which I think is closer to what you are expecting:
from sklearn.metrics import jaccard_score
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
arr = X.toarray()
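# The rows are binary here because no word repeats within a sentence.
jaccard_score(arr[0], arr[3]) # gives ~0.167 (1 shared word out of 6 distinct)
jaccard_score(arr[1], arr[3]) # gives 0.2 (1 shared word out of 5 distinct)
jaccard_score(arr[2], arr[3]) # gives ~0.667 (2 shared words out of 3 distinct)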
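As a cross-check, Jaccard similarity can also be computed on the original token lists with plain Python sets, without any vectorizer. This is only a sketch: the helper `jaccard` is a hypothetical name, and because it keeps the token 'i' (which CountVectorizer drops by default), its numbers are not identical to the CountVectorizer-based ones.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token lists: |A & B| / |A | B|."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty inputs
    return len(a & b) / len(a | b)

list1 = [['i', 'love', 'machine', 'learning', 'its', 'awesome'],
         ['i', 'love', 'coding', 'in', 'python'],
         ['i', 'love', 'building', 'chatbots']]
list2 = ['i', 'love', 'chatbots']

# One score per entry of list1, rounded for readability.
scores = [round(jaccard(tokens, list2), 3) for tokens in list1]
print(scores)  # [0.286, 0.333, 0.75]
```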