I want to calculate the cosine similarity of two lists like the following:
A = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']
B = [u'home (private)', u'school', u'bank', u'shopping mall']
I know the cosine similarity of A and B should be
3/(sqrt(7)*sqrt(4)).
I tried to reshape the lists into a form like 'home bank bank building factory', which looks like a sentence. However, some elements (e.g. home (private)) contain spaces and some contain brackets, so I find it difficult to count the word occurrences.
Do you know how to count the word occurrences in a list like this, so that for list B they can be represented as
{'home (private)': 1, 'school': 1, 'bank': 1, 'shopping mall': 1}?
Or do you know how to calculate the cosine similarity of these two lists?
Thank you very much
Cosine similarity measures how similar two non-zero vectors of an inner product space are by taking the cosine of the angle between them: similarity = (A · B) / (||A|| * ||B||), where A and B are the vectors, A · B is their dot product, and ||A|| and ||B|| are their lengths.
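As a quick illustration of the formula, here is a minimal sketch for plain numeric vectors. The function name `cosine_similarity` and the sample vectors are made up for this example, and the function assumes both vectors have the same length and are non-zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))        # A . B
    norm_u = math.sqrt(sum(x * x for x in u))     # ||A||
    norm_v = math.sqrt(sum(y * y for y in v))     # ||B||
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1, 2], [2, 4])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])      # perpendicular vectors
diagonal = cosine_similarity([1, 0], [1, 1])        # 45 degrees apart

print(same_direction, orthogonal, diagonal)
```

Parallel vectors score about 1, perpendicular vectors score 0, and vectors 45 degrees apart score about 0.707 (the cosine of 45 degrees).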
As described in "Python: tf-idf-cosine: to find document similarity", document similarity can also be calculated as the cosine between tf-idf vectors.
To check whether two lists are equal regardless of order, Python's sort() method can be combined with the == operator: sort both lists so that equal elements end up at the same index positions, then compare them with ==. This tests for equality only; it does not measure how similar two different lists are.
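A small sketch of that sort-and-compare check, using made-up sample lists. It uses `sorted()` rather than `list.sort()` so the inputs are left untouched; that is a choice made for this example, and either approach gives the same comparison.

```python
first = ['bank', 'school', 'home (private)']
second = ['home (private)', 'bank', 'school']

# sorted() returns new lists, so the inputs keep their original order;
# list.sort() would give the same comparison but sort in place.
same_elements = sorted(first) == sorted(second)

# Duplicates matter: an extra 'bank' makes the lists unequal.
with_duplicate = sorted(first + ['bank']) == sorted(second)

print(same_elements, with_duplicate)
```

The result is a plain True or False, which is why this check cannot replace cosine similarity for lists that only partly overlap.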
from collections import Counter
# word-lists to compare
a = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']
b = [u'home (private)', u'school', u'bank', u'shopping mall']
# count word occurrences
a_vals = Counter(a)
b_vals = Counter(b)
# convert to word-vectors
words  = sorted(a_vals.keys() | b_vals.keys())          # sorted so the vector order is reproducible
a_vect = [a_vals.get(word, 0) for word in words]        # [2, 1, 1, 1, 0, 0]
b_vect = [b_vals.get(word, 0) for word in words]        # [1, 0, 0, 1, 1, 1]
# find cosine
len_a  = sum(av*av for av in a_vect) ** 0.5             # sqrt(7)
len_b  = sum(bv*bv for bv in b_vect) ** 0.5             # sqrt(4)
dot    = sum(av*bv for av,bv in zip(a_vect, b_vect))    # 3
cosine = dot / (len_a * len_b)                          # 0.5669467
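The steps above can also be wrapped in a reusable function. The following is a sketch, not a definitive implementation: the name `list_cosine` is hypothetical, and returning 0.0 for an empty list is an assumption made here to avoid dividing by zero. It works on the `Counter` objects directly, so no explicit word vectors are built.

```python
from collections import Counter
import math

def list_cosine(items_a, items_b):
    """Cosine similarity of two lists, treating each distinct item as one dimension."""
    counts_a, counts_b = Counter(items_a), Counter(items_b)
    # Only items present in both lists contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[item] * counts_b[item] for item in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty list is treated as having similarity 0
    return dot / (norm_a * norm_b)

a = ['home (private)', 'bank', 'bank', 'building(condo/apartment)', 'factory']
b = ['home (private)', 'school', 'bank', 'shopping mall']

occurrences_b = Counter(b)        # occurrence count for each element of b
similarity = list_cosine(a, b)    # 3 / (sqrt(7) * sqrt(4))
print(dict(occurrences_b), similarity)
```

Because `Counter` treats each list element as one whole key, spaces and brackets inside an element such as 'home (private)' need no special handling.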