 

Python: how many similar words in string?

Tags: python, string

I have some ugly strings similar to these:

   string1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
   string2 = 'Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)'

I would like a library or algorithm that will give me a percentage of how many words they have in common, while excluding special characters such as ',' and ':' and ''' and '{' etc.

I know of the Levenshtein algorithm. However, it measures similarity at the CHARACTER level, whereas I would like to compare how many WORDS the strings have in common.

Alex Gordon asked Nov 20 '25 17:11



1 Answer

Regex could easily give you all the words:

import re
s1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
s2 = "Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)"
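# Lowercase first so matching is case-insensitive; raw strings avoid
# invalid-escape warnings for '\w' on recent Python versions.
s1w = re.findall(r'\w+', s1.lower())
s2w = re.findall(r'\w+', s2.lower())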

collections.Counter (Python 2.7+) can quickly count up the number of times a word occurs.

from collections import Counter
s1cnt = Counter(s1w)
s2cnt = Counter(s2w)
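
As a rough sketch of how those counters could be turned into a percentage: the `&` operator on two Counter objects keeps the minimum count of each shared word, so repeated words are only matched as often as they occur in both strings. The function name `word_overlap_percent` and the choice of denominator (total words in both strings, which mirrors how `difflib` defines its ratio) are assumptions made for illustration, not something the question specifies.

```python
import re
from collections import Counter

def word_overlap_percent(a, b):
    """Hypothetical helper: percentage of words shared by a and b.

    Uses 2 * shared / (len(a_words) + len(b_words)), so two strings with
    identical word counts score 100 regardless of word order.
    """
    a_cnt = Counter(re.findall(r'\w+', a.lower()))
    b_cnt = Counter(re.findall(r'\w+', b.lower()))
    # Multiset intersection: each word counted min(count_a, count_b) times.
    shared = sum((a_cnt & b_cnt).values())
    total = sum(a_cnt.values()) + sum(b_cnt.values())
    return 100.0 * 2 * shared / total if total else 0.0

print(word_overlap_percent("Bach: Mass in B minor", "Mass in B minor (Bach)"))
```

Unlike a sequence comparison, this ignores word order entirely, which may or may not be what you want for titles like these.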

A very crude comparison could be done through set.intersection or difflib.SequenceMatcher, but it sounds like you would want to implement a Levenshtein algorithm that operates on words rather than characters, for which you could use those two lists as input.

common = set(s1w).intersection(s2w)
# the only shared word is 'c': set(['c']) on Python 2, {'c'} on Python 3

import difflib
common_ratio = difflib.SequenceMatcher(None, s1w, s2w).ratio()
# Parenthesized form works on both Python 2 and Python 3.
print('%.1f%% of words common.' % (100*common_ratio))

Prints: 3.4% of words common.
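
If you do want the word-level Levenshtein mentioned above, a minimal sketch might look like the following. It is the standard dynamic-programming edit distance, applied to lists of words instead of characters. The names `word_levenshtein` and `word_similarity_percent` are made up for this example, and dividing by the longer word list to get a percentage is just one possible normalization.

```python
import re

def word_levenshtein(a_words, b_words):
    """Edit distance between two word lists (insert/delete/substitute = 1)."""
    # prev[j] holds the distance between the words consumed so far from
    # a_words and the first j words of b_words.
    prev = list(range(len(b_words) + 1))
    for i, aw in enumerate(a_words, 1):
        cur = [i]
        for j, bw in enumerate(b_words, 1):
            cost = 0 if aw == bw else 1
            cur.append(min(prev[j] + 1,          # delete aw
                           cur[j - 1] + 1,       # insert bw
                           prev[j - 1] + cost))  # keep or substitute
        prev = cur
    return prev[-1]

def word_similarity_percent(a, b):
    """Hypothetical normalization: 100 * (1 - distance / longer length)."""
    a_words = re.findall(r'\w+', a.lower())
    b_words = re.findall(r'\w+', b.lower())
    longest = max(len(a_words), len(b_words))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - word_levenshtein(a_words, b_words) / float(longest))

print(word_similarity_percent("Mass in B minor", "Mass in A minor"))
```

This treats words as either equal or different; a near-miss such as "Straus" versus "Strauss" counts as a full substitution, so you may want to combine it with a character-level measure if spelling variants matter.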

Nick T answered Nov 23 '25 05:11



