
Python string comparison similarity

Tags:

python

I am trying to compare two lists of data that contain free text denoting the same objects. For example:

List 1 ['abc LLC','xyz, LLC']
List 2 ['abc , LLC','xyz LLC']

This is a simple example, but in practice there can be many variations, such as differences in case or an extra "." in between. Is there a Python package that can do the comparison and give a measure of similarity?

Raman Narayanan asked May 10 '26 08:05



1 Answer

You could use an implementation of the Levenshtein distance algorithm for approximate (fuzzy) string matching, for instance the one from Wikibooks.
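To give a rough idea of what such an implementation looks like, here is a minimal Python 3 sketch of the classic dynamic-programming approach. It is not the Wikibooks code; the names `levenshtein` and `similarity` are made up for this example, and dividing by the longer length is just one common way to turn the distance into a 0..1 score.

```python
def levenshtein(a, b):
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the first i-1 chars of a and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical helper: scale the distance to a score between 0.0 and 1.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein('abc LLC', 'abc , LLC'))
```

With the data from the question, 'abc LLC' and 'abc , LLC' are two edits apart (the extra space and comma), so a threshold on `similarity` could be used to decide what counts as "the same"; where to put that threshold depends on your data.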

Another option is to normalize the strings before a plain equality comparison: fold everything to lower case, strip spaces and punctuation, and so on. Whether this is appropriate depends on your use case:

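import string
import unicodedata

# Characters kept after folding: ASCII letters and digits only
allowed = set(string.ascii_letters + string.digits)

def fold(s):
  # Lower-case, split accented characters (NFKD), then drop anything non-ASCII
  s = unicodedata.normalize("NFKD", str(s).lower())
  s = s.encode("ascii", "ignore").decode("ascii")
  # Strip spaces, punctuation and everything else not in `allowed`
  return "".join(c for c in s if c in allowed)

for example in ['abc LLC','xyz, LLC', 'abc , LLC','xyz LLC']:
  print("%r -> %r" % (example, fold(example)))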

would print

'abc LLC' -> 'abcllc'
'xyz, LLC' -> 'xyzllc'
'abc , LLC' -> 'abcllc'
'xyz LLC' -> 'xyzllc'
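
If folding alone does not make the strings equal (typos, say), the two ideas can be combined: fold first, then score what is left. The sketch below is one possible way to do that in Python 3 using `difflib.SequenceMatcher` from the standard library, which is a different similarity measure from Levenshtein distance. The helper `best_match` and this compact version of `fold` are assumptions made for the example, not part of any package.

```python
import difflib
import string
import unicodedata

# Assumption for this sketch: only ASCII lower-case letters and digits matter.
ALLOWED = set(string.ascii_lowercase + string.digits)


def fold(s):
    """Lower-case s, decompose accented characters, keep only ALLOWED characters."""
    s = unicodedata.normalize("NFKD", s.lower())
    return "".join(c for c in s if c in ALLOWED)


def best_match(name, candidates):
    """Hypothetical helper: return (candidate, score) with the highest ratio to name."""
    folded = fold(name)
    scored = [
        (c, difflib.SequenceMatcher(None, folded, fold(c)).ratio())
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


list1 = ['abc LLC', 'xyz, LLC']
list2 = ['abc , LLC', 'xyz LLC']
for name in list1:
    match, score = best_match(name, list2)
    print("%r -> %r (%.2f)" % (name, match, score))
```

For the lists in the question each entry finds its counterpart with a score of 1.0, because the folded strings are identical. For messier data you would probably want to reject matches below some cutoff; the right value is something to tune on your own data.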
AKX answered May 11 '26 20:05



