I am using a dataset of company names that may contain non-identical duplicates.
For example, the list may contain "company A" but also "c.o.m.p.a.n.y A" or "comp A".
Is there a Python script, using NLP for example, that can find similar names in a dataset?
Thanks in advance
You can use spaCy to compute the similarity between two texts.
import spacy
nlp = spacy.load("en_core_web_md")  # use a model that ships word vectors (md or lg), not sm
doc1 = nlp("Coca-Cola")
doc2 = nlp("Pepsi")
doc3 = nlp("Company Coca-Cola")
doc4 = nlp("Company Pepsi-Cola")
print(doc1, "<->", doc2, doc1.similarity(doc2))
print(doc3, "<->", doc4, doc3.similarity(doc4))
This prints the following similarities:
Coca-Cola <-> Pepsi 0.6684898494102074
Company Coca-Cola <-> Company Pepsi-Cola 0.934960639746236
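Note that spaCy's similarity compares word vectors, so it measures how related two names are in meaning rather than in spelling. For variants like "c.o.m.p.a.n.y A" or "comp A", a character-level comparison may fit better. Here is a minimal sketch using only the standard library's difflib. The helper names (normalize, similar_pairs) and the 0.7 threshold are my own assumptions for illustration, so tune them on your data.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase the name and drop everything except letters, digits and spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    # Collapse runs of whitespace into single spaces.
    return " ".join(cleaned.split())


def similar_pairs(names, threshold=0.7):
    """Return (name_a, name_b, score) for every pair scoring at least threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


if __name__ == "__main__":
    names = ["company A", "c.o.m.p.a.n.y A", "comp A", "Pepsi"]
    for a, b, score in similar_pairs(names):
        print(a, "<->", b, round(score, 2))
```

This compares every pair, so the cost grows quadratically with the number of names. For a large dataset you would probably want to group candidates first (for example by first letter) before comparing.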