Auto-correct words from a list in Python

Question

I want to auto-correct the words which are in my list.

Say I have a list

kw = ['tiger','lion','elephant','black cat','dog']

I want to check whether these words appear in my sentences. If they are misspelled, I want to correct them. I don't intend to touch any words other than those in the given list.

Now I have a list of strings:

s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs"]

Expected output:

['tiger','lion',None,'dog']

My Efforts:

import difflib

op = [difflib.get_close_matches(i,kw,cutoff=0.5) for i in s]
print(op)

My Output:

[[], [], [], ['dog']]

The problem with the above code is that I want to compare against the entire sentence, and the entries in my kw list can contain more than one word (up to 4-5 words).

If I lower the cutoff value, it starts returning words that it should not.

Even if I create bigrams and trigrams from the given sentence, it would consume a lot of time.

Is there a way to implement this?

I have explored a few more libraries, such as autocorrect and hunspell, but with no success.

PascalVKooten · Accepted Answer

You could implement something based on Levenshtein distance.
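For readers who have not met the metric, here is a minimal pure-Python sketch of plain Levenshtein distance. The function name `levenshtein` is only illustrative, and note that this plain variant counts a swap of two adjacent letters as two edits, whereas the Damerau variant counts it as one.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (illustrative helper)."""
    # prev[j] is the distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


print(levenshtein("tyger", "tiger"))  # 1
print(levenshtein("lyons", "lion"))   # 2
```

So "tyger" is one edit away from "tiger", while "lyons" needs two edits to become "lion".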

It's interesting to note how Elasticsearch handles this: https://www.elastic.co/guide/en/elasticsearch/guide/master/fuzziness.html

Clearly, bieber is a long way from beaver—they are too far apart to be considered a simple misspelling. Damerau observed that 80% of human misspellings have an edit distance of 1. In other words, 80% of misspellings could be corrected with a single edit to the original string.

Elasticsearch supports a maximum edit distance, specified with the fuzziness parameter, of 2.

Of course, the impact that a single edit has on a string depends on the length of the string. Two edits to the word hat can produce mad, so allowing two edits on a string of length 3 is overkill. The fuzziness parameter can be set to AUTO, which results in the following maximum edit distances:

0 for strings of one or two characters

1 for strings of three, four, or five characters

2 for strings of more than five characters
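The length-based rule quoted above can be written as a small helper. This is only a sketch of the three thresholds as listed; the name `auto_fuzziness` is made up for illustration and is not an Elasticsearch API.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edit distance allowed for a term, following the
    length thresholds listed above (hypothetical helper)."""
    n = len(term)
    if n <= 2:
        return 0  # one or two characters: exact match only
    if n <= 5:
        return 1  # three to five characters: one edit
    return 2      # more than five characters: two edits


for term in ("at", "hat", "tyger", "bulldogs"):
    print(term, auto_fuzziness(term))
```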

I like to use pyxDamerauLevenshtein myself.

pip install pyxDamerauLevenshtein

So you could do a simple implementation like:

from pyxdameraulevenshtein import damerau_levenshtein_distance

keywords = ['tiger', 'lion', 'elephant', 'black cat', 'dog']


def correct_sentence(sentence):
    new_sentence = []
    for word in sentence.split():
        # Edit budget by word length, following the AUTO fuzziness rule above.
        n = len(word)
        if n < 3:
            budget = 0
        elif n < 6:
            budget = 1
        else:
            budget = 2
        if budget:
            for keyword in keywords:
                if damerau_levenshtein_distance(word, keyword) <= budget:
                    # Close enough: replace the word with the keyword.
                    new_sentence.append(keyword)
                    break
            else:
                # No keyword within budget: keep the word unchanged.
                new_sentence.append(word)
        else:
            # Too short to correct safely.
            new_sentence.append(word)
    return " ".join(new_sentence)

Just make sure you use a better tokenizer or this will get messy, but you get the point. Also note that this is unoptimized and will be really slow with a lot of keywords. You should implement some kind of bucketing to avoid comparing every word with every keyword.
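One possible form of bucketing, sketched under my own assumptions: group keywords by length, and only compare a word against keywords whose length differs by at most the budget, since the edit distance can never be smaller than the length difference. The helper names are hypothetical, and a plain Levenshtein function stands in for `damerau_levenshtein_distance` so the sketch runs without extra packages.

```python
from collections import defaultdict


def edit_distance(a, b):
    # Plain Levenshtein distance, used here as a stand-in for
    # damerau_levenshtein_distance (assumption: swap it back in if installed).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_buckets(keywords):
    # Map keyword length -> keywords of that length.
    buckets = defaultdict(list)
    for kw in keywords:
        buckets[len(kw)].append(kw)
    return buckets


def candidates(word, buckets, budget):
    # Only keywords whose length is within `budget` of the word can match.
    for length in range(len(word) - budget, len(word) + budget + 1):
        yield from buckets.get(length, ())


def correct_word(word, buckets):
    n = len(word)
    budget = 0 if n < 3 else 1 if n < 6 else 2
    for kw in candidates(word, buckets, budget) if budget else ():
        if edit_distance(word, kw) <= budget:
            return kw
    return word  # nothing close enough: leave the word alone


buckets = build_buckets(['tiger', 'lion', 'elephant', 'black cat', 'dog'])
print(correct_word("tyger", buckets))    # tiger
print(correct_word("elefant", buckets))  # elephant
```

With many keywords this skips most comparisons outright. Candidates are visited in order of length rather than in the original list order, which only matters if two keywords are both within budget.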

yatu · Answer

Here is one way using difflib.SequenceMatcher. The SequenceMatcher class lets you measure string similarity with its ratio method. You only need to provide a suitable threshold and keep the keywords whose ratio with a word in the sentence falls above it:

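from difflib import SequenceMatcher


def find_similar_word(s, kw, thr=0.5):
    out = []
    for sentence in s:
        match = None
        for word in sentence.split():
            # First keyword whose similarity to this word exceeds the threshold.
            match = next(
                (k for k in kw if SequenceMatcher(a=word, b=k).ratio() > thr),
                None,
            )
            if match is not None:
                break
        # None is appended when no word in the sentence matched any keyword.
        out.append(match)
    return out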

Output

find_similar_word(s, kw)
['tiger', 'lion', None, 'dog']
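Because the function compares single words, a multi-word keyword such as 'black cat' is never matched as a whole. A possible extension, sketched here as an assumption rather than as part of the original answer, compares each keyword against sliding windows containing the same number of words and returns the best-scoring keyword. The name `best_keyword` is hypothetical.

```python
from difflib import SequenceMatcher


def best_keyword(sentence, keywords, thr=0.5):
    """Return the keyword most similar to any word window of the
    sentence, or None if no ratio exceeds thr (illustrative sketch)."""
    words = sentence.split()
    best, best_ratio = None, thr
    for kw in keywords:
        n = len(kw.split())  # window size = number of words in the keyword
        for start in range(len(words) - n + 1):
            window = " ".join(words[start:start + n])
            ratio = SequenceMatcher(a=window, b=kw).ratio()
            if ratio > best_ratio:
                best, best_ratio = kw, ratio
    return best


kw = ['tiger', 'lion', 'elephant', 'black cat', 'dog']
print(best_keyword("I saw a blak cat", kw))  # black cat
print(best_keyword("I saw a tyger", kw))     # tiger
```

Only windows of the keyword's own word count are built, so the cost stays proportional to the number of keywords times the sentence length, rather than requiring every bigram and trigram up front.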

Tags:

python

python-3.x

difflib

autocorrect

Sociopath
