Python Fuzzy Matching (FuzzyWuzzy) - Keep only Best Match

Question

I'm trying to fuzzy match two csv files, each containing one column of names, that are similar but not the same.

My code so far is as follows:

import pandas as pd
from pandas import DataFrame
from fuzzywuzzy import process
import csv

save_file = open('fuzzy_match_results.csv', 'w')
writer = csv.writer(save_file, lineterminator = '
')

def parse_csv(path):

with open(path,'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        yield row


if __name__ == "__main__":
## Create lookup dictionary by parsing the products csv
data = {}
for row in parse_csv('names_1.csv'):
    data[row[0]] = row[0]

## For each row in the lookup compute the partial ratio
for row in parse_csv("names_2.csv"):
    #print(process.extract(row,data, limit = 100))
    for found, score, matchrow in process.extract(row, data, limit=100):
        if score >= 60:
            print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
            Digi_Results = [row, score, found]
            writer.writerow(Digi_Results)


save_file.close()

The output is as follows:

Name11 , 90 , Name25 
Name11 , 85 , Name24 
Name11 , 65 , Name29

The script works fine. The output is as expected. But what I am looking for is only the best match.

Name11 , 90 , Name25
Name12 , 95 , Name21
Name13 , 98 , Name22

So I need to somehow drop the duplicated names in column 1, based on the highest value in column 2. It should be fairly straightforward, but I can't seem to figure it out. Any help would be appreciated.

Anand S Kumar · Accepted Answer

fuzzywuzzy's process.extract() returns the list in reverse sorted order , with the best match coming first.

so to find just the best match, you can set the limit argument as 1 , so that it only returns the best match, and if that is greater than 60 , you can write it to the csv, like you are doing now.

Example -

from fuzzywuzzy import process
## For each row in the lookup compute the partial ratio
for row in parse_csv("names_2.csv"):

    for found, score, matchrow in process.extract(row, data, limit=1):
        if score >= 60:
            print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
            Digi_Results = [row, score, found]
            writer.writerow(Digi_Results)

Michael Ohlrogge · Answer

Several pieces of your code can be greatly simplified by using process.extractOne() from FuzzyWuzzy. Not only does it just return the top match, you can set a score threshold for it within the function call, rather than needing to perform a separate logical step, e.g.:

process.extractOne(row, data, score_cutoff = 60)

This function will return a tuple of the highest match plus the accompanying score if it finds a match satisfying the condition. It will return None otherwise.

Python Fuzzy Matching (FuzzyWuzzy) - Keep only Best Match

Tags:

python

string-matching

fuzzy-search

fuzzywuzzy

Kvothe

2 Answers

Anand S Kumar

Michael Ohlrogge

Recent Activity

Donate For Us

Python Fuzzy Matching (FuzzyWuzzy) - Keep only Best Match

Tags:

python

string-matching

fuzzy-search

fuzzywuzzy

Kvothe

2 Answers

Anand S Kumar

Michael Ohlrogge

Related questions

Recent Activity

Donate For Us