Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Fuzzy Matching (FuzzyWuzzy) - Keep only Best Match

I'm trying to fuzzy match two csv files, each containing one column of names, that are similar but not the same.

My code so far is as follows:

import pandas as pd
from pandas import DataFrame
from fuzzywuzzy import process
import csv

save_file = open('fuzzy_match_results.csv', 'w')
writer = csv.writer(save_file, lineterminator = '\n')

def parse_csv(path):

with open(path,'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        yield row


if __name__ == "__main__":
## Create lookup dictionary by parsing the products csv
data = {}
for row in parse_csv('names_1.csv'):
    data[row[0]] = row[0]

## For each row in the lookup compute the partial ratio
for row in parse_csv("names_2.csv"):
    #print(process.extract(row,data, limit = 100))
    for found, score, matchrow in process.extract(row, data, limit=100):
        if score >= 60:
            print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
            Digi_Results = [row, score, found]
            writer.writerow(Digi_Results)


save_file.close()

The output is as follows:

Name11 , 90 , Name25 
Name11 , 85 , Name24 
Name11 , 65 , Name29

The script works fine. The output is as expected. But what I am looking for is only the best match.

Name11 , 90 , Name25
Name12 , 95 , Name21
Name13 , 98 , Name22

So I need to somehow drop the duplicated names in column 1, based on the highest value in column 2. It should be fairly straightforward, but I can't seem to figure it out. Any help would be appreciated.

like image 468
Kvothe Avatar asked Aug 17 '15 16:08

Kvothe


2 Answers

fuzzywuzzy's process.extract() returns the list in reverse sorted order , with the best match coming first.

so to find just the best match, you can set the limit argument as 1 , so that it only returns the best match, and if that is greater than 60 , you can write it to the csv, like you are doing now.

Example -

from fuzzywuzzy import process
## For each row in the lookup compute the partial ratio
for row in parse_csv("names_2.csv"):

    for found, score, matchrow in process.extract(row, data, limit=1):
        if score >= 60:
            print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
            Digi_Results = [row, score, found]
            writer.writerow(Digi_Results)
like image 53
Anand S Kumar Avatar answered Sep 25 '22 22:09

Anand S Kumar


Several pieces of your code can be greatly simplified by using process.extractOne() from FuzzyWuzzy. Not only does it just return the top match, you can set a score threshold for it within the function call, rather than needing to perform a separate logical step, e.g.:

process.extractOne(row, data, score_cutoff = 60)

This function will return a tuple of the highest match plus the accompanying score if it finds a match satisfying the condition. It will return None otherwise.

like image 38
Michael Ohlrogge Avatar answered Sep 25 '22 22:09

Michael Ohlrogge