Multiprocessing fuzzy wuzzy string search - python

I am trying to do string matching and retrieve the matching ID using fuzzywuzzy in Python. My dataset is huge: dataset1 = 1.8 million records, dataset2 = 1.6 million records.

What I have tried so far:

First I tried the recordlinkage package in Python. Unfortunately it ran out of memory while building the multi-index, so I moved to a more powerful AWS machine and built the index successfully. However, when I ran the comparison, it took forever; I agree that this is due to the sheer number of comparisons.

Then I tried to do string matching with fuzzywuzzy and parallelise the process using the dask package, and executed it on sample data. It works fine, but I know the process will still take time as the search space is wide. I am looking for a way to add blocking or indexing to this piece of code.

import pandas as pd

test = pd.DataFrame({'Address1': ['123 Cheese Way', '234 Cookie Place', '345 Pizza Drive', '456 Pretzel Junction'],
                     'city': ['X', 'U', 'X', 'U']})
test2 = pd.DataFrame({'Address1': ['123 chese wy', '234 kookie Pl', '345 Pizzza DR', '456 Pretzel Junktion'],
                      'city': ['X', 'U', 'Z', 'Y'],
                      'ID': ['1', '3', '4', '8']})

Here, I am trying to look for test.Address1 in test2.Address1 and bring its ID.

import dask
import dask.dataframe as dd
from fuzzywuzzy import fuzz

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    slave_df['score'] = slave_df.Address1.apply(lambda x: fuzzy_score(x, orig_string))
    # return the ID corresponding to the highest score
    return slave_df.loc[slave_df.score.idxmax(), 'ID']

dmaster = dd.from_pandas(test, npartitions=24)
dmaster = dmaster.assign(ID_there=dmaster.Address1.apply(lambda x: helper(x, test2)))
dmaster.compute(scheduler='processes')

This works fine, but I am not sure how to apply indexing to it so that the search space is limited to the same city.

Let's say I create an index on the city field, subset based on the city of the original string, and pass that city to the helper function:

# sort the dataframe
test2.sort_values(by=['city'], inplace=True)
# set the index to be this and don't drop
test2.set_index(keys=['city'], drop=False,inplace=True)

I don't know how to do that. Please advise. Thanks in advance.

asked Sep 02 '25 by ds_user
1 Answer

I prefer using fuzzywuzzy.process.extractOne. It compares a string to an iterable of strings.

import pandas as pd
from fuzzywuzzy import process

def extract_one(col, other):
    # need this for dask later
    other = other.compute() if hasattr(other, 'compute') else other
    return pd.DataFrame([process.extractOne(x, other) for x in col],
                        columns=['Address1', 'score', 'idx'],
                        index=col.index)

extract_one(test.Address1, test2.Address1)

               Address1  score  idx
0          123 chese wy     92    0
1         234 kookie Pl     83    1
2         345 Pizzza DR     86    2
3  456 Pretzel Junktion     95    3

The idx is the index of the other passed to extract_one that matches most closely. I would recommend having a meaningful index, to make joining the results easier later on.
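For instance, the idx column can be joined back to test2 to recover the ID the question asks for. A minimal sketch, where the matches frame mimics the shape of the output above (values copied from the example, not recomputed):

```python
import pandas as pd

test2 = pd.DataFrame({'Address1': ['123 chese wy', '234 kookie Pl',
                                   '345 Pizzza DR', '456 Pretzel Junktion'],
                      'city': ['X', 'U', 'Z', 'Y'],
                      'ID': ['1', '3', '4', '8']})

# Hypothetical: a couple of rows shaped like extract_one's output,
# where 'idx' points into test2's (default, positional) index.
matches = pd.DataFrame({'Address1': ['123 chese wy', '234 kookie Pl'],
                        'score': [92, 83],
                        'idx': [0, 1]})

# Look up each idx in test2 to pull in the matched ID.
result = matches.join(test2['ID'], on='idx')
print(result['ID'].tolist())  # → ['1', '3']
```

If test2 carried a meaningful index instead of the default 0..n, the same join works unchanged, which is the point of the recommendation above.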

For your second question, about filtering to cities, I would use a groupby and apply:

gr1 = test.groupby('city')
gr2 = test2.groupby("city")

gr1.apply(lambda x: extract_one(x.Address1,
                                gr2.get_group(x.name).Address1))

               Address1  score  idx
0          123 chese wy     92    0
1         234 kookie Pl     83    1
2         345 Pizzza DR     86    2
3  456 Pretzel Junktion     95    3

The only difference with dask is the need to specify a meta to the apply:

ddf1 = dd.from_pandas(test, 2)
ddf2 = dd.from_pandas(test2, 2)

dgr1 = ddf1.groupby('city')
dgr2 = ddf2.groupby('city')

meta = pd.DataFrame(columns=['Address1', 'score', 'idx'])
dgr1.apply(lambda x: extract_one(x.Address1,
                                 dgr2.get_group(x.name).Address1),
           meta=meta).compute()

             Address1  score  idx
city                             
U    0  234 kookie Pl     83    1
     1  234 kookie Pl     28    1
X    0   123 chese wy     92    0
     1   123 chese wy     28    0

Here's a notebook: https://gist.github.com/a932b3591346b898d6816a5efc2bc5ad

I'm curious to hear how the performance is. I assume the actual string comparison done in fuzzywuzzy will take the bulk of the time, but I'd love to hear back on how much overhead is spent in pandas and dask. Make sure you have the C extensions for computing the Levenshtein distance (the python-Levenshtein package); without them fuzzywuzzy falls back to a much slower pure-Python implementation.
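For context on why the C extension matters: token_set_ratio is built on Levenshtein edit distance, and the pure-Python fallback is essentially the classic dynamic-programming routine sketched below. This is an illustrative sketch of the hot loop the C extension replaces, not fuzzywuzzy's actual implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance: minimum insert/delete/substitute
    operations needed to turn a into b. O(len(a) * len(b)) time."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein('kitten', 'sitting'))  # → 3
```

Running this inner loop millions of times in pure Python is exactly the overhead the C extension avoids, which is why it matters at the question's scale (1.8M x 1.6M candidate pairs before blocking).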

answered Sep 05 '25 by TomAugspurger