Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

"Close Enough" Matching of Rows from Two Pandas Data Frames Comparing Multiple Columns

I have two data frames, with two columns in common. I want to flag records in the 1st data frame that have a two-column match in the 2nd, where the second match is "fuzzy". Neither column is a unique key; values can be repeated in both of them. Imagine, for example, my matching columns are a city-column and a date-column.

Input dataframe 1 (df1):

city date
New York 1/1/2024
New York 1/5/2024
Chicago 1/2/2024
Chicago 1/5/2024
Houston 1/3/2024

Input dataframe 2 (df2):

city date
New York 1/2/2024
Chicago 1/5/2024

My desired output is to flag as True every record in df1 that has a record in df2 that exactly matches the city AND matches within one day of the date, with all other records flagged as False.

Output dataframe 1:

city date paired
New York 1/1/2024 True
New York 1/5/2024 False
Chicago 1/2/2024 False
Chicago 1/5/2024 True
Houston 1/3/2024 False

If I was seeking exact matches on both columns, the problem would be very straightforward: I would zip the two columns to be used for matching together, and apply an isin test inside of a where method:

df1['zipcol'] = list(zip(df1.city, df1.date))
df2['zipcol'] = list(zip(df2.city, df2.date))
df1['paired'] = np.where(df1['zipcol'].isin(df2['zipcol']), True, False)
#some overhead to drop the zipped columns

What is the most efficient way to perform this type of comparison, except with one (or more) criteria that is not exact, but defined in terms of the values from the 2nd data frame being within a range of values centered on values from the 1st?

Final note: I am not worried about the possibility of duplicate matches here. If one-to-many matches do occur, they can be handled in a post-processing step later on.

like image 545
NaiveBayesian Avatar asked Oct 20 '25 11:10

NaiveBayesian


1 Answers

You need a merge_asof with direction='nearest' and 1 day of tolerance.

The important points are:

  • convert the dates to_datetime
  • maintain the index with reset_index (a merge doesn't maintain the index)
  • sort the dates with sort_values
  • add a temporary column to df2 (here with value 1) to act as flag for values present in df2.
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

df1['paired'] = (
    pd.merge_asof(
        df1.reset_index().sort_values(by='date'),
        df2.sort_values(by='date').assign(flag=1),
        on='date',
        by='city',
        direction='nearest',
        tolerance=pd.Timedelta('1d')
    )
    .set_index('index')['flag']
    .eq(1)
)

Output:

       city       date  paired
0  New York 2024-01-01    True
1  New York 2024-01-05   False
2   Chicago 2024-01-02   False
3   Chicago 2024-01-05    True
4   Houston 2024-01-03   False
like image 150
mozway Avatar answered Oct 22 '25 00:10

mozway