I have two data frames with two columns in common. I want to flag records in the 1st data frame that have a two-column match in the 2nd, where the match on the second column is "fuzzy". Neither column is a unique key; values can be repeated in both of them. Imagine, for example, my matching columns are a city column and a date column.
Input dataframe 1 (`df1`):

| city | date |
|---|---|
| New York | 1/1/2024 |
| New York | 1/5/2024 |
| Chicago | 1/2/2024 |
| Chicago | 1/5/2024 |
| Houston | 1/3/2024 |
Input dataframe 2 (`df2`):

| city | date |
|---|---|
| New York | 1/2/2024 |
| Chicago | 1/5/2024 |
My desired output is to flag as `True` every record in `df1` that has a record in `df2` that exactly matches the city AND matches within one day of the date, with all other records flagged as `False`.
Output dataframe 1:

| city | date | paired |
|---|---|---|
| New York | 1/1/2024 | True |
| New York | 1/5/2024 | False |
| Chicago | 1/2/2024 | False |
| Chicago | 1/5/2024 | True |
| Houston | 1/3/2024 | False |
If I were seeking exact matches on both columns, the problem would be very straightforward: I would zip the two matching columns together and apply an `isin` test inside a `where` call:
```python
import numpy as np
import pandas as pd

# Pair the two matching columns into a single tuple-valued column
df1['zipcol'] = list(zip(df1.city, df1.date))
df2['zipcol'] = list(zip(df2.city, df2.date))
df1['paired'] = np.where(df1['zipcol'].isin(df2['zipcol']), True, False)
# ...plus some overhead to drop the zipped columns afterwards
```
What is the most efficient way to perform this type of comparison, except with one (or more) criteria that is not exact, but defined in terms of the values from the 2nd data frame being within a range of values centered on values from the 1st?
Final note: I am not worried about the possibility of duplicate matches here. If one-to-many matches do occur, they can be handled in a post-processing step later on.
You need a `merge_asof` with `direction='nearest'` and 1 day of tolerance.
The important points are:

- convert the dates with `to_datetime`;
- `reset_index` before the merge and restore it afterwards (a merge doesn't maintain the index);
- `sort_values` by date on both frames (`merge_asof` requires sorted keys);
- assign an indicator column to `df2` (here `flag`, with value `1`) to act as a flag for values present in `df2`.

```python
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

df1['paired'] = (
    pd.merge_asof(
        df1.reset_index().sort_values(by='date'),
        df2.sort_values(by='date').assign(flag=1),
        on='date',
        by='city',
        direction='nearest',
        tolerance=pd.Timedelta('1d')
    )
    .set_index('index')['flag']
    .eq(1)
)
```
Output:

```
       city       date  paired
0  New York 2024-01-01    True
1  New York 2024-01-05   False
2   Chicago 2024-01-02   False
3   Chicago 2024-01-05    True
4   Houston 2024-01-03   False
```
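For comparison, here is a brute-force sketch (not from the answer above) that gets the same flag with a plain left merge on `city` followed by a date-difference filter. It checks every `df1`/`df2` pair within a city, so it is O(n·m) per city and only sensible for small frames, but it is easy to reason about and makes a good cross-check of the `merge_asof` result:

```python
import pandas as pd

df1 = pd.DataFrame({
    'city': ['New York', 'New York', 'Chicago', 'Chicago', 'Houston'],
    'date': pd.to_datetime(['1/1/2024', '1/5/2024', '1/2/2024', '1/5/2024', '1/3/2024']),
})
df2 = pd.DataFrame({
    'city': ['New York', 'Chicago'],
    'date': pd.to_datetime(['1/2/2024', '1/5/2024']),
})

# Cross every df1 row with every df2 row sharing the same city;
# unmatched cities (Houston) get NaT in the df2 date column.
merged = df1.reset_index().merge(df2, on='city', how='left', suffixes=('', '_2'))

# Within tolerance? NaT differences compare as False, so unmatched rows stay False.
close = (merged['date'] - merged['date_2']).abs() <= pd.Timedelta('1d')

# A df1 row is paired if ANY of its candidate df2 rows is close enough.
df1['paired'] = close.groupby(merged['index']).any()
print(df1)
```

For the boolean flag the two approaches agree: if any `df2` row lies within the tolerance, the nearest one does too, so `merge_asof` with `direction='nearest'` finds it.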