I have two data frames with two columns in common. I want to flag records in the 1st data frame that have a two-column match in the 2nd, where the match on the second column is "fuzzy". Neither column is a unique key; values can be repeated in both of them. Imagine, for example, my matching columns are a city column and a date column.
Input dataframe 1 (`df1`):

| city | date |
|---|---|
| New York | 1/1/2024 |
| New York | 1/5/2024 |
| Chicago | 1/2/2024 |
| Chicago | 1/5/2024 |
| Houston | 1/3/2024 |
Input dataframe 2 (`df2`):

| city | date |
|---|---|
| New York | 1/2/2024 |
| Chicago | 1/5/2024 |
My desired output is to flag as `True` every record in `df1` that has a record in `df2` that exactly matches the city AND matches within one day of the date, with all other records flagged as `False`.
Output dataframe 1:

| city | date | paired |
|---|---|---|
| New York | 1/1/2024 | True |
| New York | 1/5/2024 | False |
| Chicago | 1/2/2024 | False |
| Chicago | 1/5/2024 | True |
| Houston | 1/3/2024 | False |
If I were seeking exact matches on both columns, the problem would be very straightforward: I would zip the two matching columns together and apply an `isin` test inside a `where` call:
```python
import numpy as np
import pandas as pd

# Pair the two matching columns into a single tuple-valued column
df1['zipcol'] = list(zip(df1.city, df1.date))
df2['zipcol'] = list(zip(df2.city, df2.date))
df1['paired'] = np.where(df1['zipcol'].isin(df2['zipcol']), True, False)
# ...plus some overhead to drop the zipped columns afterwards
```
What is the most efficient way to perform this type of comparison, except with one (or more) criteria that is not exact, but defined in terms of the values from the 2nd data frame being within a range of values centered on values from the 1st?
Final note: I am not worried about the possibility of duplicate matches here. If one-to-many matches do occur, they can be handled in a post-processing step later on.
You need a `merge_asof` with `direction='nearest'` and 1 day of tolerance.
The important points are:

- convert the dates with `to_datetime`;
- `reset_index` before the merge and restore it afterwards (a merge doesn't maintain the index);
- `sort_values` by date on both frames (`merge_asof` requires sorted keys);
- assign an indicator column to `df2` (here `flag`, with value `1`) to act as a flag for values present in `df2`.

```python
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

df1['paired'] = (
    pd.merge_asof(
        df1.reset_index().sort_values(by='date'),
        df2.sort_values(by='date').assign(flag=1),
        on='date',
        by='city',
        direction='nearest',
        tolerance=pd.Timedelta('1d')
    )
    .set_index('index')['flag']
    .eq(1)
)
```
Output:

```
       city       date  paired
0  New York 2024-01-01    True
1  New York 2024-01-05   False
2   Chicago 2024-01-02   False
3   Chicago 2024-01-05    True
4   Houston 2024-01-03   False
```
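For comparison, here is a brute-force sketch (not from the answer above) that gets the same flag with a plain left merge on `city` followed by a date-difference filter. It checks every `df1`/`df2` pair within a city, so it is O(n·m) per city and only sensible for small frames, but it is easy to reason about and makes a good cross-check of the `merge_asof` result:

```python
import pandas as pd

df1 = pd.DataFrame({
    'city': ['New York', 'New York', 'Chicago', 'Chicago', 'Houston'],
    'date': pd.to_datetime(['1/1/2024', '1/5/2024', '1/2/2024', '1/5/2024', '1/3/2024']),
})
df2 = pd.DataFrame({
    'city': ['New York', 'Chicago'],
    'date': pd.to_datetime(['1/2/2024', '1/5/2024']),
})

# Cross every df1 row with every df2 row sharing the same city;
# unmatched cities (Houston) get NaT in the df2 date column.
merged = df1.reset_index().merge(df2, on='city', how='left', suffixes=('', '_2'))

# Within tolerance? NaT differences compare as False, so unmatched rows stay False.
close = (merged['date'] - merged['date_2']).abs() <= pd.Timedelta('1d')

# A df1 row is paired if ANY of its candidate df2 rows is close enough.
df1['paired'] = close.groupby(merged['index']).any()
print(df1)
```

For the boolean flag the two approaches agree: if any `df2` row lies within the tolerance, the nearest one does too, so `merge_asof` with `direction='nearest'` finds it.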