Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to join two dataframes when only some dates in one dataframe is present between two other dates in other dataframe?

I have two dataframes:

df_A = pd.DataFrame({'start_date':['2021-04-01 00:00:00','2021-05-01 00:00:00', '2021-05-02 00:00:00'],'end_date':['2021-04-01 00:11:00','2021-05-01 00:06:00', '2021-05-03 00:00:00'], 'eventname':['birthday', 'wedding', 'birthday'] })
df_B = pd.DataFrame({'event_date':['2021-04-01 00:06:00','2021-05-01 00:03:00', '2021-05-04 00:00:00'],'price':[100,200,500]})
df_A['end_date'] = pd.to_datetime(df_A.end_date)
df_A['start_date'] = pd.to_datetime(df_A.start_date)
df_B['event_date'] = pd.to_datetime(df_B.event_date)

df_A

    start_date  end_date    eventname
0   2021-04-01  2021-04-01 00:11:00 birthday
1   2021-05-01  2021-05-01 00:06:00 wedding
2   2021-05-02  2021-05-03 00:00:00 birthday

df_B

event_date  price
0   2021-04-01 00:06:00 100
1   2021-05-01 00:03:00 200
2   2021-05-04 00:00:00 500

I want to join them such that event_date column lies in between the start_date and end_date column of df_A. This is my code:

df_A.index = pd.IntervalIndex.from_arrays(df_A['start_date'],df_A['end_date'],closed='both')
df_B['start_date'] = df_B['event_date'].apply(lambda x : df_A.iloc[df_A.index.get_loc(x)]['start_date'])
df_B

However, I get a keyerror as some of values on event_date column of df_B do not lie between any of the start_date and end_date of columns in df_A.

THis is my expected output:

event_date  price   start_date
0   2021-04-01 00:06:00 100 2021-04-01
1   2021-05-01 00:03:00 200 2021-05-01
2   2021-05-04 00:00:00 500 NaN

I tried using intersection but that works only if the values are exactly equal. How do I do this?

like image 818
user42 Avatar asked Jan 24 '26 09:01

user42


2 Answers

If your start_date and end_date do not overlap, create an interval index and merge your two dataframes:

bins = pd.IntervalIndex.from_arrays(df_A['start_date'], 
                                    df_A['end_date'],
                                    closed='both')

out = df_B.assign(interval=pd.cut(df_B['event_date'], bins)) \
          .merge(df_A.assign(interval=bins), on='interval', how='left')

print(out[['event_date', 'price', 'start_date']])

# Output:
           event_date  price start_date
0 2021-04-01 00:06:00    100 2021-04-01
1 2021-05-01 00:03:00    200 2021-05-01
2 2021-05-04 00:00:00    500        NaT
like image 194
Corralien Avatar answered Jan 26 '26 21:01

Corralien


If you can avoid a non-equi join, then do so; use a bucketing technique, similar to @Corralien's excellent answer, where the bins do not overlap (faster/more efficient). Another option, if possible, is to find a way to introduce merge joins, and maybe union your final output -> the steps may be longer, however, I feel if done right, it would be more efficient than a non-equi join. If however, you care only for the first match, then pd.merge_asof should suffice.

Back to the question :

One option is via the conditional_join from pyjanitor :

# pip install pyjanitor
import pandas as pd
import janitor

(df_A.conditional_join(
       df_B,  
       # column from left frame, column from right frame, comparision operator
       ('start_date', 'event_date', '<='), 
       ('end_date', 'event_date', '>='), 
       how = 'right')
   .filter(['event_date', 'price', 'start_date'])
)

           event_date  price start_date
0 2021-04-01 00:06:00    100 2021-04-01
1 2021-05-01 00:03:00    200 2021-05-01
2 2021-05-04 00:00:00    500        NaT

What it does is compare columns from df_A with df_B:

('start_date', 'event_date', '<=') means check if start_date from df_A is <= event_date from df_B. same concept applies to the other conditions. The parameter here is variable args, meaning you can pass as many conditions as possible. The columns must be same type (float vs float, integer vs integer, date vs date, ...).

Under the hood, it uses binary search for non-equi joins (primarily to avoid a cartesian join, which can be inefficient memory wise as the dataframe size grows); for range joins such as in this case, it uses a simple search to reduce the search space to get the matching rows.

The function can handle overlapping indices, as well as scenarios where the range join involves two different columns -> a < b and c > d.

You can have a look at more examples here

like image 40
sammywemmy Avatar answered Jan 26 '26 22:01

sammywemmy