Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas: join dataframes based on time interval

I have a data frame with a datetime column every 10 minutes and a numerical value:

df1 = pd.DataFrame({'time' : pd.date_range('1/1/2018', periods=20, freq='10min'), 'value' : np.random.randint(2, 20, size=20)})

And another with a schedule of events, with a start time and end time. There can be multiple events happening at the same time:

df2 = pd.DataFrame({'start_time' : ['2018-01-01 00:00:00', '2018-01-01 00:00:00','2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00' ], 'end_time' : ['2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00','2018-01-01 02:00:00', '2018-01-01 02:00:00', '2018-01-01 03:00:00'], 'event' : ['A', 'B', 'C', 'D', 'E', 'F'] })
df2[['start_time', 'end_time']] = df2.iloc[:,0:2].apply(pd.to_datetime)

I want to do a left join on df1, with all events that fall inside the start and end times. My output table should be:

                  time  value event
0  2018-01-01 00:00:00      5     A
1  2018-01-01 00:00:00      5     B
2  2018-01-01 00:10:00     15     A
3  2018-01-01 00:10:00     15     B
4  2018-01-01 00:20:00     16     A
5  2018-01-01 00:20:00     16     B
.....
17 2018-01-01 02:50:00      7     F

I attempted these SO solutions, but they fail because of duplicate time intervals.

like image 900
robroc Avatar asked Sep 18 '25 16:09

robroc


1 Answers

Setup (Only using a few entries from df1 for brevity):

df1 = pd.DataFrame({'time' : pd.date_range('1/1/2018', periods=20, freq='10min'), 'value' : np.random.randint(2, 20, size=20)})
df2 = pd.DataFrame({'start_time' : ['2018-01-01 00:00:00', '2018-01-01 00:00:00','2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00' ], 'end_time' : ['2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00','2018-01-01 02:00:00', '2018-01-01 02:00:00', '2018-01-01 03:00:00'], 'event' : ['A', 'B', 'C', 'D', 'E', 'F'] })

df1 = df1.sample(5)
df2[['start_time', 'end_time']] = df2.iloc[:,0:2].apply(pd.to_datetime)

You can use a couple of straightfoward list comprehensions to achieve your result. This answer assumes that all date columns are in fact, of type datetime in your DataFrame:

Step 1
Find all events that occur within a particular time range using a list comprehension and simple interval checking:

packed = list(zip(df2.start_time, df2.end_time, df2.event))
df1['event'] = [[ev for strt, end, ev in packed if strt <= el <= end] for el in df1.time]

                  time  value      event
2  2018-01-01 00:20:00      8     [A, B]
14 2018-01-01 02:20:00     14        [F]
8  2018-01-01 01:20:00      6  [C, D, E]
19 2018-01-01 03:10:00     16         []
4  2018-01-01 00:40:00      7     [A, B]

Step 2:

Finally, explode each list from the last result to a new row using another list comprehension:

pd.DataFrame(
    [[t, val, e] for t, val, event in zip(df1.time, df1.value, df1.event)
    for e in event
    ], columns=df1.columns
)

Output:

                 time  value event
0 2018-01-01 00:20:00      8     A
1 2018-01-01 00:20:00      8     B
2 2018-01-01 02:20:00     14     F
3 2018-01-01 01:20:00      6     C
4 2018-01-01 01:20:00      6     D
5 2018-01-01 01:20:00      6     E
6 2018-01-01 00:40:00      7     A
7 2018-01-01 00:40:00      7     B
like image 95
user3483203 Avatar answered Sep 20 '25 06:09

user3483203