I have a dataframe df with two columns: an id and a date. The id is a number from 1 to 3 and is not unique; the date is a datetime object.
id, date
1, 2020-5-11
1, 2019-3-2
2, 2018-7-29
3, 2017-2-21
3, 2017-2-23
There also exists a corresponding list which contains the id with a linked start and end date:
id, startdate, enddate
1, 2015-1-1, 2020-1-1
2, 2019-2-1, 2020-2-28
3, 2017-2-1, 2020-2-28
I wish to filter df so that, for each id, any row whose date falls outside the startdate-to-enddate range specified in the above list is excluded.
I can do it per id with the code below, but I do not know how to handle all the ids at once.
for each, startdate, enddate in ids:
    mask = (df['date'].dt.date >= startdate) & (df['date'].dt.date <= enddate)
    df[df['id'] == each].loc[mask]
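(For reference, a complete version of this per-id loop that also collects the filtered pieces might look like the following sketch, assuming ids is a list of (id, startdate, enddate) tuples and that startdate and enddate are datetime.date objects.)

import pandas as pd

pieces = []
for each, startdate, enddate in ids:
    # rows for this id whose date falls inside [startdate, enddate]
    in_range = (df['date'].dt.date >= startdate) & (df['date'].dt.date <= enddate)
    pieces.append(df[(df['id'] == each) & in_range])

filtered = pd.concat(pieces)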
The desired output for the dummy example would be a dataframe containing the following (i.e. excluding all rows for which the date is outside the date range specified for each id):
id, date
1, 2019-3-2
3, 2017-2-21
3, 2017-2-23
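In the answer below, the main dataframe is called df1 and the id/start/end list is assumed to live in a second dataframe df2. A minimal sketch of the setup for the dummy data, with the dates parsed by pd.to_datetime:

import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 1, 2, 3, 3],
    'date': pd.to_datetime(['2020-5-11', '2019-3-2', '2018-7-29',
                            '2017-2-21', '2017-2-23']),
})

df2 = pd.DataFrame({
    'id': [1, 2, 3],
    'startdate': pd.to_datetime(['2015-1-1', '2019-2-1', '2017-2-1']),
    'enddate': pd.to_datetime(['2020-1-1', '2020-2-28', '2020-2-28']),
})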
A straightforward and efficient approach is to use map to look up the start/end date for each value in df1['id'], then perform boolean indexing with between:
# look up each row's start/end date via its id, then keep rows whose date lies in that range
tmp = df2.set_index('id')
out = df1[df1['date'].between(df1['id'].map(tmp['startdate']),
                              df1['id'].map(tmp['enddate']))]
Output:
id date
1 1 2019-03-02
3 3 2017-02-21
4 3 2017-02-23
Timings:

[benchmark plot comparing the approaches omitted]
NB. the merge approach (a sketch is given at the end of this answer) only works if df1 has a range index ([0, 1, 2, ...]).
The timings seem to be independent of the size of df2. Here with 100k rows for df1 and between 1 and 100 rows for df2:

[benchmark plot omitted]
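For completeness, the merge approach mentioned in the NB above could be sketched as follows (a reconstruction, not the benchmarked code): broadcast each id's start/end date onto df1 by merging on id, build the boolean mask from the merged frame, then index back into df1. Because merge resets the index, the mask only lines up with df1 when df1 has the default range index, and this assumes each id appears at most once in df2.

# merge the ranges onto df1; the result gets a fresh 0..n-1 index
merged = df1.merge(df2, on='id', how='left')
mask = merged['date'].between(merged['startdate'], merged['enddate'])
# boolean indexing aligns on the index, hence the range-index requirement on df1
out = df1[mask]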