Say we have a dataframe df with a list of dates in chronological order, one per day.
The goal is to count, for each date in df, the number of people in dg whose date range includes that date.
df = pd.DataFrame(data={'date': [datetime.date(2007, 12, 1),
                                 datetime.date(2007, 12, 2),
                                 datetime.date(2007, 12, 3)],
                        'num_people_on_day': [0, 0, 0]})
dg = pd.DataFrame(data={'person': ['Alice', 'Bob', 'Chuck'],
                        'start': [datetime.date(2007, 11, 5),
                                  datetime.date(2007, 12, 8),
                                  datetime.date(2007, 1, 5)],
                        'end': [datetime.date(2007, 12, 6),
                                datetime.date(2008, 1, 3),
                                datetime.date(2007, 11, 30)]})

So for each date in df, how can I efficiently check all of dg, count the matching rows, and write that count into df?
I am not even sure a merge is necessary here (I am also trying to save memory), and I would like this to run as fast as possible.
EDIT: I have come up with a different way to do this, but I hate using apply. Is there a way to do it without .apply?
import pandas as pd
import datetime
df = pd.DataFrame(data={'date': [datetime.date(2007, 12, 1),
                                 datetime.date(2007, 12, 2),
                                 datetime.date(2007, 12, 3)]})
dg = pd.DataFrame(data={'person': ['Alice', 'Bob', 'Chuck', 'Dave'],
                        'start': [datetime.date(2007, 11, 5),
                                  datetime.date(2007, 12, 8),
                                  datetime.date(2007, 1, 5),
                                  datetime.date(2007, 11, 6)],
                        'end': [datetime.date(2007, 12, 1),
                                datetime.date(2008, 1, 3),
                                datetime.date(2007, 11, 30),
                                datetime.date(2007, 12, 2)]})
def get_num_persons(date, vec_starts, vec_ends):
    """
    Helper function for .apply to get the number of persons.
    A person counts for the given date when
    vec_starts <= date <= vec_ends; the elementwise AND
    marks those rows True, and .sum() counts the True values.
    """
    return ((vec_starts <= date) & (vec_ends >= date)).sum()
def num_of_persons(persons, dates_df):
    """
    Obtains the number of persons for each day.
    """
    dates_df['num_persons'] = dates_df['date'].apply(
        lambda row: get_num_persons(row, persons['start'], persons['end']))
    return dates_df

num_of_persons(dg, df.copy())
With sufficient memory, cross-merge the two frames, then count the rows whose date falls between start and end. The .reindex ensures dates with no matches get a 0.
#df['date'] = pd.to_datetime(df.date)
#dg['start'] = pd.to_datetime(dg.start)
#dg['end'] = pd.to_datetime(dg.end)
m = df[['date']].assign(k=1).merge(dg.assign(k=1))
(m[m.date.between(m.start, m.end)].groupby('date').size()
.reindex(df.date).fillna(0)
.rename('num_people_on_day').reset_index())
date num_people_on_day
0 2007-12-01 1
1 2007-12-02 1
2 2007-12-03 1
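A third route, shown here only as a sketch: the same between-check can be vectorized with a NumPy broadcast, which avoids both .apply and the merge's intermediate frame (though the len(df) × len(dg) boolean mask still costs memory, just like the merge). Using the question's second dg:

```python
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': [datetime.date(2007, 12, 1),
                            datetime.date(2007, 12, 2),
                            datetime.date(2007, 12, 3)]})
dg = pd.DataFrame({'person': ['Alice', 'Bob', 'Chuck', 'Dave'],
                   'start': [datetime.date(2007, 11, 5),
                             datetime.date(2007, 12, 8),
                             datetime.date(2007, 1, 5),
                             datetime.date(2007, 11, 6)],
                   'end': [datetime.date(2007, 12, 1),
                           datetime.date(2008, 1, 3),
                           datetime.date(2007, 11, 30),
                           datetime.date(2007, 12, 2)]})

# Convert to datetime64 so the comparisons are fast array ops.
dates = pd.to_datetime(df['date']).to_numpy()
starts = pd.to_datetime(dg['start']).to_numpy()
ends = pd.to_datetime(dg['end']).to_numpy()

# dates[:, None] has shape (len(df), 1), so each comparison
# broadcasts to a (len(df), len(dg)) mask; summing over axis 1
# counts, per date, the people whose range contains it.
mask = (starts <= dates[:, None]) & (ends >= dates[:, None])
df['num_people_on_day'] = mask.sum(axis=1)
```

This is the same logic as get_num_persons in the question, just applied to all dates at once instead of one date per .apply call.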
The other option is to use apply with an IntervalIndex. This is effectively a Python-level loop, so performance suffers as df grows.
def get_len(x, dg):
    try:
        return len(dg.iloc[dg.index.get_loc(x)])
    except KeyError:  # deal with dates that match 0 intervals
        return 0
dg.index = pd.IntervalIndex.from_arrays(dg['start'], dg['end'], closed='both')
df['num_people_on_day'] = df['date'].apply(get_len, dg=dg)
To illustrate the timings, look at your small set first, then at a much larger df.
%%timeit
m = df[['date']].assign(k=1).merge(dg.assign(k=1))
(m[m.date.between(m.start, m.end)].groupby('date').size()
.reindex(df.date).fillna(0)
.rename('num_people_on_day').reset_index())
#9.39 ms ± 52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
dg.index = pd.IntervalIndex.from_arrays(dg['start'], dg['end'], closed='both')
df['num_people_on_day'] = df['date'].apply(get_len, dg=dg)
#4.06 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But once df is longer (even just 90 rows), the difference becomes apparent.
df = pd.DataFrame({'date': pd.date_range('2007-01-01', '2007-03-31')})
%%timeit  # merge
#9.78 ms ± 75.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit  # IntervalIndex
#65.5 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)