
Python: fast aggregation of many observations to daily sum

Tags:

python

pandas

I have observations with start and end date of the following format:

import pandas as pd

data = pd.DataFrame({
    'start_date':pd.to_datetime(['2021-01-07','2021-01-04','2021-01-12','2021-01-03']),
    'end_date':pd.to_datetime(['2021-01-16','2021-01-12','2021-01-13','2021-01-15']),
    'value':[7,6,5,4]
    })

data

    start_date  end_date    value
0   2021-01-07  2021-01-16  7
1   2021-01-04  2021-01-12  6
2   2021-01-12  2021-01-13  5
3   2021-01-03  2021-01-15  4

The date ranges of different observations overlap. I would like to compute the daily sum aggregated across all observations.

My version with a loop (below) is slow and crashes for ~100k observations. What would be a way to speed things up?

def turn_data_into_date_range(row):
    # Expand one observation into a daily series covering its date range.
    dates = pd.date_range(start=row.start_date, end=row.end_date)
    return pd.Series(data=row.value, index=dates)

out = []
for index, row in data.iterrows():
    out.append(turn_data_into_date_range(row))

# Align all the daily series on their dates and sum across observations.
result = pd.concat(out, axis=1).sum(axis=1)

result
2021-01-03     4.0
2021-01-04    10.0
2021-01-05    10.0
2021-01-06    10.0
2021-01-07    17.0
2021-01-08    17.0
2021-01-09    17.0
2021-01-10    17.0
2021-01-11    17.0
2021-01-12    22.0
2021-01-13    16.0
2021-01-14    11.0
2021-01-15    11.0
2021-01-16     7.0
Freq: D, dtype: float64

PS: the answer to this related question, Convert Date Ranges to Time Series in Pandas, doesn't work in my case: its observations are non-overlapping, so it can use a left join.

asked Oct 18 '25 by curious_know_nothing


1 Answer

I feel this problem comes up regularly, as it's not an easy thing to do. Most approaches transform each row into a date range or otherwise iterate over rows. In this case there's a smarter workaround: use cumulative sums, then reindex.

>>> starts = data.set_index('start_date')['value'].sort_index().cumsum()
>>> starts
start_date
2021-01-03     4
2021-01-04    10
2021-01-07    17
2021-01-12    22
Name: value, dtype: int64
>>> ends = data.set_index('end_date')['value'].sort_index().cumsum()
>>> ends
end_date
2021-01-12     6
2021-01-13    11
2021-01-15    15
2021-01-16    22
Name: value, dtype: int64

In case your dates are not unique, you can group by date and sum first. Then the series definitions are as follows:

>>> starts = data.groupby('start_date')['value'].sum().sort_index().cumsum()
>>> ends = data.groupby('end_date')['value'].sum().sort_index().cumsum()

Note that here we don't need the set_index() anymore: groupby(...).sum() is an aggregation, so it already returns a series indexed by the group dates, whereas .cumsum() is a transform operation and keeps the existing index.
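For instance, with a hypothetical duplicated first row, sum() both aggregates the values and produces the date index (a small sketch, not from the original answer):

>>> dup = pd.concat([data, data.iloc[[0]]])  # duplicate row 0 for illustration
>>> dup.groupby('start_date')['value'].sum()
start_date
2021-01-03     4
2021-01-04     6
2021-01-07    14
2021-01-12     5
Name: value, dtype: int64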

Since the end dates are inclusive here (an observation still counts on its end date), you need to add a .shift():

>>> dates = pd.date_range(starts.index.min(), ends.index.max())
>>> ends.reindex(dates).ffill().shift().fillna(0)
2021-01-03     0.0
2021-01-04     0.0
2021-01-05     0.0
2021-01-06     0.0
2021-01-07     0.0
2021-01-08     0.0
2021-01-09     0.0
2021-01-10     0.0
2021-01-11     0.0
2021-01-12     0.0
2021-01-13     6.0
2021-01-14    11.0
2021-01-15    11.0
2021-01-16    15.0
Freq: D, Name: value, dtype: float64

Then just subtract the (possibly shifted) ends from the starts:

>>> starts.reindex(dates).ffill() - ends.reindex(dates).ffill().shift().fillna(0)
2021-01-03     4.0
2021-01-04    10.0
2021-01-05    10.0
2021-01-06    10.0
2021-01-07    17.0
2021-01-08    17.0
2021-01-09    17.0
2021-01-10    17.0
2021-01-11    17.0
2021-01-12    22.0
2021-01-13    16.0
2021-01-14    11.0
2021-01-15    11.0
2021-01-16     7.0
Freq: D, Name: value, dtype: float64
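For reuse, the whole recipe can be bundled into a single function. This is just a minimal sketch: daily_sum and its column-name parameters are names of my choosing, with defaults matching the question's frame.

def daily_sum(df, start_col='start_date', end_col='end_date', value_col='value'):
    # Running totals of values that have started / ended on or before each date.
    starts = df.groupby(start_col)[value_col].sum().sort_index().cumsum()
    ends = df.groupby(end_col)[value_col].sum().sort_index().cumsum()
    dates = pd.date_range(starts.index.min(), ends.index.max())
    # shift() keeps each observation active on its (inclusive) end date.
    return starts.reindex(dates).ffill() - ends.reindex(dates).ffill().shift().fillna(0)

Calling daily_sum(data) reproduces the expected series from the question.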
answered Oct 21 '25 by Cimbali


