Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speeding up rolling sum calculation in pandas groupby

I want to compute rolling sums group-wise for a large number of groups and I'm having trouble doing it acceptably quickly.

Pandas has build-in methods for rolling and expanding calculations

Here's an example:

import pandas as pd
import numpy as np
obs_per_g = 20
g = 10000
obs = g * obs_per_g
k = 20
df = pd.DataFrame(
    data=np.random.normal(size=obs * k).reshape(obs, k),
    index=pd.MultiIndex.from_product(iterables=[range(g), range(obs_per_g)]),
)

To get rolling and expanding sums I can use

df.groupby(level=0).expanding().sum()
df.groupby(level=0).rolling(window=5).sum()

But this takes a long time for a very large number of groups. For expanding sums, using instead the pandas method cumsum is almost 60 times quicker (16s vs 280ms for the above example) and turns hours into minutes.

df.groupby(level=0).cumsum()

Is there a fast implementation of rolling sum in pandas, like cumsum is for expanding sums? If not, could I use numpy to accomplish this?

like image 537
CloseToC Avatar asked Oct 23 '25 04:10

CloseToC


2 Answers

I had the same experience with .rolling() its nice, but only with small datasets or if the function you are applying is non standard, with sum() I would suggest using cumsum() and subtracting cumsum().shift(5)

df.groupby(level=0).cumsum() - df.groupby(level=0).cumsum().shift(5)
like image 169
Mark Avatar answered Oct 25 '25 18:10

Mark


To provide the latest information on this, if you upgrade pandas, the performance of groupby rolling has been significantly improved. This is approx 4-5 times faster in 1.1.0 and x12 faster in >1.2.0 compared to 0.24 or 1.0.0.

I believe the biggest performance improvement comes from this PR which means it can do more in cython (before it was implemented like groupby.apply(lambda x: x.rolling())).

I used the below code to benchmark:

import pandas
import numpy

print(pandas.__version__)
print(numpy.__version__)


def stack_overflow_df():
    obs_per_g = 20
    g = 10000
    obs = g * obs_per_g
    k = 2
    df = pandas.DataFrame(
        data=numpy.random.normal(size=obs * k).reshape(obs, k),
        index=pandas.MultiIndex.from_product(iterables=[range(g), range(obs_per_g)]),
    )
    return df


df = stack_overflow_df()

# N.B. droplevel important to make indices match
rolling_result = (
    df.groupby(level=0)[[0, 1]].rolling(10, min_periods=1).sum().droplevel(level=0)
)
df[["value_0_rolling_sum", "value_1_rolling_sum"]] = rolling_result
%%timeit
# results:
# numpy version always 1.19.4
# pandas 0.24 = 12.3 seconds
# pandas 1.0.5 = 12.9 seconds
# pandas 1.1.0 = broken with groupby rolling bug
# pandas 1.1.1 = 2.9 seconds
# pandas 1.1.5 = 2.5 seconds
# pandas 1.2.0 = 1.06 seconds
# pandas 1.2.2 = 1.06 seconds

I think care must be taken if trying to use numpy.cumsum to improve performance (regardless of pandas version). For example, using something like the below:

# Gives different output
df.groupby(level=0)[[0, 1]].cumsum() - df.groupby(level=0)[[0, 1]].cumsum().shift(10)

While this is much faster, the output is not correct. This shift is performed over all rows and mixes the cumsum of different groups. i.e. The first result of the next group is shifted back to the previous group.

To have the same behaviour as above, you need to use apply:

df.groupby(level=0)[[0, 1]].cumsum() - df.groupby(level=0)[[0, 1]].apply(
    lambda x: x.cumsum().shift(10).fillna(0)
)

which, in the most recent version (1.2.2), is slower than using rolling directly. Hence, for groupby rolling sums, I don't think numpy.cumsum is the best solution for pandas>=1.1.1

For completeness, if your groups are columns rather than the index, you should use syntax like this:

# N.B. reset_index important to make indices match
rolling_result = (
    df.groupby(["category_0", "category_1"])[["value_0", "value_1"]]
    .rolling(10, min_periods=1)
    .sum()
    .reset_index(drop=True)
)
df[["value_0_rolling_sum", "value_1_rolling_sum"]] = rolling_result
like image 44
user2175850 Avatar answered Oct 25 '25 16:10

user2175850



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!