
Why does groupby().agg() in pandas take significantly more time with a list of functions than when the functions are called individually?

Consider the following code for any DataFrame df and any set of columns ['A', 'B', 'C', 'D']:

df.groupby('A')[['B', 'C', 'D']].agg(['mean', 'std', 'count'])

Here, all the aggregation functions are passed together as a list.

This takes significantly more time than:

grpd = df.groupby('A')[['B', 'C', 'D']]
grpd.agg('mean')
grpd.agg('std')
grpd.agg('count')

where each aggregation function is being called separately.

This seems counterintuitive, as I expected pandas to do something under the hood to make the combined call faster.

Can anyone explain why?

asked Dec 05 '25 00:12 by Nikhil Mishra


1 Answer

I think the reason is that pandas uses Cython-optimized code when the functions are called separately. For the test, I added concat so that the outputs are comparable:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
df = pd.DataFrame(np.random.randint(1000, size=(N, 4)), columns=list('ABCD'))
print(df)

In [176]: %%timeit
     ...: df.groupby('A')[['B', 'C', 'D']].agg(['mean', 'std', 'count'])
     ...: 
274 ms ± 7.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [177]: %%timeit
     ...: grpd = df.groupby('A')[['B', 'C', 'D']]
     ...: a = grpd.agg('mean')
     ...: b = grpd.agg('std')
     ...: c = grpd.agg('count')
     ...: pd.concat([a, b, c], axis=1)
     ...: 
190 ms ± 980 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [178]: %%timeit
     ...: grpd = df.groupby('A')[['B', 'C', 'D']]
     ...: a = grpd.mean()
     ...: b = grpd.std()
     ...: c = grpd.count()
     ...: pd.concat([a, b, c], axis=1)
     ...: 
191 ms ± 4.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
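
A plain pd.concat yields flat, repeated column names, so its layout differs from the (column, function) MultiIndex that agg returns for a list. The following is a minimal sketch of one way to check that both approaches give the same numbers. It assumes a pandas version that accepts list-style column selection on a groupby, and it uses a smaller frame than the timings (an arbitrary choice so the check runs quickly). The names combined and separate are illustrative only.

```python
import numpy as np
import pandas as pd

# Smaller frame than in the timings; sizes here are arbitrary assumptions.
rng = np.random.default_rng(123)
df = pd.DataFrame(rng.integers(0, 50, size=(10_000, 4)), columns=list("ABCD"))

grpd = df.groupby("A")[["B", "C", "D"]]

# One call with a list of functions: columns form a (column, function) MultiIndex.
combined = grpd.agg(["mean", "std", "count"])

# Separate calls, labelled via `keys` so each function becomes a column level.
separate = pd.concat(
    [grpd.mean(), grpd.std(), grpd.count()],
    axis=1,
    keys=["mean", "std", "count"],
)

# Move the original column name to the outer level, then match the column order.
separate = separate.swaplevel(axis=1).reindex(columns=combined.columns)

# Raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(combined, separate)
```

The extra swaplevel and reindex steps add some overhead of their own, so any timing of this variant would need to be measured rather than inferred from the numbers above.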

answered Dec 06 '25 13:12 by jezrael


