In Python (pandas), to obtain summaries by group, I use groupby().agg() with a function name, e.g. groupby('variable').agg('sum'). What is the difference between that and calling the method directly, e.g. groupby('variable').sum()?
Setup
import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
The primary benefit of using agg is stated in the docs:
Aggregate using one or more operations over the specified axis.
If different columns need different operations, agg accepts a dictionary that maps column names to operations, so you can express the whole mapping in a single statement (it also accepts a single function, a string, or a list of functions/strings). So if you'd like the sum of column a and the mean of column b:
df.agg({'a': 'sum', 'b': 'mean'})
a 6.0
b 5.0
dtype: float64
It also allows you to apply multiple operations to a single column in a single statement. For example, to find the sum, mean, and std of column a:
df.agg({'a': ['sum', 'mean', 'std']})
a
sum 6.0
mean 2.0
std 1.0
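Since the question is about groupby, here is a minimal sketch of the same two patterns applied per group. The frame gdf and its grouping column 'key' are hypothetical, made up for illustration and not part of the setup above.

```python
import pandas as pd

# Hypothetical data: 'key' is an illustrative grouping column.
gdf = pd.DataFrame({'key': ['x', 'x', 'y'], 'a': [1, 2, 3], 'b': [4, 5, 6]})

# A different operation per column, computed within each group.
per_column = gdf.groupby('key').agg({'a': 'sum', 'b': 'mean'})

# Several operations on one column; the result columns form a
# MultiIndex such as ('a', 'sum') and ('a', 'mean').
multi = gdf.groupby('key').agg({'a': ['sum', 'mean']})

print(per_column)
print(multi)
```

Neither result can be produced by a single call to groupby('key').sum() or groupby('key').mean(), which is where agg earns its keep.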
There's no difference in outcome when you use agg with a single operation, and the same holds for groupby('variable').agg('sum') versus groupby('variable').sum(). I'd argue that df.agg('sum') is less clear than df.sum(), but the results will be the same:
df.agg('sum')
a 6
b 15
dtype: int64
df.sum()
a 6
b 15
dtype: int64
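The same check can be run on the grouped case from the question. This is a small sketch that assumes a grouping column named 'variable' (taken from the question) with made-up values; it redefines df so that it runs on its own.

```python
import pandas as pd

# Illustrative data: the 'variable' values are assumptions for this sketch.
df = pd.DataFrame({'variable': ['x', 'x', 'y'], 'a': [1, 2, 3], 'b': [4, 5, 6]})

via_agg = df.groupby('variable').agg('sum')
via_method = df.groupby('variable').sum()

# DataFrame.equals compares values, dtypes, and labels.
print(via_agg.equals(via_method))  # True
```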
The main benefit agg provides is the convenience of applying multiple operations.