Pandas DataFrame custom agg function strange behavior

I'd like to aggregate a Pandas DataFrame along an axis using a custom function, and I'm having trouble figuring out what the function should return.

df = pd.DataFrame(np.arange(50).reshape(10,5))

You can pass numpy functions to DataFrame.agg:

# Case 1
df.agg([np.mean], axis=1)

And you get what you expect: a DataFrame indexed just like df, but with one column: 'mean'. But for some reason, the following behave completely differently:

# Case 2
df.agg([lambda x:np.mean(x)], axis=1)

or even

# Case 3
def f(x, **kwargs):
    return np.mean(x, **kwargs)

df.agg([f], axis=1)

Why should the latter two work any differently than the first case?

nvd81 asked Oct 21 '25 12:10


1 Answer

If I am not mistaken, what is happening in Case 2 is that the lambda is not being called once per row. Instead, np.mean() ends up being evaluated on each entry individually, and the mean of a single number is just that number as a float. That is why you get every single entry of the DataFrame back, rather than one mean per row, when you run df.agg([lambda x:np.mean(x)], axis=1), which returns:

               0     1     2     3     4
0 <lambda>   0.0   1.0   2.0   3.0   4.0
1 <lambda>   5.0   6.0   7.0   8.0   9.0
2 <lambda>  10.0  11.0  12.0  13.0  14.0
3 <lambda>  15.0  16.0  17.0  18.0  19.0
4 <lambda>  20.0  21.0  22.0  23.0  24.0
5 <lambda>  25.0  26.0  27.0  28.0  29.0
6 <lambda>  30.0  31.0  32.0  33.0  34.0
7 <lambda>  35.0  36.0  37.0  38.0  39.0
8 <lambda>  40.0  41.0  42.0  43.0  44.0
9 <lambda>  45.0  46.0  47.0  48.0  49.0
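
The table above contains the same values as df itself, which is what you would see if np.mean() were called on each entry separately. Here is a minimal sketch of that effect. It uses Series.map purely as an illustration of element-wise application, which is an assumption made for the example and not a claim about what agg does internally:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(50).reshape(10, 5))

# The mean of a single number is that number, returned as a float.
single = np.mean(7)

# Emulate element-wise application: call np.mean on every entry separately.
# Series.map is used here only for illustration.
elementwise = df.apply(lambda col: col.map(np.mean))

# The element-wise result holds the same values as the original frame,
# which matches the Case 2 table.
same_values = np.array_equal(elementwise.to_numpy(), df.to_numpy())

print(single)
print(same_values)
```

If the function really received a whole row at a time, the result would have one value per row instead of one per entry.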

The pandas documentation for DataFrame.agg has a note on exactly this difference: pandas aggregation operations are always performed over an axis, whereas numpy aggregation functions such as np.mean() default to aggregating the flattened array.
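
As a small illustration of the numpy side of that difference, here is a sketch using a plain array with the same shape and values as df (the variable names are just for this example):

```python
import numpy as np

arr = np.arange(50).reshape(10, 5)

# No axis given: numpy averages over the flattened array.
overall = np.mean(arr)

# axis=0 averages down each column, axis=1 averages across each row.
per_column = np.mean(arr, axis=0)
per_row = np.mean(arr, axis=1)

print(overall)     # 24.5
print(per_column)  # [22.5 23.5 24.5 25.5 26.5]
print(per_row)     # [ 2.  7. 12. 17. 22. 27. 32. 37. 42. 47.]
```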

To make Case 2 behave as Case 1 does, you can specify the axis in the np.mean() function itself: df.agg([lambda x:np.mean(x,axis=0)],axis=1), which returns the following:

   <lambda>
0       2.0
1       7.0
2      12.0
3      17.0
4      22.0
5      27.0
6      32.0
7      37.0
8      42.0
9      47.0

Similarly, you can make Case 3 behave as Case 1 does by specifying axis=0 in the np.mean() function:

def f(x, **kwargs):
    return np.mean(x, axis=0, **kwargs)

df.agg([f], axis=1)

And this returns:

      f
0   2.0
1   7.0
2  12.0
3  17.0
4  22.0
5  27.0
6  32.0
7  37.0
8  42.0
9  47.0
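
If all you need is the row-wise mean, it may be simpler to sidestep the question of how agg dispatches custom functions. Here is a sketch of two alternatives, assuming the same df as in the question; neither is required for the fix above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(50).reshape(10, 5))

# Built-in row-wise mean, wrapped into a one-column DataFrame named "mean".
row_means = df.mean(axis=1).to_frame("mean")

# DataFrame.apply with axis=1 passes each row to the function as a Series,
# so np.mean sees the whole row rather than a single entry.
applied = df.apply(lambda row: np.mean(row), axis=1)

print(row_means)
print(applied)
```

The first form gives a DataFrame shaped like the Case 1 result, and the second gives a Series with the same values.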
Derek O answered Oct 25 '25 03:10