
Aggregation on sub DataFrames defined by sets of indices without loop

Tags: python, pandas

Suppose I have a pandas DataFrame; here is a simple example:

import pandas as pd
df = pd.DataFrame(columns=["A", "B"], data = [(1, 2), (4, 5), (7, 8), (10, 11)])

I also have a collection of index sets; to keep it simple, take these arbitrary ones:

inds = [(0, 1, 3), (0, 1, 2), (1, 2, 3)]

I want to aggregate the data over each of those index sets. For instance, if the aggregation is the mean, I would obtain:

A                              B
df.loc[inds[0], "A"].mean()    df.loc[inds[0], "B"].mean()
df.loc[inds[1], "A"].mean()    df.loc[inds[1], "B"].mean()
df.loc[inds[2], "A"].mean()    df.loc[inds[2], "B"].mean()

Is there a way to perform this in pure pandas without writing a loop?

This is very similar to a df.groupby and then .agg type of operation, but I did not find a way to create a GroupBy object from a custom set of indices.
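For reference, the loop-based version that the question wants to avoid could look like the sketch below. It is only a baseline for the target values; the name expected is introduced here for illustration and is not part of the original post.

```python
import pandas as pd

df = pd.DataFrame(columns=["A", "B"], data=[(1, 2), (4, 5), (7, 8), (10, 11)])
inds = [(0, 1, 3), (0, 1, 2), (1, 2, 3)]

# One output row per index set: select the rows by label, then average each column.
# list(ix) makes sure the tuple is treated as a list of row labels.
expected = pd.DataFrame([df.loc[list(ix)].mean() for ix in inds])
print(expected)
```

Any loop-free solution should reproduce this frame.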

Asked by DimB on Sep 06 '25 at 19:09


1 Answer

Edit: this answer shows how to achieve the result with groupby, but it is surely "significantly simpler to think of this as a selection by index problem"; see the answer by @HenryEcker.
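To illustrate what treating this as a selection-by-index problem might look like, here is a minimal NumPy sketch. It is not necessarily the code from the referenced answer. It assumes that every index set has the same length and that df has the default RangeIndex, so labels and positions coincide; the names idx, values and out_np are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["A", "B"], data=[(1, 2), (4, 5), (7, 8), (10, 11)])
inds = [(0, 1, 3), (0, 1, 2), (1, 2, 3)]

# Assumption: equal-length index sets, so they stack into a regular 2-D array.
idx = np.asarray(inds)                # shape (n_sets, set_size)

# Fancy indexing picks the rows of every set at once.
values = df.to_numpy()[idx]           # shape (n_sets, set_size, n_columns)

# Average over the members of each set (axis=1).
out_np = pd.DataFrame(values.mean(axis=1), columns=df.columns)
print(out_np)
```

If the index sets have different lengths, they no longer form a regular array and this sketch does not apply; the groupby options below do not have that restriction.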


Option 1 (reindex + groupby)

s = pd.Series(inds).explode()

out = df.reindex(s).groupby(s.index).mean()

out

     A    B
0  5.0  6.0 # i.e. A: (1+4+10)/3, B: (2+5+11)/3, etc.
1  4.0  5.0
2  7.0  8.0

Explanation

  • Use inds to create a pd.Series (here: s), and apply series.explode. The index values function as group identifiers:
# intermediate series ('group 0, 1, 2')

0    0
0    1
0    3
1    0
1    1
1    2
2    1
2    2
2    3
dtype: object
  • Apply df.reindex with values from s, use df.groupby with s.index, and get groupby.mean.
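The two steps above can also be run separately, which makes it easy to inspect the intermediate frame or swap in another aggregation. This is a sketch only: the names picked, grouped and out_multi are illustrative, and the astype(int) cast is an addition here (explode returns object dtype) rather than part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(columns=["A", "B"], data=[(1, 2), (4, 5), (7, 8), (10, 11)])
inds = [(0, 1, 3), (0, 1, 2), (1, 2, 3)]

# explode() yields object dtype; cast back to int so the labels match df's integer index.
s = pd.Series(inds).explode().astype(int)

# Step 1: one row of df per (group, label) pair; rows repeat when sets overlap.
picked = df.reindex(s)

# Step 2: group positionally by the group ids stored in s.index.
grouped = picked.groupby(s.index)

# Any groupby aggregation can follow, not only the mean.
out_multi = grouped.agg(["mean", "max"])
print(out_multi)
```

With several aggregations the result has two-level columns, e.g. out_multi[("A", "mean")].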

Option 2 (merge + groupby)

out = (
    df.merge(
        pd.Series(inds, name='g').explode(), 
        left_index=True, 
        right_on='g', 
        how='right'
        )
    .drop(columns=['g'])
    .groupby(level=0)
    .mean()
    )

# same result

Explanation

  • As with option 1, we create a pd.Series and explode it, but this time we add a name, which we need for the merge in the next step.
  • Now, use df.merge with how='right' to pull in the rows of df, matching the g values of our series (right_on='g') against the index of df (left_index=True).
  • Finally, drop column 'g' (df.drop), apply df.groupby on the index (level=0), and get groupby.mean.
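To see what the merge produces before the aggregation, the chain can be split as in the sketch below. The names g, merged and out_merge are illustrative, and the astype(int) cast is an addition here to keep the key dtypes aligned; the original chain omits it.

```python
import pandas as pd

df = pd.DataFrame(columns=["A", "B"], data=[(1, 2), (4, 5), (7, 8), (10, 11)])
inds = [(0, 1, 3), (0, 1, 2), (1, 2, 3)]

# Named series of row labels; its index carries the group ids.
g = pd.Series(inds, name="g").explode().astype(int)

# Right merge: one output row per entry of g, with the matching A and B values from df.
merged = df.merge(g, left_index=True, right_on="g", how="right")
print(merged)

# Drop the helper column and aggregate per group id (the index level).
out_merge = merged.drop(columns=["g"]).groupby(level=0).mean()
print(out_merge)
```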
Answered by ouroboros1 on Sep 09 '25 at 13:09