
How to aggregate a subset of rows and append the result to a MultiIndexed Pandas DataFrame?

Problem Setup & Goal

I have a MultiIndexed Pandas DataFrame that looks like this:

import pandas as pd

df = pd.DataFrame({
    'Values':[1, 3, 4, 8, 5, 2, 9, 0, 2],
    'A':['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A3', 'A3', 'A3'],
    'B':['foo', 'bar', 'fab', 'baz', 'foo', 'baz', 'qux', 'baz', 'bar']
})
df.set_index(['A','B'], inplace=True)
print(df.to_string())

        Values
A  B          
A1 foo       1
   bar       3
   fab       4
   baz       8
A2 foo       5
   baz       2
A3 qux       9
   baz       0
   bar       2

My ultimate goal is to replace all the "bar" and "baz" rows on the B level with a single summed row called "other" within each A group (see below), in the simplest, most canonical Pandas way.

        Values
A  B          
A1 foo       1
   fab       4
   other    11
A2 foo       5
   other     2
A3 qux       9
   other     2

Current Work

Borrowing from a similar problem, I figured out how to create a mask for a MultiIndexed DataFrame that flags the rows we eventually want to aggregate, i.e. the rows whose B label is in agg_list.

agg_list = ['bar', 'baz']
# Create a mask that highlights the rows in B that are in agg_list
filterFunc = lambda x: x.index.get_level_values('B') in agg_list
mask = df.groupby(level=['A','B']).apply(filterFunc)

This produces the expected mask:

print(mask.to_string())

A   B  
A1  bar     True
    baz     True
    fab    False
    foo    False
A2  baz     True
    foo    False
A3  bar     True
    baz     True
    qux    False

And I know how to remove the rows I no longer need:

# Remove rows in B col that are in agg_list using mask
df_masked = df[[~mask.loc[i1, i2] for i1,i2 in df.index]]
print(df_masked.to_string())

        Values
A  B          
A1 foo       1
   fab       4
A2 foo       5
A3 qux       9
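
For comparison, the same filtering can be sketched without the groupby step by testing the B level of the index directly with isin. This is a minimal sketch that assumes the df from the problem setup; the names in_agg and df_kept are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Values': [1, 3, 4, 8, 5, 2, 9, 0, 2],
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A3', 'A3', 'A3'],
    'B': ['foo', 'bar', 'fab', 'baz', 'foo', 'baz', 'qux', 'baz', 'bar'],
}).set_index(['A', 'B'])

agg_list = ['bar', 'baz']

# Boolean array aligned with df's rows: True where the B label is in agg_list
in_agg = df.index.get_level_values('B').isin(agg_list)

# Keep only the rows that will not be aggregated
df_kept = df[~in_agg]
print(df_kept.to_string())
```

Because in_agg follows the row order of df, it can be used for boolean indexing as-is, with no lookup per index tuple.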

But I don't know how to do the actual aggregation (sum) on the masked rows and append the result as a new row under each A group of the MultiIndexed DataFrame.

Similar Problems/Solutions

Similar problems I've seen didn't involve a MultiIndexed DataFrame, so I can't quite reuse solutions like this one, which has the same general idea of creating a mask and then appending a summed row:

threshold = 6
m = df['value'] < threshold
df1 = df[~m].copy()
df1.loc['Z'] = df.loc[m, 'value'].sum()

or

m = df['value'] < threshold
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df1 = pd.concat([df[~m], df.loc[m, ['value']].sum().rename('Z').to_frame().T])

water_with_camphor asked Sep 18 '25 10:09


1 Answer

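You can relabel 'bar' and 'baz' as 'other', then group and sum:

# B is an index level after set_index (so df.B is not available); relabel it on the index
b_others = df.rename(index={'bar': 'other', 'baz': 'other'}, level='B')
b_others.groupby(level=['A', 'B']).sum()

b_others is a copy of df in which every 'bar' and 'baz' label on level B reads 'other'. Grouping on both index levels then sums the rows that now share a label. groupby sorts the group keys by default, so the rows come out in sorted order.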

Output

          Values
A  B            
A1 fab         4
   foo         1
   other      11
A2 foo         5
   other       2
A3 other       2
   qux         9
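
If you would rather keep the mask-then-append pattern from the question, it could be sketched roughly as follows. This assumes the df from the problem setup; in_agg, other and result are illustrative names, and pd.concat stands in for DataFrame.append, which was removed in pandas 2.0.

```python
import pandas as pd

df = pd.DataFrame({
    'Values': [1, 3, 4, 8, 5, 2, 9, 0, 2],
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A3', 'A3', 'A3'],
    'B': ['foo', 'bar', 'fab', 'baz', 'foo', 'baz', 'qux', 'baz', 'bar'],
}).set_index(['A', 'B'])

agg_list = ['bar', 'baz']

# True for the rows whose B label should be folded into 'other'
in_agg = df.index.get_level_values('B').isin(agg_list)

# Sum the masked rows per A, then attach 'other' as their B label
other = df[in_agg].groupby(level='A').sum()
other = other.assign(B='other').set_index('B', append=True)

# Append the summed rows to the untouched rows and sort the index
result = pd.concat([df[~in_agg], other]).sort_index()
print(result.to_string())
```

This takes more steps than relabeling the index, but the intermediate frames (the kept rows and the per-group sums) remain available if you need them separately.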

Vishnudev answered Sep 19 '25 23:09