Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas GroupBy using another DataFrame of one-hot encodings/overlapping masks

I have two dataframes with observations on rows and features (or group membership) on columns, e.g.:

> data_df

     a    b    c
A    1    2    1
B    0    1    3
C    0    0    1
D    2    1    1
E    1    1    1

> mask_df

    g1   g2
A    0    1
B    1    0
C    1    0
D    1    0
E    0    1

I want to group and aggregate (by sum) the values in the first dataframe (data_df) conditional on the binary values (masks) in the second dataframe (mask_df). The result should be the following (groups x features):

> aggr_df

     a    b    c
g1   2    2    5
g2   2    3    2

Is there a way in pandas to group the first dataframe (data_df) using the masks contained in a second dataframe (mask_df) in a single command?

like image 849
gc5 Avatar asked Sep 05 '25 03:09

gc5


1 Answers

You can do this cheaply with dot and groupby:

data_df.groupby(mask_df.dot(mask_df.columns)).sum()

    a  b  c
g1  2  2  5
g2  2  3  2

Where,

mask_df.dot(mask_df.columns)

A    g2
B    g1
C    g1
D    g1
E    g2
dtype: object

Which works well assuming each row always has exactly one column set to 1.

like image 142
cs95 Avatar answered Sep 07 '25 21:09

cs95