Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas - count percentage of group size

Say, I have the data like this:

col1   col2 other columns..
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      1    ...
0      1    ...
0      1    ...
0      1    ...
0      1    ...
0      1    ...
1      0    ...
1      0    ...
etc...

Data has been grouped by 2 columns (it's already result by grouping):

gr = df.groupby(['col1', 'col2']).size()

col1   col2        
0      0           10
       1           5
1      0           2
       1           16
2      0           10

So now I need to figure out which percentage of each subgroup the count has respectively the whole group by 2 columns:

I need to add one more column, or transform to Series (better) to have a percentage of col2 respectively the group (col1) like:

       col1        col2
0      0           0.66
       1           0.33
1      0           0.1
       1           0.9
2      0           1

Or it can be separate Series for each group: [0.66 0.1 1] and [0.33 0.9]. How to implement it?

So let me describe the meaning of this data. For example it can be subjects (0,1,2), results (0 or 1) and amount students per subject per result. So the whole idea is to figure out what percentage of students failed/passed for subjects 0,1, etc.

One more additional thing - sometimes there is only one result (0 or 1) like the subject that all students passed, and I still need to be able to tell that for this subject percentage of 0 is 0,0 and of 1 is 1.

like image 227
mimic Avatar asked Oct 17 '25 02:10

mimic


1 Answers

You need groupby by first level of index with sum:

gr = df.groupby(['col1', 'col2']).size()
print (gr)
col1  col2
0     0       10
      1        5
1     0        2
      1       16
2     0       10
dtype: int64

print (gr.groupby(level=0).sum())
col1
0    15
1    18
2    10
dtype: int64

print (gr / gr.groupby(level=0).sum())
col1  col2
0     0       0.666667
      1       0.333333
1     0       0.111111
      1       0.888889
2     0       1.000000
dtype: float64

For storing Series use dict comprehension:

dfs = {i:g.reset_index(drop=True) for i, g in g1.groupby(level=1)}

print (dfs[0])
0    0.666667
1    0.111111
2    1.000000
dtype: float64

print (dfs[1])
0    0.333333
1    0.888889
dtype: float64
like image 183
jezrael Avatar answered Oct 18 '25 16:10

jezrael



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!