Say I have data like this:
col1 col2 other columns..
0 0 ...
0 0 ...
0 0 ...
0 0 ...
0 0 ...
0 0 ...
0 0 ...
0 0 ...
0 0 ...
0 0 ...
0 1 ...
0 1 ...
0 1 ...
0 1 ...
0 1 ...
0 1 ...
1 0 ...
1 0 ...
etc...
The data has been grouped by two columns (this is already the result of the grouping):
gr = df.groupby(['col1', 'col2']).size()
col1 col2
0 0 10
1 5
1 0 2
1 16
2 0 10
Now I need to figure out what percentage of the whole col1 group each subgroup's count represents. I need to add one more column, or (better) transform to a Series, holding the percentage of each col2 value within its col1 group, like:
col1 col2
0 0 0.66
1 0.33
1 0 0.1
1 0.9
2 0 1
Or it can be separate Series, one per col2 value: [0.66 0.1 1] and [0.33 0.9]. How can I implement this?
Let me describe the meaning of this data. For example, it could be subjects (0, 1, 2), results (0 or 1), and the number of students per subject per result. The whole idea is to figure out what percentage of students failed or passed subject 0, 1, etc.
One more thing: sometimes there is only one result (0 or 1), e.g. a subject that all students passed, and I still need to be able to tell that for this subject the percentage of 0 is 0.0 and the percentage of 1 is 1.0.
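First, for a runnable example, here is a toy frame matching the counts in the question (a sketch; the other columns are omitted since only col1 and col2 matter here):
import pandas as pd

# Rebuild the question's data: (col1, col2) pairs repeated
# 10, 5, 2, 16 and 10 times respectively.
pairs = [(0, 0)] * 10 + [(0, 1)] * 5 + [(1, 0)] * 2 + [(1, 1)] * 16 + [(2, 0)] * 10
df = pd.DataFrame(pairs, columns=['col1', 'col2'])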
With that in place, you need to groupby the first level of the index and take the sum:
gr = df.groupby(['col1', 'col2']).size()
print (gr)
col1 col2
0 0 10
1 5
1 0 2
1 16
2 0 10
dtype: int64
print (gr.groupby(level=0).sum())
col1
0 15
1 18
2 10
dtype: int64
ratio = gr / gr.groupby(level=0).sum()
print (ratio)
col1 col2
0 0 0.666667
1 0.333333
1 0 0.111111
1 0.888889
2 0 1.000000
dtype: float64
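An equivalent one-liner (a sketch, assuming a pandas version that supports groupby transform) computes the per-group totals already aligned to the original MultiIndex:
# Same result: transform('sum') broadcasts each group's total
# back onto the full index, so the division needs no alignment.
ratio = gr / gr.groupby(level=0).transform('sum')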
For storing separate Series per col2 value, use a dict comprehension:
dfs = {i: g.reset_index(drop=True) for i, g in ratio.groupby(level=1)}
print (dfs[0])
0 0.666667
1 0.111111
2 1.000000
dtype: float64
print (dfs[1])
0 0.333333
1 0.888889
dtype: float64
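For the question's last point (a subject where one result never occurs should still report 0.0 for it), one possible sketch is to pivot the counts into a table first, so that missing (col1, col2) combinations become explicit zero counts:
# Assumption: every col2 value seen anywhere should appear in every group.
counts = df.groupby(['col1', 'col2']).size().unstack(fill_value=0)
pct = counts.div(counts.sum(axis=1), axis=0)
print (pct.stack())
col1  col2
0     0       0.666667
      1       0.333333
1     0       0.111111
      1       0.888889
2     0       1.000000
      1       0.000000
dtype: float64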