I'm trying to work out how to use the groupby function in pandas to calculate the proportion of values per year for a given Yes/No criterion.
For example, I have a dataframe called names:
    Name  Number  Year   Sex Criteria
0  name1     789  1998  Male        N
1  name1     688  1999  Male        N
2  name1     639  2000  Male        N
3  name2     551  1998  Male        Y
4  name2     499  1999  Male        Y
I can use
namesgrouped = names.groupby(["Sex", "Year", "Criteria"]).sum()
to get:
                    Number
Sex  Year Criteria        
Male 1998 N          14507
          Y           2308
     1999 N          14119
          Y           2331
and so on. I would like the Number column to show the percentage of the total for each sex and year, so instead of N = 14507 and Y = 2308 for 1998 above I'd have N = 86.27% and Y = 13.73%.
Can anyone advise how to do this?
You can calculate the percentage of the total within a pandas groupby by using the DataFrame.groupby(), DataFrame.agg(), and DataFrame.transform() methods.
groupby() accepts a list of column names to group by several columns at once, and agg() can then apply one or several aggregation functions in a single call.
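As a minimal sketch of grouping by several columns and applying more than one aggregation, the snippet below rebuilds the five sample rows from the question and computes the sum and the mean of Number per sex and year. The name `summary` is only illustrative.

```python
import pandas as pd

# Sample rows copied from the question's `names` frame.
names = pd.DataFrame({
    "Name": ["name1", "name1", "name1", "name2", "name2"],
    "Number": [789, 688, 639, 551, 499],
    "Year": [1998, 1999, 2000, 1998, 1999],
    "Sex": ["Male"] * 5,
    "Criteria": ["N", "N", "N", "Y", "Y"],
})

# Group by two columns and apply two aggregations to Number in one call.
summary = names.groupby(["Sex", "Year"])["Number"].agg(["sum", "mean"])
print(summary)
```

The result has one row per (Sex, Year) pair and one column per aggregation.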
To calculate a cumulative percentage instead, first build a DataFrame (here called data_frame) with pd.DataFrame(), specifying the column values, then use the built-in cumsum() and sum() methods: divide the running total by the overall total.
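A small sketch of that cumulative percentage, assuming a single Number column filled with the question's sample values. The column name `cum_pct` is made up for this example.

```python
import pandas as pd

# Hypothetical single-column frame using the Number values from the question.
data_frame = pd.DataFrame({"Number": [789, 688, 639, 551, 499]})

# Running total divided by the grand total, expressed as a percentage.
data_frame["cum_pct"] = (
    100 * data_frame["Number"].cumsum() / data_frame["Number"].sum()
)
print(data_frame)
```

The last row always reaches 100, since the running total equals the grand total there.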
groupby() can accept several different kinds of argument: a column name or list of column names; a dict or pandas Series; a NumPy array or pandas Index; or an array-like iterable of these.
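The different argument kinds can be sketched on the question's sample rows as follows. The `decade` mapping and the "large"/"small" labels are invented for illustration; a dict is looked up against the index labels, while an array is matched to the rows by position.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question's `names` frame.
names = pd.DataFrame({
    "Name": ["name1", "name1", "name1", "name2", "name2"],
    "Number": [789, 688, 639, 551, 499],
    "Year": [1998, 1999, 2000, 1998, 1999],
    "Sex": ["Male"] * 5,
    "Criteria": ["N", "N", "N", "Y", "Y"],
})

# A single column name.
by_column = names.groupby("Year")["Number"].sum()

# A list of column names.
by_list = names.groupby(["Sex", "Year"])["Number"].sum()

# A dict mapping index labels (0..4 here) to group labels.
decade = {0: "1990s", 1: "1990s", 2: "2000s", 3: "1990s", 4: "1990s"}
by_dict = names.groupby(decade)["Number"].sum()

# A NumPy array with one group label per row.
size_label = np.where(names["Number"] > 600, "large", "small")
by_array = names.groupby(size_label)["Number"].sum()
```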
This question is a direct extension of the suggested duplicate. Borrowing from the accepted answer, this will work:
In [46]: namesgrouped.groupby(level=[0, 1], group_keys=False).apply(lambda g: g / g.sum())
Out[46]: 
                      Number
Sex  Year Criteria          
Male 1998 N         0.588806
          Y         0.411194
     1999 N         0.579612
          Y         0.420388
     2000 N         1.000000
Edit: a transform operation might be faster than apply:
namesgrouped / namesgrouped.groupby(level=[0, 1]).transform('sum')
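Putting the pieces together, here is a self-contained sketch that goes from the question's five sample rows to percentages. Selecting `[["Number"]]` before summing is an assumption made here so that the non-numeric Name column stays out of the result; the names `totals` and `pct` are illustrative.

```python
import pandas as pd

# Sample rows copied from the question's `names` frame.
names = pd.DataFrame({
    "Name": ["name1", "name1", "name1", "name2", "name2"],
    "Number": [789, 688, 639, 551, 499],
    "Year": [1998, 1999, 2000, 1998, 1999],
    "Sex": ["Male"] * 5,
    "Criteria": ["N", "N", "N", "Y", "Y"],
})

# Sum Number per (Sex, Year, Criteria), keeping only the numeric column.
namesgrouped = names.groupby(["Sex", "Year", "Criteria"])[["Number"]].sum()

# Broadcast each (Sex, Year) total back onto every row of that group.
totals = namesgrouped.groupby(level=["Sex", "Year"]).transform("sum")

# Share of each Criteria value within its (Sex, Year) group, in percent.
pct = (100 * namesgrouped / totals).round(2)
print(pct)
```

Because transform returns a result with the same index as namesgrouped, the division lines up row by row and the percentages within each (Sex, Year) group add up to 100.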