Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create new dataframe from aggregate of original df and calculations

I want to create a new dataframe from my original dataframe that aggregates by two columns and has a calculated column dependent on the sum of selected rows from two other columns.

Here is a sample df:

df = pd.DataFrame([['A','X',2000,5,3],['A','X',2001,6,2],['B','X',2000,6,3],['B','X',2001,7,2],['C','Y',2000,10,4],['C','Y',2001,12,4],['D','Y',2000,11,2],['D','Y',2001,15,1]],
                   columns=['ctry','rgn','year','val1, val2']))

and what it looks like:

   ctry rgn  year  val1  val2
0     A   X  2000     5     3
1     A   X  2001     6     2
2     B   X  2000     6     3
3     B   X  2001     7     2
4     C   Y  2000     10    4
5     C   Y  2001     12    4
6     D   Y  2000     11    2
7     D   Y  2001     15    1

I ultimately want a new dataframe that gets rid of the ctry column and groups by the rgn and year, and has a calculated column value dependent on val1 and val2 such that the sum of the product of val1 and val2 is divided by the sum of val2 for a rgn and year:

df['value'] = ∑(val1*val2)/∑val2 for each rgn and year

   rgn  year  value
0    X  2000  5.5
1    X  2001  6.5
2    Y  2000  10.333333
3    Y  2001  12.6

I ended up successfully doing so:

df['calc'] = df['val1'] * df['val2']
new_df = df.groupby(['rgn', 'year']).sum()
new_df['value'] = new_df['calc']/new_df['val2']
new_df = new_df.reset_index().rename_axis(None, axis=1)
new_df = new_df.drop(columns=['ctry', 'val1', 'val2', 'calc'])

However, I'd like to know if there is a more succinct way that doesn't require all these steps, perhaps using the lambda function. Appreciate any help I get. Thanks!

like image 824
M S Avatar asked Nov 15 '25 10:11

M S


1 Answers

Pre-compute value = val1*val2, perform a groupby.sum, then compute value/val1:

out = (df.eval('value = val1*val2')
         .groupby(['rgn', 'year'], as_index=False)
         [['value', 'val2']].sum()
         .assign(value=lambda x: x['value']/x.pop('val2'))
      )

Less efficient, but more flexible, you can also use groupby.apply:

out = (df.groupby(['rgn', 'year'])
         .apply(lambda x: (x['val1']*x['val2']).sum()/x['val2'].sum(),
                include_groups=False)
         .reset_index(name='value')
      )

Output:

  rgn  year      value
0   X  2000   5.500000
1   X  2001   6.500000
2   Y  2000  10.333333
3   Y  2001  12.600000
like image 157
mozway Avatar answered Nov 17 '25 09:11

mozway



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!