I have a dataframe of 2 columns with multiple groups, in this case, column 1 with groups, and column 2 groups, (in this case dates which must be in correct order). I want a cumulative summation of the third column per each unique group in col1 ordered by col2.
I have yet to find an elegant method to accomplish this. The following code will correctly accomplish the task, but sums the whole table, and is very dependent on getting all of the sorts correct.
multidf = df.groupby(by=['col1','col2']).sum().groupby(level=[0]).cumsum()
Will give a hierarchical index solution, but I could not find a way to take the resulting cumulative sum column and attach it to the original dataframe without multistep merge commands.  The reset_index command just converts the table back to its original state. 
Looking for a better method to achieve desired output as shown. I have messed with lambda, apply, aggregrate commands can't quite get anything to work.
In[229]: df#example dataframe
Out[229]: 
  col1                 col2  col3
0    a  2013/01/03 00:00:00     1
1    a  2013/03/05 09:43:31     3
2    b  2013/03/07 00:00:00     4
3    b  2013/03/07 00:00:00     2
4    a  2013/03/07 00:00:00     0
import numpy as np
import pandas as pd
##example dataframe
df = pd.DataFrame({'col1' : ['a','a','b','b','a'],
               'col2' : ['2013/01/03 00:00:00', '2013/03/05 09:43:31', '2013/03/07 00:00:00',\
                         '2013/03/07 00:00:00', '2013/03/07 00:00:00'],
               'col3' : [1,3,4,2,0]})
df = df.sort(['col1','col2'])
jj= df.groupby(by = ['col1'],sort=['col1','col2']).cumsum()
df = df.sort(['col1','col2'])
##multi alternative, can't get result back into original df elegantly
multidf = df.groupby(by=['col1','col2']).sum().groupby(level=[0]).cumsum()
df['cumsum'] = jj['col3']
In[227]: df ## result of unelegant method, desired output though how else can i achieve this?
Out[227]: 
  col1                 col2  col3 cumsum
0    a  2013/01/03 00:00:00     1      1
1    a  2013/03/05 09:43:31     3      4
4    a  2013/03/07 00:00:00     0      4
2    b  2013/03/07 00:00:00     4      4
3    b  2013/03/07 00:00:00     2      6
It would be easier to set the column you want to sort as index. Then use groupby.transform(pd.Series.cumsum). See below:
In [1]: df
Out[1]: 
  col1                 col2  col3
0    a  2013/01/03 00:00:00     1
1    a  2013/03/05 09:43:31     3
2    b  2013/03/07 00:00:00     4
3    b  2013/03/07 00:00:00     2
4    a  2013/03/07 00:00:00     0
In [2]: df1=df.set_index('col2').sort_index()
In [3]: df1
Out[3]: 
                    col1  col3
col2                          
2013/01/03 00:00:00    a     1
2013/03/05 09:43:31    a     3
2013/03/07 00:00:00    b     4
2013/03/07 00:00:00    b     2
2013/03/07 00:00:00    a     0
In [4]: df1['cumsum']=df1.groupby('col1')['col3'].transform(pd.Series.cumsum)
In [5]: df1
Out[5]: 
                    col1  col3  cumsum
col2                                  
2013/01/03 00:00:00    a     1       1
2013/03/05 09:43:31    a     3       4
2013/03/07 00:00:00    b     4       4
2013/03/07 00:00:00    b     2       6
2013/03/07 00:00:00    a     0       4
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With