Seeking Advice on Efficient Pandas Operations for Conditional Summing

Question

I am a newbie to Python and pandas and would appreciate any help I can get. I have the below code and would like to know whether there is a more efficient way to write it to improve performance. I tried using cumsum but it does not give me the same output.

Context: I need to calculate the total vesting_value_CAD for each trancheID by summing all trancheIDs having the same employeeID, groupName, and vesting_year with agreementDate <= the current trancheID's agreementDate, excluding the current row.

Code:

import pandas as pd
from datetime import datetime

# Create the dataframe
data = {
    'employeeID': [2, 2, 2, 2, 2, 2, 2],
    'groupName': ['A', 'A', 'A', 'A', 'A', 'B', 'A'],
    'agreementID': [7, 7, 1, 1, 8, 9, 6],
    'agreementDate': ['3/1/2025', '3/1/2025', '4/1/2025', '3/1/2025', '2/1/2025', '3/1/2025', '3/1/2025'],
    'trancheID': [28, 29, 26, 27, 30, 31, 32],
    'vesting_year': [2025, 2026, 2026, 2027, 2026, 2026, 2026],
    'vesting_value_CAD': [200, 300, 400, 500, 50, 30, 40]
}

df = pd.DataFrame(data)

# Convert agreementDate to datetime
df['agreementDate'] = pd.to_datetime(df['agreementDate'], format='%m/%d/%Y')

# Function to calculate total vesting_value_CAD for each trancheID
def calculate_total_vesting_value(row):
    # Filter the dataframe based on the conditions
    filtered_df = df[(df['employeeID'] == row['employeeID']) & 
                     (df['groupName'] == row['groupName']) & 
                     (df['vesting_year'] == row['vesting_year']) & 
                     (df['agreementDate'] <= row['agreementDate']) & 
                     (df['trancheID'] != row['trancheID'])]
    
    # Calculate the sum of vesting_value_CAD
    total_vesting_value = filtered_df['vesting_value_CAD'].sum()
    
    return total_vesting_value

# Apply the function 
df['total_vesting_value_CAD'] = df.apply(calculate_total_vesting_value, axis=1)


print(df)

ouroboros1 · Accepted Answer

Here's one approach:

cols = ['employeeID', 'groupName', 'vesting_year', 'agreementDate']

out = (
   df.merge(
       df.groupby(cols)['vesting_value_CAD'].sum()
         .groupby(cols[:-1]).cumsum()
         .rename('total_vesting_value_CAD'), 
       on=cols
   )
   .assign(
       total_vesting_value_CAD=lambda x: x['total_vesting_value_CAD'] 
       - x['vesting_value_CAD']
       )
   )

Output:

   employeeID groupName  agreementID agreementDate  trancheID  vesting_year  \
0           2         A            7    2025-03-01         28          2025   
1           2         A            7    2025-03-01         29          2026   
2           2         A            1    2025-04-01         26          2026   
3           2         A            1    2025-03-01         27          2027   
4           2         A            8    2025-02-01         30          2026   
5           2         B            9    2025-03-01         31          2026   
6           2         A            6    2025-03-01         32          2026   

   vesting_value_CAD  total_vesting_value_CAD  
0                200                        0  
1                300                       90  
2                400                      390  
3                500                        0  
4                 50                        0  
5                 30                        0  
6                 40                      350

Explanation / Intermediates

Start with .groupby + groupby.sum to get the sum per date:

cols = ['employeeID', 'groupName', 'vesting_year', 'agreementDate']

employeeID  groupName  vesting_year  agreementDate
2           A          2025          2025-03-01       200
                       2026          2025-02-01        50
                                     2025-03-01       340
                                     2025-04-01       400
                       2027          2025-03-01       500
            B          2026          2025-03-01        30
Name: vesting_value_CAD, dtype: int64

Now, chain a groupby on the same columns (now: index levels) except 'agreementDate' (cols[:-1]) + groupby.cumsum + Series.rename:

(df.groupby(cols)['vesting_value_CAD'].sum()
   .groupby(cols[:-1]).cumsum()
   .rename('total_vesting_value_CAD'))

employeeID  groupName  vesting_year  agreementDate
2           A          2025          2025-03-01       200
                       2026          2025-02-01        50
                                     2025-03-01       390
                                     2025-04-01       790
                       2027          2025-03-01       500
            B          2026          2025-03-01        30
Name: total_vesting_value_CAD, dtype: int64

Use .merge on cols to add to the original df and use .assign to correct column 'total_vesting_value_CAD' by subtracting column 'vesting_value_CAD'.

df.merge(...).loc[:, ['vesting_value_CAD', 'total_vesting_value_CAD']]

   vesting_value_CAD  total_vesting_value_CAD
0                200                      200 # 200 - 200 = 0
1                300                      390 # 390 - 300 = 90
2                400                      790 # 790 - 400 = 390
3                500                      500 # etc.
4                 50                       50
5                 30                       30
6                 40                      390

Seeking Advice on Efficient Pandas Operations for Conditional Summing

Tags:

python

pandas

Titi

1 Answers

ouroboros1

Recent Activity

Donate For Us

Seeking Advice on Efficient Pandas Operations for Conditional Summing

Tags:

python

pandas

Titi

1 Answers

ouroboros1

Related questions

Recent Activity

Donate For Us