I am a newbie to Python and pandas and would appreciate any help I can get. I have the below code and would like to know whether there is a more efficient way to write it to improve performance. I tried using cumsum but it does not give me the same output.
Context: I need to calculate the total vesting_value_CAD for each trancheID by summing all trancheIDs having the same employeeID, groupName, and vesting_year with agreementDate <= the current trancheID's agreementDate, excluding the current row.
Code:
import pandas as pd
from datetime import datetime
# Create the dataframe
data = {
'employeeID': [2, 2, 2, 2, 2, 2, 2],
'groupName': ['A', 'A', 'A', 'A', 'A', 'B', 'A'],
'agreementID': [7, 7, 1, 1, 8, 9, 6],
'agreementDate': ['3/1/2025', '3/1/2025', '4/1/2025', '3/1/2025', '2/1/2025', '3/1/2025', '3/1/2025'],
'trancheID': [28, 29, 26, 27, 30, 31, 32],
'vesting_year': [2025, 2026, 2026, 2027, 2026, 2026, 2026],
'vesting_value_CAD': [200, 300, 400, 500, 50, 30, 40]
}
df = pd.DataFrame(data)
# Convert agreementDate to datetime
df['agreementDate'] = pd.to_datetime(df['agreementDate'], format='%m/%d/%Y')
# Function to calculate total vesting_value_CAD for each trancheID
def calculate_total_vesting_value(row):
# Filter the dataframe based on the conditions
filtered_df = df[(df['employeeID'] == row['employeeID']) &
(df['groupName'] == row['groupName']) &
(df['vesting_year'] == row['vesting_year']) &
(df['agreementDate'] <= row['agreementDate']) &
(df['trancheID'] != row['trancheID'])]
# Calculate the sum of vesting_value_CAD
total_vesting_value = filtered_df['vesting_value_CAD'].sum()
return total_vesting_value
# Apply the function
df['total_vesting_value_CAD'] = df.apply(calculate_total_vesting_value, axis=1)
print(df)
Here's one approach:
cols = ['employeeID', 'groupName', 'vesting_year', 'agreementDate']
out = (
df.merge(
df.groupby(cols)['vesting_value_CAD'].sum()
.groupby(cols[:-1]).cumsum()
.rename('total_vesting_value_CAD'),
on=cols
)
.assign(
total_vesting_value_CAD=lambda x: x['total_vesting_value_CAD']
- x['vesting_value_CAD']
)
)
Output:
employeeID groupName agreementID agreementDate trancheID vesting_year \
0 2 A 7 2025-03-01 28 2025
1 2 A 7 2025-03-01 29 2026
2 2 A 1 2025-04-01 26 2026
3 2 A 1 2025-03-01 27 2027
4 2 A 8 2025-02-01 30 2026
5 2 B 9 2025-03-01 31 2026
6 2 A 6 2025-03-01 32 2026
vesting_value_CAD total_vesting_value_CAD
0 200 0
1 300 90
2 400 390
3 500 0
4 50 0
5 30 0
6 40 350
Explanation / Intermediates
.groupby + groupby.sum to get the sum per date:cols = ['employeeID', 'groupName', 'vesting_year', 'agreementDate']
employeeID groupName vesting_year agreementDate
2 A 2025 2025-03-01 200
2026 2025-02-01 50
2025-03-01 340
2025-04-01 400
2027 2025-03-01 500
B 2026 2025-03-01 30
Name: vesting_value_CAD, dtype: int64
groupby on the same columns (now: index levels) except 'agreementDate' (cols[:-1]) + groupby.cumsum + Series.rename:(df.groupby(cols)['vesting_value_CAD'].sum()
.groupby(cols[:-1]).cumsum()
.rename('total_vesting_value_CAD'))
employeeID groupName vesting_year agreementDate
2 A 2025 2025-03-01 200
2026 2025-02-01 50
2025-03-01 390
2025-04-01 790
2027 2025-03-01 500
B 2026 2025-03-01 30
Name: total_vesting_value_CAD, dtype: int64
.merge on cols to add to the original df and use .assign to correct column 'total_vesting_value_CAD' by subtracting column 'vesting_value_CAD'.df.merge(...).loc[:, ['vesting_value_CAD', 'total_vesting_value_CAD']]
vesting_value_CAD total_vesting_value_CAD
0 200 200 # 200 - 200 = 0
1 300 390 # 390 - 300 = 90
2 400 790 # 790 - 400 = 390
3 500 500 # etc.
4 50 50
5 30 30
6 40 390
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With