Can pandas groupby use groupby.apply(func) and, inside func, use another instance of .apply() without duplicating and overwriting data?
In a way, the use of .apply() is nested.
Python 3.7.3
pandas==0.25.1
import pandas as pd


def dummy_func_nested(row):
    # negate the value of a single row
    row['new_col_2'] = row['value'] * -1
    return row


def dummy_func(df_group):
    df_group['new_col_1'] = None
    # apply dummy_func_nested to every row of the group
    df_group = df_group.apply(dummy_func_nested, axis=1)
    return df_group


def pandas_groupby():
    # initialize data
    df = pd.DataFrame([
        {'country': 'US', 'value': 100.00, 'id': 'a'},
        {'country': 'US', 'value': 95.00, 'id': 'b'},
        {'country': 'CA', 'value': 56.00, 'id': 'y'},
        {'country': 'CA', 'value': 40.00, 'id': 'z'},
    ])
    # group by country and apply first dummy_func
    new_df = df.groupby('country').apply(dummy_func)
    # new_df and df should have the same list of countries
    assert new_df['country'].tolist() == df['country'].tolist()
    # print the grouped result, which carries the two new columns
    print(new_df)


if __name__ == '__main__':
    pandas_groupby()
The above code should return:
  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  b      None      -95.0
2      CA   56.0  y      None      -56.0
3      CA   40.0  z      None      -40.0
However, the code returns:
  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  a      None      -95.0
2      US   56.0  a      None      -56.0
3      US   40.0  a      None      -40.0
This behavior only appears to happen when both groups have an equal number of rows. If one group has more rows than the other, the output is as expected.
A quote from the documentation of DataFrame.apply:
In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
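If you want to keep the nested .apply(), one way to sidestep side effects is to have each function work on a copy instead of mutating the object pandas passes in. The following is only a sketch: the names negate_value and add_columns are made up for illustration, and I have not verified it against pandas 0.25.1 specifically. Newer pandas releases also changed how groupby.apply treats the grouping column, so the sketch only relies on the id and new_col_2 columns.

```python
import pandas as pd


def negate_value(row):
    # Copy first so the function has no side effects on pandas' own row object.
    row = row.copy()
    row['new_col_2'] = row['value'] * -1
    return row


def add_columns(df_group):
    # Copy the group as well before adding a column to it.
    df_group = df_group.copy()
    df_group['new_col_1'] = None
    return df_group.apply(negate_value, axis=1)


df = pd.DataFrame([
    {'country': 'US', 'value': 100.00, 'id': 'a'},
    {'country': 'US', 'value': 95.00, 'id': 'b'},
    {'country': 'CA', 'value': 56.00, 'id': 'y'},
    {'country': 'CA', 'value': 40.00, 'id': 'z'},
])

# group_keys=False keeps the original row labels; sort_index restores row order.
result = df.groupby('country', group_keys=False).apply(add_columns).sort_index()
print(result)
```

Each row should keep its own id here, because nothing is written back into an object that pandas may reuse between calls.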
Try changing this part of your code:
def dummy_func(df_group):
    df_group['new_col_1'] = None
    # apply dummy_func_nested
    df_group = df_group.apply(dummy_func_nested, axis=1)
    return df_group
To:
def dummy_func(df_group):
    df_group['new_col_1'] = None
    # call dummy_func_nested on the whole group
    df_group = dummy_func_nested(df_group)
    return df_group
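You don't need the inner apply: dummy_func_nested only multiplies the value column by -1, and that works on a whole DataFrame just as well as on a single row.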
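Here is a self-contained sketch of that change, in case you want to run it end to end. It reuses the function names from the question; the .copy() calls, group_keys=False and the final sort_index() are my additions, not part of the original code, and are there only to keep the functions side-effect free and the rows in their original order.

```python
import pandas as pd


def dummy_func_nested(frame):
    # The multiplication is vectorized, so this accepts a whole group.
    frame = frame.copy()
    frame['new_col_2'] = frame['value'] * -1
    return frame


def dummy_func(df_group):
    df_group = df_group.copy()
    df_group['new_col_1'] = None
    # Call the helper directly instead of going through a row-wise apply.
    return dummy_func_nested(df_group)


df = pd.DataFrame([
    {'country': 'US', 'value': 100.00, 'id': 'a'},
    {'country': 'US', 'value': 95.00, 'id': 'b'},
    {'country': 'CA', 'value': 56.00, 'id': 'y'},
    {'country': 'CA', 'value': 40.00, 'id': 'z'},
])

new_df = df.groupby('country', group_keys=False).apply(dummy_func).sort_index()
print(new_df)
```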
Of course, the more efficient way would be:
df['new_col_1'] = None
df['new_col_2'] = -df['value']
print(df)
Or:
print(df.assign(new_col_1=None, new_col_2=-df['value']))
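One practical difference between the two variants: the column assignments modify df in place, while assign returns a new DataFrame and leaves df alone. A small sketch with the data from the question (the variable name result is just for illustration):

```python
import pandas as pd

df = pd.DataFrame([
    {'country': 'US', 'value': 100.00, 'id': 'a'},
    {'country': 'US', 'value': 95.00, 'id': 'b'},
    {'country': 'CA', 'value': 56.00, 'id': 'y'},
    {'country': 'CA', 'value': 40.00, 'id': 'z'},
])

# assign builds a new DataFrame; df keeps its original three columns.
result = df.assign(new_col_1=None, new_col_2=-df['value'])

print(result)
print(df.columns.tolist())
```

No groupby is involved at all, so every row keeps its own country and id.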