I have a dataframe:
import pandas as pd
df = pd.DataFrame([['A', 'G1', '2019-01-01', 11],
['A', 'G1', '2019-01-02', 12],
['A', 'G1', '2019-01-04', 14],
['B', 'G2', '2019-01-01', 11],
['B', 'G2', '2019-01-03', 13],
['B', 'G2', '2019-01-06', 16]],
columns=['cust', 'group', 'date', 'val'])
df
df = df.groupby(['cust', 'group', 'date']).sum()
df
The dataframe is grouped and now I would like to calculate pct_change
, but only if there are previous date.
If I do it like this:
df['pct'] = df.groupby(['cust', 'group']).val.pct_change()
df
I will get pct_change
, but with no respect to the missing dates.
For example in group ('A', 'G1')
, pct
for date 2019-01-04
should be np.nan
because there is no (previous) date 2019-01-03
.
Maybe the solution would be to resample by day, where each new row will have np.nan
as val
, and than to do pct_change
.
I tried to use df.resample('1D', level=2)
but than I get an error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
For group ('B', 'G2')
all pct_change
should be np.nan
because none of the rows has previous date.
Expected result is:
How to calculate pct_change
respecting missing dates?
Solution:
new_df = pd.DataFrame()
for x, y in df.groupby(['cust', 'group']):
resampled=y.set_index('date').resample('D').val.mean().to_frame().rename({'val': 'resamp_val'}, axis=1)
resampled = resampled.join(y.set_index('date')).fillna({'cust':x[0],'group':x[1]})
resampled['resamp_val_pct'] = resampled.resamp_val.pct_change(fill_method=None)
new_df = pd.concat([new_df, resampled])
new_df = new_df[['cust', 'group', 'val', 'resamp_val', 'resamp_val_pct']]
new_df
Check with groupby
, then you need resample
first and get the pct change with Boolean mask, since pct_change will ignore NaN
.
d={}
for x, y in df.groupby(['cust', 'group']):
s = y.set_index('date').resample('D').val.mean()
d[x] = pd.concat([s, s.pct_change().mask(s.shift().isnull()|s.isnull())], 1)
newdf = pd.concat(d)
newdf.columns = ['val', 'pct']
newdf
Out[651]:
val pct
date
A G1 2019-01-01 11.0 NaN
2019-01-02 12.0 0.090909
2019-01-03 NaN NaN
2019-01-04 14.0 NaN
B G2 2019-01-01 11.0 NaN
2019-01-02 NaN NaN
2019-01-03 13.0 NaN
2019-01-04 NaN NaN
2019-01-05 NaN NaN
2019-01-06 16.0 NaN
You can add reset_index(inplace=True)
at the end to make all index back to columns
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With