resample before pct_change() and missing values

Question

I have a dataframe:

import pandas as pd
df = pd.DataFrame([['A', 'G1', '2019-01-01', 11],
             ['A', 'G1', '2019-01-02', 12], 
             ['A', 'G1', '2019-01-04', 14], 
             ['B', 'G2', '2019-01-01', 11], 
             ['B', 'G2', '2019-01-03', 13], 
             ['B', 'G2', '2019-01-06', 16]], 
            columns=['cust', 'group', 'date', 'val'])
df

enter image description here

df = df.groupby(['cust', 'group', 'date']).sum()
df

enter image description here

The dataframe is grouped and now I would like to calculate pct_change, but only if there are previous date. If I do it like this:

df['pct'] = df.groupby(['cust', 'group']).val.pct_change()
df

enter image description here

I will get pct_change, but with no respect to the missing dates. For example in group ('A', 'G1'), pct for date 2019-01-04 should be np.nan because there is no (previous) date 2019-01-03.

Maybe the solution would be to resample by day, where each new row will have np.nan as val, and than to do pct_change.

I tried to use df.resample('1D', level=2) but than I get an error:

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'

For group ('B', 'G2') all pct_change should be np.nan because none of the rows has previous date.

Expected result is:

enter image description here

How to calculate pct_change respecting missing dates?

Solution:

new_df = pd.DataFrame()

for x, y  in df.groupby(['cust', 'group']):
    resampled=y.set_index('date').resample('D').val.mean().to_frame().rename({'val': 'resamp_val'}, axis=1) 
    resampled = resampled.join(y.set_index('date')).fillna({'cust':x[0],'group':x[1]})
    resampled['resamp_val_pct'] = resampled.resamp_val.pct_change(fill_method=None)

    new_df = pd.concat([new_df, resampled])

new_df = new_df[['cust', 'group', 'val', 'resamp_val', 'resamp_val_pct']]
new_df

enter image description here

BENY · Accepted Answer

Check with groupby, then you need resample first and get the pct change with Boolean mask, since pct_change will ignore NaN.

d={}
for x, y  in df.groupby(['cust', 'group']):
    s = y.set_index('date').resample('D').val.mean()
    d[x] = pd.concat([s, s.pct_change().mask(s.shift().isnull()|s.isnull())], 1)
newdf = pd.concat(d)
newdf.columns = ['val', 'pct']
newdf
Out[651]: 
                  val       pct
     date                      
A G1 2019-01-01  11.0       NaN
     2019-01-02  12.0  0.090909
     2019-01-03   NaN       NaN
     2019-01-04  14.0       NaN
B G2 2019-01-01  11.0       NaN
     2019-01-02   NaN       NaN
     2019-01-03  13.0       NaN
     2019-01-04   NaN       NaN
     2019-01-05   NaN       NaN
     2019-01-06  16.0       NaN

You can add reset_index(inplace=True) at the end to make all index back to columns

resample before pct_change() and missing values

Tags:

python

pandas

resampling

user3225309

1 Answers

BENY

Recent Activity

Donate For Us

resample before pct_change() and missing values

Tags:

python

pandas

resampling

user3225309

1 Answers

BENY

Related questions

Recent Activity

Donate For Us