I want to lag every column in a dataframe, by group. I have a frame like this:
import numpy as np
import pandas as pd
index = pd.date_range('2015-11-20', periods=6, freq='D')
df = pd.DataFrame(dict(time=index, grp=['A']*3 + ['B']*3, col1=[1,2,3]*2,
    col2=['a','b','c']*2)).set_index(['time','grp'])
which looks like
                col1 col2
time       grp           
2015-11-20 A       1    a
2015-11-21 A       2    b
2015-11-22 A       3    c
2015-11-23 B       1    a
2015-11-24 B       2    b
2015-11-25 B       3    c
and I want it to look like this:
                col1 col2 col1_lag col2_lag
time       grp                     
2015-11-20 A       1    a        2        b
2015-11-21 A       2    b        3        c
2015-11-22 A       3    c       NA       NA
2015-11-23 B       1    a        2        b
2015-11-24 B       2    b        3        c
2015-11-25 B       3    c       NA       NA
This question manages the result for a single column, but I have an arbitrary number of columns, and I want to lag all of them. I can use groupby and apply, but apply runs the shift function over each column independently, and it doesn't seem to like receiving an [nrow, 2] shaped dataframe in return. Is there perhaps a function like apply that acts on the whole group sub-frame? Or is there a better way to do this?
Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns. This is Python's closest equivalent to dplyr's group_by + summarise logic.
Create lag variables, using the shift function. shift(1) creates a lag of a single record, while shift(5) creates a lag of five records. This creates a lag variable based on the prior observations, but shift can also take a time offset to specify the time to use in shift.
The Groupby Rolling function does not preserve the original index and so when dates are the same within the Group, it is impossible to know which index value it pertains to from the original dataframe.
IIUC, you can simply use level="grp" and then shift by -1:
>>> shifted = df.groupby(level="grp").shift(-1)
>>> df.join(shifted.rename(columns=lambda x: x+"_lag"))
                col1 col2  col1_lag col2_lag
time       grp                              
2015-11-20 A       1    a         2        b
2015-11-21 A       2    b         3        c
2015-11-22 A       3    c       NaN      NaN
2015-11-23 B       1    a         2        b
2015-11-24 B       2    b         3        c
2015-11-25 B       3    c       NaN      NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With