I have the following time series dataframe. I would like to fill the missing values with the previous value. However, I only want to fill the missing values up to a certain column, and that cutoff is recorded in a different column. So the columns I want to fill will be different for each row. How can I do this?
So, given this dataframe.
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2, np.nan, np.nan, np.nan, 2009], [1, 3, np.nan, np.nan, np.nan, 2010], [4, np.nan, 7, np.nan, np.nan, 2011]], columns=[2007, 2008, 2009, 2010, 2011, 'fill_until'])
Input dataframe
2007 2008 2009 2010 2011 fill_until
1 2 NaN NaN NaN 2009
1 3 NaN NaN NaN 2010
4 NaN 7 NaN NaN 2011
Output dataframe:
2007 2008 2009 2010 2011
1 2 2 NaN NaN
1 3 3 3 NaN
4 4 7 7 7
Use ffill + where:
m = df.columns[:-1].values <= df.fill_until.values[:, None]
df.iloc[:, :-1].ffill(axis=1).where(m)
2007 2008 2009 2010 2011
0 1.0 2.0 2.0 NaN NaN
1 1.0 3.0 3.0 3.0 NaN
2 4.0 4.0 7.0 7.0 7.0
Details
Use NumPy's broadcasting to obtain a mask of the cells to be filled, based on the fill_until column.
m = df.columns[:-1].values <= df.fill_until.values[:, None]
Or,
m = (df.columns[:-1].values[:, None] <= df.fill_until.values).T
m
array([[ True, True, True, False, False],
[ True, True, True, True, False],
[ True, True, True, True, True]], dtype=bool)
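To see why the broadcast comparison yields one mask row per dataframe row, here is a minimal sketch that uses plain NumPy arrays in place of the dataframe. The names years and limits are illustrative only and do not appear in the answer.

```python
import numpy as np

# Stand-ins for df.columns[:-1].values and df.fill_until.values
years = np.array([2007, 2008, 2009, 2010, 2011])   # shape (5,)
limits = np.array([2009, 2010, 2011])              # shape (3,)

# limits[:, None] has shape (3, 1); comparing it with shape (5,)
# broadcasts both operands to shape (3, 5): one row per limit.
mask = years <= limits[:, None]

print(mask.shape)        # (3, 5)
print(mask.sum(axis=1))  # number of columns allowed per row
```

Each row of mask is True for every year column up to and including that row's limit.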
Now, slice out all but the last column, and call ffill along axis 1 (that is, across the columns of each row):
i = df.iloc[:, :-1].ffill(axis=1)
i
2007 2008 2009 2010 2011
0 1.0 2.0 2.0 2.0 2.0
1 1.0 3.0 3.0 3.0 3.0
2 4.0 4.0 7.0 7.0 7.0
Now, use the previously computed mask m to mask the values of i using df.where:
i.where(m)
2007 2008 2009 2010 2011
0 1.0 2.0 2.0 NaN NaN
1 1.0 3.0 3.0 3.0 NaN
2 4.0 4.0 7.0 7.0 7.0
Alternatively, use mask, inverting m:
i.mask(~m)
2007 2008 2009 2010 2011
0 1.0 2.0 2.0 NaN NaN
1 1.0 3.0 3.0 3.0 NaN
2 4.0 4.0 7.0 7.0 7.0
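The steps above can be gathered into one self-contained helper. This is only a sketch: the function name ffill_until and its limit_col parameter are made up for illustration, and it assumes every column other than limit_col is labelled with an integer year.

```python
import numpy as np
import pandas as pd

def ffill_until(df, limit_col="fill_until"):
    """Forward-fill each row, keeping only columns <= that row's limit."""
    years = df.columns.drop(limit_col)
    # Row-wise mask via broadcasting: shape (n_rows, n_year_columns)
    m = years.to_numpy().astype(int) <= df[limit_col].to_numpy()[:, None]
    # Forward-fill across columns, then blank out everything past the limit
    return df[years].ffill(axis=1).where(m)

df = pd.DataFrame(
    [[1, 2, np.nan, np.nan, np.nan, 2009],
     [1, 3, np.nan, np.nan, np.nan, 2010],
     [4, np.nan, 7, np.nan, np.nan, 2011]],
    columns=[2007, 2008, 2009, 2010, 2011, "fill_until"],
)
out = ffill_until(df)
print(out)
```

The printed frame should match the where/mask results shown above.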
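You can use:
set_index with the column fill_until
a boolean mask built with NumPy broadcasting
mask, replacing the masked cells with the result of ffill (same as fillna with method ffill)
reset_index, and for the same order of columns add reindex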
df = pd.DataFrame([[1, 2 ,np.nan,np.nan,10 , 2009],
[1, 3 , np.nan , np.nan , np.nan , 2010],
[4, np.nan , 7 , np.nan,np.nan , 2011]],
columns=[2007,2008,2009,2010,2011 , 'fill_until'])
print (df)
2007 2008 2009 2010 2011 fill_until
0 1 2.0 NaN NaN 10.0 2009
1 1 3.0 NaN NaN NaN 2010
2 4 NaN 7.0 NaN NaN 2011
df1 = df.set_index('fill_until')
m = df1.columns.values <= df1.index.values[:, None]
print (m)
[[ True True True False False]
[ True True True True False]
[ True True True True True]]
df = df1.mask(m, df1.ffill(axis=1)).reset_index().reindex(columns=df.columns)
print (df)
2007 2008 2009 2010 2011 fill_until
0 1 2.0 2.0 NaN 10.0 2009
1 1 3.0 3.0 3.0 NaN 2010
2 4 4.0 7.0 7.0 7.0 2011
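Note that the two answers differ when a row already holds a value beyond its fill_until column, such as the 10 in column 2011 of this sample. The sketch below runs both on the same data to show the difference; the variable names vals, filled, overwrite and preserve are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, np.nan, np.nan, 10, 2009],
     [1, 3, np.nan, np.nan, np.nan, 2010],
     [4, np.nan, 7, np.nan, np.nan, 2011]],
    columns=[2007, 2008, 2009, 2010, 2011, "fill_until"],
)

vals = df.drop(columns="fill_until")
m = vals.columns.to_numpy().astype(int) <= df["fill_until"].to_numpy()[:, None]
filled = vals.ffill(axis=1)

# ffill + where: every cell past fill_until becomes NaN, including the 10
overwrite = filled.where(m)

# mask with a replacement: only cells up to fill_until are taken from the
# forward-filled frame; cells past it keep their original values
preserve = vals.mask(m, filled)

print(overwrite)
print(preserve)
```

Which behaviour is wanted depends on whether existing values after the cutoff should survive; the question's sample data has none, so both approaches give the same result there.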