
Fill nulls until certain column value in Pandas

Tags: python, pandas

I have the following time series dataframe. I would like to fill the missing values with the previous value. However, I only want to fill the missing values until a certain column is reached. That column is recorded in a different column, fill_until, so the columns I want to fill will be different for each row. How can I do this?

So, given this dataframe.

import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2, np.nan, np.nan, np.nan, 2009],
                   [1, 3, np.nan, np.nan, np.nan, 2010],
                   [4, np.nan, 7, np.nan, np.nan, 2011]],
                  columns=[2007, 2008, 2009, 2010, 2011, 'fill_until'])

Input dataframe:

2007    2008    2009    2010    2011    fill_until
1       2       NaN     NaN     NaN     2009
1       3       NaN     NaN     NaN     2010
4       NaN     7       NaN     NaN     2011

Output dataframe:

2007    2008    2009    2010    2011
1       2       2       NaN     NaN
1       3       3       3       NaN
4       4       7       7       7
Iyad Al aqel asked Sep 19 '25 08:09


2 Answers

Use ffill + where:

m = df.columns[:-1].values <= df.fill_until.values[:, None]
df.iloc[:, :-1].ffill(axis=1).where(m)

   2007  2008  2009  2010  2011
0   1.0   2.0   2.0   NaN   NaN
1   1.0   3.0   3.0   3.0   NaN
2   4.0   4.0   7.0   7.0   7.0

Details

Use NumPy's broadcasting to build a boolean mask that marks, for each row, the columns up to and including the one named in the fill_until column.

m = df.columns[:-1].values <= df.fill_until.values[:, None]

Or,

m = (df.columns[:-1].values[:, None] <= df.fill_until.values).T

m    
array([[ True,  True,  True, False, False],
       [ True,  True,  True,  True, False],
       [ True,  True,  True,  True,  True]], dtype=bool)
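To see why the comparison yields one row of booleans per dataframe row, it can help to look at the shapes on their own. The following is a minimal sketch using plain NumPy arrays; the names years, fill_until and mask are illustrative and stand in for the column labels and the fill_until column of the question.

```python
import numpy as np

# Column labels (years) and the per-row limit, as plain arrays.
years = np.array([2007, 2008, 2009, 2010, 2011])  # shape (5,)
fill_until = np.array([2009, 2010, 2011])         # shape (3,)

# fill_until[:, None] has shape (3, 1). Comparing it with shape (5,)
# broadcasts both sides to (3, 5): one row per limit, one column per year.
mask = years <= fill_until[:, None]

print(mask.shape)        # (3, 5)
print(mask.sum(axis=1))  # [3 4 5] -> number of columns kept per row
```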

Now, slice out all but the last column, and call ffill along axis=1 (that is, across the columns of each row) -

i = df.iloc[:, :-1].ffill(axis=1)
i

   2007  2008  2009  2010  2011
0   1.0   2.0   2.0   2.0   2.0
1   1.0   3.0   3.0   3.0   3.0
2   4.0   4.0   7.0   7.0   7.0

Now, use the previously computed mask m to mask the values of i using df.where -

i.where(m)

   2007  2008  2009  2010  2011
0   1.0   2.0   2.0   NaN   NaN
1   1.0   3.0   3.0   3.0   NaN
2   4.0   4.0   7.0   7.0   7.0

Alternatively, use mask, inverting m -

i.mask(~m)

   2007  2008  2009  2010  2011
0   1.0   2.0   2.0   NaN   NaN
1   1.0   3.0   3.0   3.0   NaN
2   4.0   4.0   7.0   7.0   7.0
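The steps above can be collected into one small function. This is only a sketch under a few assumptions: the limit column is called fill_until, every other column label is an integer year, and ffill_until is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

def ffill_until(df, until_col="fill_until"):
    """Forward-fill each row, then blank everything past the row's limit."""
    values = df.drop(columns=until_col)
    # Assumes the remaining column labels are integer years.
    years = values.columns.astype(int).to_numpy()
    limits = df[until_col].to_numpy()[:, None]
    mask = years <= limits  # shape (n_rows, n_year_columns)
    return values.ffill(axis=1).where(mask)

df = pd.DataFrame([[1, 2, np.nan, np.nan, np.nan, 2009],
                   [1, 3, np.nan, np.nan, np.nan, 2010],
                   [4, np.nan, 7, np.nan, np.nan, 2011]],
                  columns=[2007, 2008, 2009, 2010, 2011, "fill_until"])

result = ffill_until(df)
print(result)
```

Dropping the column by name rather than by position means the function does not depend on fill_until being the last column.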
cs95 answered Sep 21 '25 20:09


You can use:

  • first, set the fill_until column as the index
  • create a boolean mask with NumPy broadcasting
  • use mask to replace the cells selected by the mask with the forward-filled values (ffill)
  • finally, call reset_index and reindex to restore the original column order

df = pd.DataFrame([[1, 2 ,np.nan,np.nan,10 , 2009], 
                   [1, 3 , np.nan , np.nan , np.nan , 2010], 
                   [4, np.nan , 7 , np.nan,np.nan , 2011]], 
                   columns=[2007,2008,2009,2010,2011 , 'fill_until'])
print (df)
   2007  2008  2009  2010  2011  fill_until
0     1   2.0   NaN   NaN  10.0        2009
1     1   3.0   NaN   NaN   NaN        2010
2     4   NaN   7.0   NaN   NaN        2011

df1 = df.set_index('fill_until')
m = df1.columns.values <= df1.index.values[:, None]
print (m)
[[ True  True  True False False]
 [ True  True  True  True False]
 [ True  True  True  True  True]]

df = df1.mask(m, df1.ffill(axis=1)).reset_index().reindex(columns=df.columns)
print (df)
   2007  2008  2009  2010  2011  fill_until
0     1   2.0   2.0   NaN  10.0        2009
1     1   3.0   3.0   3.0   NaN        2010
2     4   4.0   7.0   7.0   7.0        2011
jezrael answered Sep 21 '25 20:09
