Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Multiindex Rows and Columns: Replace NaN with Value from Matching Row

Given the following:

import pandas as pd
import numpy as np
df=pd.DataFrame({'County':['A','B','A','B','A','B','A','B','A','B'],
                'Hospital':['a','b','e','f','i','j','m','n','b','r'],
                'Enrollment':[44,55,95,54,81,54,89,76,1,67],
                'Year':['2012','2012','2012','2012','2012','2013',
                        '2013','2013','2013','2013']})
d2=pd.pivot_table(df,index=['County','Hospital'],columns=['Year'])#.sort_columns

d2

        Enrollment
       Year     2012    2013
County Hospital         
A       a       44.0    NaN
        b       NaN     1.0
        e       95.0    NaN
        i       81.0    NaN
        m       NaN     89.0
B       b       55.0    NaN
        f       54.0    NaN
        j       NaN     54.0
        n       NaN     76.0
        r       NaN     67.0

If a hospital such as 'b' exists more than once and it has no data for the previous year (the first occurrence of 'b'), I'd like to assign the previous year value of Enrollment for the other row ('b') and delete the row of 'b' that does not contain data for the first year like this:

        Enrollment
       Year     2012    2013
County Hospital         
A       a       44.0    NaN
        b       55.0    1.0
        e       95.0    NaN
        i       81.0    NaN
        m       NaN     89.0
B       f       54.0    NaN
        j       NaN     54.0
        n       NaN     76.0
        r       NaN     67.0

So far, I can identify the duplicate row and delete, but I'm just stuck on replacing NaN with a value where needed:

  1. Identify duplicate hospitals after resetting the index:

    d2=d2.reset_index()    
    d2['dup']=d2.duplicated('Hospital',keep=False)
    
  2. Flag, for deletion, the duplicate hospital with no data for the most recent year:

     Hospital=d2.columns.levels[0][1]
    
     Y1=d2.columns.levels[1][0]
    
     Y2=d2.columns.levels[1][1]
    
     d2['Delete']=np.nan
    
     d2.loc[(pd.isnull(d2.Enrollment[Y2]))&(d2['dup']==True),'Delete']='Yes'
    
  3. Keep all but rows to delete:

     d2=d2.loc[d2['Delete']!='Yes']
    
like image 901
Dance Party2 Avatar asked Nov 27 '25 07:11

Dance Party2


1 Answers

If I understand correctly, the problem is to copy values from County B into County A when the Hospital matches. That can be done with groupby/fillna(method='bfill'). The bfill method backfills NaNs with the closest succeeding non-NaN value.

Then, you can use d2.drop_duplicates(subset=['Hospital'], keep='first') to keep the first row when the Hospital matches.


For example,

import pandas as pd

df = pd.DataFrame({'County': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
                   'Hospital': ['a', 'b', 'e', 'f', 'i', 'j', 'm', 'n', 'b', 'r'],
                   'Enrollment': [44, 55, 95, 54, 81, 54, 89, 76, 1, 67],
                   'Year': ['2012', '2012', '2012', '2012', '2012', '2013',
                            '2013', '2013', '2013', '2013']})
d2 = pd.pivot_table(df, index=['County', 'Hospital'], columns=['Year'])  
d2 = d2.groupby(level='Hospital').fillna(method='bfill')
d2 = d2.reset_index()
d2 = d2.drop_duplicates(subset=['Hospital'], keep='first')

yields

     County Hospital Enrollment      
Year                       2012  2013
0         A        a       44.0   NaN
1         A        b       55.0   1.0
2         A        e       95.0   NaN
3         A        i       81.0   NaN
4         A        m        NaN  89.0
6         B        f       54.0   NaN
7         B        j        NaN  54.0
8         B        n        NaN  76.0
9         B        r        NaN  67.0
like image 50
unutbu Avatar answered Nov 29 '25 23:11

unutbu