Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interpolation still leaving NaN's (pandas groupby)

I have a Dataframe with the location of some customers (so I have a column with Customer_id and others with Lat and Lon) and I am trying to interpolate the NaN's according to each customer.

For example, if I interpolate with the nearest approach here (I made up the values here):

 Customer_id   Lat    Lon
   A            1      1
   A            NaN    NaN  
   A            2      2      
   B            NaN    NaN
   B            4      4

I would like the NaN for B to be 4 and not 2.

I have tried this

series.groupby('Customer_id').apply(lambda group: group.interpolate(method = 'nearest', limit_direction = 'both'))

And the number of NaN's goes down from 9003 to 94. But I'm not understanding why it is still leaving some missing values.

I checked and these 94 missing values corresponded to records from customers that were already being interpolated. For example,

    Customer_id   Lat
 0.    A           1
 1.    A           NaN
 2.    A           NaN
 3.    A           NaN
 4.    A           NaN

It would interpolate correctly until some value (let's say it interpolates 1, 2 and 3 correctly) and then leaves 4 as NaN.

I have tried to set a limit in interpolate greater than the maximum number of records per client but it is still not working out. I don't know where my mistake is, can somebody help out?

(I don't know if it's relevant to mention or not but I fabricated my own NaN's for this. This is the code I used Replace some values in a dataframe with NaN's if the index of the row does not exist in another dataframe I think the problem isn't here but since I'm very confused as to where the issue actually is I'll just leave it here)

like image 470
Nocas Avatar asked Oct 31 '25 05:10

Nocas


1 Answers

When you interpolate with nearest it is only able to fill in-between missing values. (You'll notice this because you get an error when there's only 1 non-null value, like in your example). The remaining null values are "edges" which are taken care of with .bfill().ffill() for the nearest logic. This is also the appropriate logic to "interpolate" with only one non-missing value.

def my_interp(x):
    if x.notnull().sum() > 1:
        return x.interpolate(method='nearest').ffill().bfill()
    else:
        return x.ffill().bfill()

df.groupby('Customer_id').transform(my_interp)

#   Lat  Lon
#0  1.0  1.0
#1  1.0  1.0
#2  2.0  2.0
#3  4.0  4.0
#4  4.0  4.0
like image 175
ALollz Avatar answered Nov 01 '25 20:11

ALollz