I have a DataFrame column that holds categorical data, but some of the values are missing (NaN). I want to fill the missing values using linear interpolation, but I'm not sure how to go about it. I can't drop the NaNs before converting the column to a categorical type, because those are exactly the values I need to fill. A simple example of what I'm trying to do:
col1  col2
5     cloudy
3     windy
6     NaN
7     rainy
10    NaN
Say I want to convert col2 to categorical data while keeping the NaNs, and then fill them using linear interpolation. How do I go about it? Let's say that after converting the column to categorical codes it looks like this:
col2
1
2
NaN
3
NaN
Then I can do linear interpolation and get something like this
col2
1
2
3
3
2
How can I achieve this?
Step 1: Find the most frequent category in each column using mode(). Step 2: Replace all NaN values in that column with that category. Step 3: Drop the original columns and keep the newly imputed columns. Advantage: simple and easy to implement for categorical variables/columns.
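The three steps above can be sketched roughly as follows. This is a minimal illustration, not the original poster's code: the sample frame adds a sixth row to the question's data so that one category is actually the most frequent, and the column name col2_imputed is made up for the example.

```python
import numpy as np
import pandas as pd

# Sample data (assumption: one extra "cloudy" row so a single mode exists).
df = pd.DataFrame({
    "col1": [5, 3, 6, 7, 10, 5],
    "col2": ["cloudy", "windy", np.nan, "rainy", np.nan, "cloudy"],
})

# Step 1: mode() skips NaN by default; [0] takes the first value if there is a tie.
most_common = df["col2"].mode()[0]

# Step 2: fill the missing values with that category (hypothetical column name).
df["col2_imputed"] = df["col2"].fillna(most_common)

# Step 3: drop the original column and keep the imputed one.
df = df.drop(columns="col2")
print(df)
```

Note that this ignores row order entirely, so it is a different technique from the interpolation the question asks about.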
The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column. This has the benefit of not weighting a value improperly. There are many libraries out there that support one-hot encoding, but the simplest option is pandas' get_dummies() function.
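As a small sketch of that encoding on the question's col2 values: passing dummy_na=True is my own choice here, so that missing entries get their own indicator column instead of becoming a row of all zeros.

```python
import numpy as np
import pandas as pd

s = pd.Series(["cloudy", "windy", np.nan, "rainy", np.nan], name="col2")

# One indicator column per category, plus a trailing NaN column for missing values.
dummies = pd.get_dummies(s, dummy_na=True)
print(dummies)
```

The encoding only marks where values are missing; it does not fill them.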
Imputation Method 1: Most Common Class. One approach to imputing categorical features is to replace missing values with the most common class. You can do this by taking the index of the most common class from pandas' value_counts function.
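A short sketch of the value_counts variant, again with an extra "cloudy" value added as an assumption so that one class is clearly the most common:

```python
import numpy as np
import pandas as pd

s = pd.Series(["cloudy", "windy", np.nan, "rainy", np.nan, "cloudy"])

# value_counts() excludes NaN and sorts by frequency, most frequent first.
counts = s.value_counts()
most_common = counts.index[0]

filled = s.fillna(most_common)
print(filled)
```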
UPDATE:
Is there a way to convert the data back to its original form after interpolation, i.e. instead of 1, 2 or 3 you have cloudy, windy and rainy again?
Solution: I've intentionally added more rows to your original DF:
In [129]: df
Out[129]:
   col1    col2
0     5  cloudy
1     3   windy
2     6     NaN
3     7   rainy
4    10     NaN
5     5  cloudy
6    10     NaN
7     7   rainy
In [130]: df.dtypes
Out[130]:
col1       int64
col2    category
dtype: object
In [131]: df.col2 = (df.col2.cat.codes.replace(-1, np.nan)
     ...:              .interpolate().astype(int).astype('category')
     ...:              .cat.rename_categories(df.col2.cat.categories))
     ...:
In [132]: df
Out[132]:
   col1    col2
0     5  cloudy
1     3   windy
2     6   rainy
3     7   rainy
4    10  cloudy
5     5  cloudy
6    10  cloudy
7     7   rainy
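For reference, here is the same round trip as a standalone script. It is a sketch with one deliberate change: the labels are restored with pd.Categorical.from_codes instead of rename_categories, because, as far as I can tell, rename_categories with a list needs the interpolated codes to contain every original category, which is not guaranteed in general.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [5, 3, 6, 7, 10, 5, 10, 7],
    "col2": pd.Categorical(
        ["cloudy", "windy", np.nan, "rainy", np.nan, "cloudy", np.nan, "rainy"]
    ),
})

categories = df["col2"].cat.categories   # sorted: cloudy, rainy, windy
codes = df["col2"].cat.codes             # integer codes, -1 marks a missing value

interpolated = (
    codes.astype("float64")
         .replace(-1, np.nan)            # turn the -1 sentinel into NaN
         .interpolate()                  # linear interpolation on the codes
         .astype(int)                    # truncates fractional codes toward zero
)

# Map the integer codes back to the original labels.
df["col2"] = pd.Categorical.from_codes(interpolated.to_numpy(), categories=categories)
print(df)
```

Two caveats: the result depends on the (alphabetical) order of the categories, since the codes have no natural numeric meaning, and a NaN in the first row would survive interpolate() and make astype(int) fail.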
OLD "numerical" answer:
IIUC you can do this:
In [66]: df
Out[66]:
   col1    col2
0     5  cloudy
1     3   windy
2     6     NaN
3     7   rainy
4    10     NaN
first let's factorize col2:
In [67]: df.col2 = pd.factorize(df.col2, na_sentinel=-2)[0] + 1
In [68]: df
Out[68]:
   col1  col2
0     5     1
1     3     2
2     6    -1
3     7     3
4    10    -1
now we can interpolate it (replacing -1's with NaN's):
In [69]: df.col2.replace(-1, np.nan).interpolate().astype(int)
Out[69]:
0    1
1    2
2    2
3    3
4    3
Name: col2, dtype: int32
the same approach, but converting interpolated series to category dtype:
In [70]: df.col2.replace(-1, np.nan).interpolate().astype(int).astype('category')
Out[70]:
0    1
1    2
2    2
3    3
4    3
Name: col2, dtype: category
Categories (3, int64): [1, 2, 3]
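A note on versions: the na_sentinel argument used above was deprecated and later removed from pd.factorize, so that line may fail on recent pandas. A rough equivalent that relies only on the default behaviour (missing values get code -1) and also maps the result back to the labels could look like this. The codes here are 0-based rather than 1-based, and the variable names are mine.

```python
import numpy as np
import pandas as pd

s = pd.Series(["cloudy", "windy", np.nan, "rainy", np.nan], name="col2")

# codes: one integer per row (-1 for NaN); uniques: labels in order of appearance.
codes, uniques = pd.factorize(s)

numeric = pd.Series(codes, index=s.index, dtype="float64").replace(-1, np.nan)
filled_codes = numeric.interpolate().astype(int)   # 1.5 is truncated to 1

# Look the labels up again by position in `uniques`.
labels = pd.Series(uniques.take(filled_codes.to_numpy()), index=s.index, name=s.name)
print(labels)
```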
I know you're asking for linear interpolation, but this is another, easier way to do it. Since converting categories to numbers isn't such a good idea, I suggest this instead.
You can simply use the interpolate method in pandas with method='pad', like this:
df.interpolate(method='pad')
You can also see the other methods, and examples of using them, in the pandas documentation on interpolation.
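On recent pandas releases, interpolate(method='pad') emits a deprecation warning, and forward fill is available directly as ffill(). A minimal sketch on the question's data, assuming forward fill is an acceptable substitute for interpolation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [5, 3, 6, 7, 10],
    "col2": ["cloudy", "windy", np.nan, "rainy", np.nan],
})

# Each NaN is replaced by the last valid value above it.
filled = df.ffill()
print(filled)
```

This works on the string labels directly, so no conversion to numeric codes is needed. A NaN in the first row would stay NaN, because there is no earlier value to carry forward.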