
Pandas conditional fillna based on another column values

I am working on the BigMart dataset and I would like to fill in the missing values of one column based on the values of another column. In practice:

     Outlet_Size  sales_bin
0         Medium  3000-4000
1         Medium     0-1000
2         Medium  2000-3000
3            NaN     0-1000
4           High     0-1000
...          ...        ...
8518        High  2000-3000
8519         NaN     0-1000
8520       Small  1000-2000
8521      Medium  1000-2000
8522       Small     0-1000

So if the train["Outlet_Size"] value is NaN and train["sales_bin"] is "0-1000", the train["Outlet_Size"] value should become "Small".
Otherwise, a missing value should become "Medium".

But I really don't know how to write this, and all the information I have found seems confusing to me.

Is it possible to do it? How?

Many thanks

Cesco83 asked Nov 01 '25 20:11


1 Answer

You can use pandas.Series.map instead of numpy.where.

pandas.Series.map seems handier for simple cases like this one: a dictionary (say {'0-1000': 'Small', '2000-3000': 'High'}) makes multiple imputations easy to write and explicit to read.

numpy.where is designed to handle more general logic (e.g. if a < 5 then a^2), which is not really needed in the OP's use case, and it comes at a cost: multiple imputations become tricky to handle because they require nested if-else calls.
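
For contrast, here is a rough sketch of what a two-rule imputation might look like with numpy.where. The small df below is made up for illustration (it is not the OP's data), and the exact layout of the conditions is only one possible way to write it. Each extra rule adds another level of nesting:

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({
    'Outlet_Size': ['Medium', np.nan, np.nan, 'High', np.nan],
    'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '1000-2000'],
})

missing = df['Outlet_Size'].isna()

# Two rules plus a default already need three nested np.where calls
df['Outlet_Size'] = np.where(
    missing & (df['sales_bin'] == '0-1000'), 'Small',
    np.where(
        missing & (df['sales_bin'] == '2000-3000'), 'High',
        # remaining missing values get the default, existing values are kept
        np.where(missing, 'Medium', df['Outlet_Size']),
    ),
)
print(df)
```

With a dictionary, the same two rules are just two key-value pairs.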

Steps:

  1. Generate a mask that tags the rows of the pandas.DataFrame where 'Outlet_Size' is missing, using pandas.Series.isna().
  2. Define a dictionary with the mappings, e.g. from '0-1000' to 'Small'.
  3. Replace the 'Outlet_Size' values in that subset of the pandas.DataFrame using pandas.Series.map, passing the dictionary as the arg argument. Keys that are not in the dictionary map to NaN.
  4. Use pandas.Series.fillna() to catch the 'Outlet_Size' values that are still missing and impute them with a default value.

Example:

import pandas as pd
import numpy as np
fake_dataframe = pd.DataFrame({
    'Outlet_Size' : ['Medium', 'Medium', 'Medium', np.nan, 'High', 'High', np.nan, 'Small', 'Medium', 'Small', np.nan, np.nan],
    'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '0-1000', '2000-3000', '0-1000', '1000-2000', '1000-2000', '0-1000', '2000-3000', '1000-2000']
})
# Step 1: boolean mask of the rows where 'Outlet_Size' is missing
missing_mask = fake_dataframe['Outlet_Size'].isna()
# Step 2: mapping from 'sales_bin' to the imputed 'Outlet_Size'
mapping_dict = {'0-1000': 'Small'}
# Step 3: map only the masked rows; bins not in the dictionary stay NaN
fake_dataframe.loc[missing_mask, 'Outlet_Size'] = fake_dataframe.loc[missing_mask, 'sales_bin'].map(mapping_dict)
# Step 4: default value for the rows that are still missing
fake_dataframe['Outlet_Size'] = fake_dataframe['Outlet_Size'].fillna('Medium')
print(fake_dataframe)
   Outlet_Size  sales_bin
0       Medium  3000-4000
1       Medium     0-1000
2       Medium  2000-3000
3        Small     0-1000
4         High     0-1000
5         High  2000-3000
6        Small     0-1000
7        Small  1000-2000
8       Medium  1000-2000
9        Small     0-1000
10      Medium  2000-3000
11      Medium  1000-2000

Example with multiple imputations:

import pandas as pd
import numpy as np
fake_dataframe = pd.DataFrame({
    'Outlet_Size' : ['Medium', 'Medium', 'Medium', np.nan, 'High', 'High', np.nan, 'Small', 'Medium', 'Small', np.nan, np.nan],
    'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '0-1000', '2000-3000', '0-1000', '1000-2000', '1000-2000', '0-1000', '2000-3000', '1000-2000']
})
missing_mask = fake_dataframe['Outlet_Size'].isna()
# Same steps as above; only the dictionary gains a second rule
mapping_dict = {'0-1000': 'Small', '2000-3000': 'High'}
fake_dataframe.loc[missing_mask, 'Outlet_Size'] = fake_dataframe.loc[missing_mask, 'sales_bin'].map(mapping_dict)
fake_dataframe['Outlet_Size'] = fake_dataframe['Outlet_Size'].fillna('Medium')
print(fake_dataframe)
   Outlet_Size  sales_bin
0       Medium  3000-4000
1       Medium     0-1000
2       Medium  2000-3000
3        Small     0-1000
4         High     0-1000
5         High  2000-3000
6        Small     0-1000
7        Small  1000-2000
8       Medium  1000-2000
9        Small     0-1000
10        High  2000-3000
11      Medium  1000-2000
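
As a possible shortcut, the mask can probably be dropped altogether, because pandas.Series.fillna also accepts a Series and only fills the positions that are missing, aligned on the index. This is a minimal sketch of that variant, not part of the steps above; the toy df is again made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({
    'Outlet_Size': ['Medium', np.nan, np.nan, 'High', np.nan],
    'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '1000-2000'],
})

mapping_dict = {'0-1000': 'Small', '2000-3000': 'High'}

df['Outlet_Size'] = (
    df['Outlet_Size']
    # fill missing sizes from the mapped bins; existing sizes are left untouched
    .fillna(df['sales_bin'].map(mapping_dict))
    # bins without a mapping are still NaN here, so apply the default
    .fillna('Medium')
)
print(df)
```

Row 3 keeps its 'High' value even though its bin is '0-1000', since fillna never overwrites values that are already present.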

SoufianeK answered Nov 03 '25 10:11


