
Handling missing categorical values ML

I have gone through "replace missing values in categorical data", which covers handling missing values in categorical data.

The dataset has about 6 categorical columns with missing values. This is for a binary classification problem.

I see different approaches: one is to leave the missing values in the categorical column as they are, another is to impute them using from sklearn.preprocessing import Imputer, but I am unsure which is the better option.

If imputing is the better option, which libraries could I use before applying a model such as LR, Decision Tree, or RandomForest?

Thanks!

pc_pyr asked Oct 24 '25 07:10



2 Answers

There are multiple ways to handle missing data:

  • Some models handle missing values natively (XGBoost and LightGBM, for example).
  • You can try to impute them with a model. Split your data into a train and a test set, and try different imputation models to measure which one works best. More often than not, though, this doesn't work very well. sklearn implements a KNNImputer.
  • You can also define rules: set missing values to 0, the mean, the median, or whatever works for your dataset. sklearn implements a SimpleImputer for this.
  • If none of the above works for you, you can also drop the rows with missing values.
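
As a rough illustration of the rule-based option for categorical columns, here is a minimal sketch using SimpleImputer. The column names and values are made up for the example, and the two strategies shown ("most_frequent" and "constant") are just two reasonable choices, not a recommendation for any particular dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two categorical columns with missing entries.
df = pd.DataFrame(
    {
        "color": ["red", "blue", np.nan, "red", "red"],
        "size": ["S", np.nan, "M", "M", np.nan],
    },
    dtype=object,
)

# Rule 1: replace each missing value with the column's most frequent category.
mode_imputer = SimpleImputer(strategy="most_frequent")
df_mode = pd.DataFrame(mode_imputer.fit_transform(df), columns=df.columns)

# Rule 2: treat "missing" as a category of its own via a constant placeholder.
const_imputer = SimpleImputer(strategy="constant", fill_value="missing")
df_const = pd.DataFrame(const_imputer.fit_transform(df), columns=df.columns)

print(df_mode)
print(df_const)
```

In a real workflow you would fit the imputer on the training split only and then call transform on the test split, so that the fill values are not learned from test data.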

More details on value imputation in sklearn: https://scikit-learn.org/stable/modules/impute.html

CoMartel answered Oct 26 '25 21:10



Adding to @CoMartel's answer:

  1. No specific rule guarantees good results. You need to try the known approaches one by one and observe your model's performance.

  2. If the ratio of missing values in a column is very high (say, more than 50% of the rows, though the threshold can vary), you are better off dropping that column.

  3. For missing categorical data, avoid the mean: suppose you encoded one category as 1 and the other as 2, and the mean comes out as 1.5; that value does not represent any category. The mode is a better option than the mean or the median.
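
Points 2 and 3 can be sketched with plain pandas. This is only an illustration under assumed data: the column names, the values, and the 0.5 threshold are hypothetical, and the threshold should be tuned for your own dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "plan" is mostly missing, "city" only partly.
df = pd.DataFrame(
    {
        "city": ["A", "B", np.nan, "A", "A", np.nan],
        "plan": [np.nan, np.nan, np.nan, "pro", np.nan, "free"],
    },
    dtype=object,
)

# Point 2: drop columns whose share of missing values exceeds the threshold.
threshold = 0.5  # assumed cut-off; adjust as needed
missing_ratio = df.isna().mean()
kept = df.loc[:, missing_ratio <= threshold]

# Point 3: fill the remaining gaps with each column's mode (most frequent value).
modes = kept.mode().iloc[0]
filled = kept.fillna(modes)

print(missing_ratio)
print(filled)
```

Here "plan" is dropped because four of its six values are missing, and the two gaps in "city" are filled with its most frequent category.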

Mehul Gupta answered Oct 26 '25 21:10
