Handling missing categorical values ML

Question

I have gone through the question "replace missing values in categorical data", which is about handling missing values in categorical data.

My dataset has about 6 categorical columns with missing values. It is for a binary classification problem.

I see different approaches: one is to leave the missing values in the categorical column as they are, another is to impute them using from sklearn.preprocessing import Imputer, but I am unsure which is the better option.

If imputing is the better option, which libraries could I use before applying a model such as LR, Decision Tree, or RandomForest?

Thanks!

CoMartel · Accepted Answer

There are multiple ways to handle missing data:

Some models handle missing values natively (XGBoost and LightGBM, for example).
You can try to impute them with a model. Split your data into a train and a test set, and try different models to measure which one works best. More often than not, though, it doesn't work very well. There is a KNNImputer implemented in sklearn (it expects numeric input, so categorical columns have to be encoded first).
You can also define rules: set missing values to 0, the mean, the median, or whatever works for your dataset. There is a SimpleImputer implemented in sklearn.
If none of the above works for you, you can also drop the rows with missing values.
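As a rough illustration of the rule-based option, here is a minimal sketch that compares two SimpleImputer strategies for categorical columns with cross-validation. The toy data, the column names (color, size) and the make_model helper are made up for the example. Note that SimpleImputer lives in sklearn.impute; the Imputer class from sklearn.preprocessing mentioned in the question was removed in later sklearn releases.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns and a binary target.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(
    {
        "color": rng.choice(["red", "green", "blue"], size=n),
        "size": rng.choice(["S", "M", "L"], size=n),
    },
    dtype=object,
)
y = (df["color"] == "red").astype(int).to_numpy()

# Knock out roughly 20% of each column to simulate missing values.
df.loc[rng.random(n) < 0.2, "color"] = np.nan
df.loc[rng.random(n) < 0.2, "size"] = np.nan


def make_model(imputer):
    """Impute, one-hot encode, then fit a logistic regression (illustrative helper)."""
    categorical = Pipeline(
        [
            ("impute", imputer),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]
    )
    pre = ColumnTransformer([("cat", categorical, ["color", "size"])])
    return Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])


# Two candidate rules: fill with the mode, or treat "missing" as its own category.
candidates = {
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "missing_as_category": SimpleImputer(strategy="constant", fill_value="missing"),
}

# Mean 5-fold accuracy per strategy; the imputer is refit inside each fold.
scores = {
    name: cross_val_score(make_model(imp), df, y, cv=5).mean()
    for name, imp in candidates.items()
}
print(scores)
```

Keeping the imputer inside the pipeline means it is fitted only on the training folds, so the comparison is not contaminated by the held-out data. Which strategy wins depends on the dataset, so treat the scores on this toy data as illustrative only.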

More details on imputing values in sklearn: https://scikit-learn.org/stable/modules/impute.html

Mehul Gupta · Answer

Adding to @CoMartel,

There is no specific rule that guarantees good results. You need to try the known approaches one by one and observe your model's performance.
But if the ratio of missing values is very high for a column (say, more than 50% of the total rows, though the threshold can vary), you are better off dropping that column.
Also, if categorical data is missing, avoid the mean: suppose you encoded one category as 1 and the other as 2, and the mean comes out as 1.5; it does not represent any category. The mode is a better option than the mean or median.
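The points above can be sketched with pandas. This is only an illustration: the toy frame, the column names (city, plan) and the 0.5 cutoff are assumptions, not values from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing categorical values.
df = pd.DataFrame(
    {
        "city": ["Paris", "Lyon", np.nan, "Paris", "Paris", np.nan],
        "plan": [np.nan, np.nan, np.nan, "pro", np.nan, "free"],
    },
    dtype=object,
)

# Fraction of missing rows per column.
missing_ratio = df.isna().mean()

# Drop columns whose missing ratio exceeds the (illustrative) threshold.
threshold = 0.5
kept = df.loc[:, missing_ratio <= threshold].copy()

# Fill the remaining gaps with each column's mode (most frequent value).
for col in kept.columns:
    kept[col] = kept[col].fillna(kept[col].mode().iloc[0])

# Why the mean is a poor fit for encoded categories:
# the average of the codes is not itself a valid code.
codes = pd.Series([1, 1, 2, np.nan])
mean_code = codes.mean()            # falls between 1 and 2
mode_code = codes.mode().iloc[0]    # an actual category code

print(kept)
print(mean_code, mode_code)
```

Here plan is dropped because more than half of its rows are missing, while the gaps in city are filled with its most frequent value.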

Tags:

python

missing-data

machine-learning

imputation

classification

pc_pyr
