I have gone through the existing question "replace missing values in categorical data" regarding handling missing values in categorical data.
My dataset has about 6 categorical columns with missing values, and this is for a binary classification problem.
I have seen different approaches: one is to simply leave the missing values in the categorical column as their own category, another is to impute them using from sklearn.preprocessing import Imputer, but I am unsure which is the better option.
If imputing is the better option, which libraries could I use before applying a model such as Logistic Regression, Decision Tree, or Random Forest?
Thanks!
There are multiple ways to handle missing data: you can leave the missing values as their own category, impute them (for example with the most frequent value), or drop the affected rows or columns.
More details on imputing values in sklearn: https://scikit-learn.org/stable/modules/impute.html
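As a minimal sketch, assuming a pandas DataFrame with hypothetical column names and a recent scikit-learn version (where sklearn.preprocessing.Imputer has been replaced by sklearn.impute.SimpleImputer), the two options from the question could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing categorical values (hypothetical column names)
df = pd.DataFrame({
    "color": ["red", np.nan, "blue", "red"],
    "size": ["S", "M", np.nan, "M"],
})

# Option 1: treat "missing" as its own category
as_category = df.fillna("Missing")

# Option 2: impute each column with its most frequent value
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(as_category)
print(imputed)
```

Either result can then be one-hot or label encoded and fed to models like Logistic Regression, Decision Tree, or Random Forest.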
Adding to @CoMartel,
There is no specific rule that guarantees good results; you need to try the known approaches one by one and observe your model's performance.
However, if the ratio of missing values in a column is very high (say, more than 50% of the rows, though the threshold can vary), it is usually better to drop that column, as sketched below.
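A minimal pandas sketch of that filtering step, assuming a DataFrame named df and an illustrative 50% threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with one heavily-missing column
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, "A"],
    "mostly_present": ["x", "y", np.nan, "z"],
})

threshold = 0.5                    # drop columns with more than 50% missing (tune as needed)
missing_ratio = df.isna().mean()   # fraction of missing values per column
df_reduced = df.loc[:, missing_ratio <= threshold]

print(missing_ratio)
print(df_reduced.columns.tolist())  # ['mostly_present']
```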
Also, for missing categorical data you should avoid imputing with the mean: suppose you encoded one category as 1 and another as 2, and the mean comes out to, say, 1.5; that value does not correspond to any actual category. The mode is a better option than the mean or median, as illustrated below.
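To make the argument concrete, here is a small sketch (with made-up category codes) showing that the mean of encoded categories can be a value that is not a category at all, while the mode always is:

```python
import numpy as np
import pandas as pd

# A categorical column encoded as numbers, with one missing value
encoded = pd.Series([1, 2, 2, 2, 1, np.nan])

print(encoded.mean())     # 1.6 -- not a valid category code
print(encoded.mode()[0])  # 2.0 -- an actual category (the most frequent one)

# Imputing with the mode keeps the column within its valid categories
imputed = encoded.fillna(encoded.mode()[0])
print(imputed.tolist())   # [1.0, 2.0, 2.0, 2.0, 1.0, 2.0]
```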