I am currently learning about sklearn's imputers, and I found that there is one strategy they don't implement.
I would like to build a pipeline that either deletes the columns with any missing values or deletes all the rows with missing values.
Why do I want this?
Because I would like to do a grid search and find the effect of each imputation method on my RMSE or classification score.
Is there a way I can do this with sklearn pipeline? Or should I create my own imputer?
If this has been asked before, feel free to suggest closing the question and pointing me to the correct resource.
For more context, I have 21 features and 1000 data points. Only one column has missing values, and 50% of the values in that column are missing. I just want to explore the effect of the missing value imputation method on my classifier's accuracy and F1 score.
I would suggest using the autoimpute library. It's probably the best tool currently available for dealing with datasets that have missing values.
It has a function that does exactly what you asked: it deletes rows with any missing values.
from autoimpute.imputations import MiceImputer, SingleImputer, listwise_delete
listwise_delete(df, inplace=True, verbose=False)
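If you would rather not add a dependency, plain pandas covers both kinds of deletion from the question. This is only a small sketch on a made-up DataFrame (the column names and values are assumptions for illustration), using `DataFrame.dropna`:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: only column "b" has missing values.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, 1.0, np.nan, 2.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

# Listwise deletion: keep only the rows that have no missing values.
rows_dropped = df.dropna(axis=0)

# Column deletion: keep only the columns that have no missing values.
cols_dropped = df.dropna(axis=1)

print(rows_dropped.shape)
print(cols_dropped.columns.tolist())
```

Note that `dropna` returns a new DataFrame by default, so the original `df` is left untouched.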
In general, sklearn's imputer is very limited in its usefulness, and autoimpute is able to fill a lot of gaps. More specifically, it lets you, for example, set a strategy and predictors per column, and plot the imputed values:
si_dict_col = SingleImputer(
    strategy={"gender": "categorical", "salary": "pmm", "weight": "pmm"},
    predictors={"gender": ["salary", "weight", "looks"], "salary": ["weight", "gender"]},
)
plot_imp_scatter(data_het_miss, "x", "y", "least squares")
It also follows sklearn's patterns and can be substituted for sklearn's own imputer function in the pipeline.
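For the grid search part of the question, here is a rough sketch using only sklearn. `DropMissingColumns` is a hypothetical helper written for this example (it is not part of sklearn or autoimpute), and the data is a synthetic stand-in for the 1000 x 21 dataset described in the question. The idea is to treat "drop the column" as just another candidate for the imputation step:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class DropMissingColumns(TransformerMixin, BaseEstimator):
    """Hypothetical helper: drop every column that contained a NaN during fit."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Remember which columns were complete in the training data.
        self.keep_ = ~np.isnan(X).any(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return X[:, self.keep_]


# Synthetic stand-in for the question's data (an assumption, not real data):
# 1000 rows, 21 features, roughly 50% missing values in a single column.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 21))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(1000) < 0.5, 5] = np.nan

pipe = Pipeline([
    ("missing", SimpleImputer()),  # placeholder, replaced by the grid below
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each candidate is a whole transformer, so the grid search swaps the step.
param_grid = {
    "missing": [
        SimpleImputer(strategy="mean"),
        SimpleImputer(strategy="median"),
        DropMissingColumns(),
    ]
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_)
```

Row deletion does not fit this pattern, because a transformer inside a `Pipeline` only returns `X`, so the labels would go out of sync with the remaining rows. If you want to compare against row deletion, it is probably simplest to drop the rows (and the matching labels) before the data enters the pipeline and score that variant separately.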