
sklearn imputer drop column with missing values

I am currently learning about sklearn's imputers, and I found that there is one strategy the imputers don't implement.

I would like to build a pipeline that either deletes the columns with any missing values or deletes all the rows with missing values.

Why do I want this?

Because I would like to do a grid search and find the effect of each imputation method on my RMSE or classification score.

Is there a way I can do this with sklearn pipeline? Or should I create my own imputer?

If this has been asked before, feel free to suggest closing the question and pointing me to the correct resource.

For more context, I have 21 features and 1000 data points. Only one column has missing values, and 50% of the values in that column are missing. I just want to explore the effect of the missing value imputation method on my classifier's accuracy and F1 score.

Espoir Murhabazi asked Oct 15 '25 04:10


1 Answer

I would suggest using the autoimpute library. It's probably the best tool currently available for dealing with datasets that have missing values.

It has a function that does what you asked: it deletes the rows with any missing values.

from autoimpute.imputations import MiceImputer, SingleImputer, listwise_delete

listwise_delete(df, inplace=True, verbose=False)
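For comparison, plain pandas can do both kinds of deletion with DataFrame.dropna. This is only a minimal sketch, assuming the data lives in a pandas DataFrame; the toy frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: only "salary" has missing values.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "salary": [50_000.0, np.nan, 62_000.0, np.nan],
    "label": [0, 1, 0, 1],
})

# Listwise deletion: drop every row that contains at least one NaN.
rows_kept = df.dropna(axis=0)

# Column deletion: drop every column that contains at least one NaN.
cols_kept = df.dropna(axis=1)

print(rows_kept.shape)
print(cols_kept.columns.tolist())
```

Both calls return a new DataFrame and leave df unchanged.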

In general, sklearn's imputer is very limited in its usefulness, and autoimpute is able to fill a lot of the gaps. More specifically, it lets you:

  • Explicitly set columns that you would like to treat as variables in calculating the imputed values
  • Set different imputation algorithms for every column or a set of columns
# One imputation strategy per column, each with its own set of predictors
si_dict_col = SingleImputer(
    strategy={"gender": "categorical", "salary": "pmm", "weight": "pmm"},
    predictors={"gender": ["salary", "weight", "looks"], "salary": ["weight", "gender"]})

  • Use built-in methods to visualize the results of different imputation methods
from autoimpute.visuals import plot_imp_scatter

plot_imp_scatter(data_het_miss, "x", "y", "least squares")

It also follows sklearn's patterns, so its imputers can be substituted for sklearn's own imputer in a pipeline.
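If you would rather stay within sklearn, one way to compare "drop the column" against imputation in a grid search is to write a small transformer and let GridSearchCV swap the whole pipeline step. The sketch below is an illustration under stated assumptions, not an sklearn or autoimpute feature: DropMissingColumns is a hypothetical helper, the step name "missing" is arbitrary, and the data is synthetic with one column that is 50% missing.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class DropMissingColumns(TransformerMixin, BaseEstimator):
    """Hypothetical helper: drop every column that had a NaN during fit."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Remember which columns were complete in the training data.
        self.keep_mask_ = ~np.isnan(X).any(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return X[:, self.keep_mask_]


# Synthetic data: 5 features, column 0 is missing in every other row.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 1] + X[:, 2] > 0).astype(int)
X[::2, 0] = np.nan

pipe = Pipeline([
    ("missing", SimpleImputer()),  # placeholder, replaced by the grid
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each candidate replaces the whole "missing" step.
param_grid = {
    "missing": [
        DropMissingColumns(),
        SimpleImputer(strategy="mean"),
        SimpleImputer(strategy="median"),
    ]
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X, y)

for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(type(params["missing"]).__name__, round(score, 3))
```

Two caveats apply. A pipeline transformer only changes X, so deleting rows has to happen before the pipeline, where y can be filtered at the same time. Also, this helper only drops columns that had missing values during fit, so a column that is complete in the training fold but has gaps in the validation fold would still pass NaN to the classifier.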

user4718221 answered Oct 17 '25 19:10


