I am surprised that sklearn.preprocessing.Imputer does not offer the following strategy for filling missing values: For any missing value, sample uniformly at random one value from the given values and replace.
I assume that this is a better strategy than replacing with the mean, the most frequent or the median value, as it does not produce an artificial spike in the distribution of values.
Do I need to write a transformer that does this myself?
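To make concrete what I mean, here is a rough sketch of the kind of transformer I have in mind (DataFrame-only; the class name RandomValueImputer and all details are just illustrative, this is not an existing scikit-learn class, and edge cases such as all-NaN columns are ignored):

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RandomValueImputer(BaseEstimator, TransformerMixin):
    """Fill each NaN with a value drawn uniformly at random from the
    observed (non-missing) values of the same column."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y=None):
        # Remember the observed values per column so we can sample from them later.
        self.observed_ = {col: X[col].dropna().to_numpy() for col in X.columns}
        return self

    def transform(self, X):
        rng = np.random.default_rng(self.random_state)
        X = X.copy()
        for col in X.columns:
            mask = X[col].isna()
            if mask.any():
                # One independent draw per missing cell in this column.
                X.loc[mask, col] = rng.choice(self.observed_[col], size=mask.sum())
        return X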
P.S. On a more meta level, I am always a bit puzzled when I cannot find what I consider a straightforward, almost standard operation as a component in a library like scikit-learn. It makes me wonder: is the library simply incomplete, or am I trying to do something that goes against best practices? Any advice?
I am arriving a bit late to this discussion, but since I saw it, I thought I'd add my two cents.
With the open-source Python library Feature-engine, we can perform random sample imputation right away. I leave here a link to the RandomSampleImputer.
In the following snippet I show that the functionality is very similar to that of Scikit-learn transformers:
import pandas as pd
import numpy as np
from feature_engine.imputation import RandomSampleImputer

# Small example frame with missing values in a numerical and a categorical column.
X = pd.DataFrame(dict(
    x1=[np.nan, 1, 1, 0, np.nan],
    x2=["a", np.nan, "b", np.nan, "a"],
))

# fit() stores the training data; transform() replaces each NaN with a random
# draw from the observed values of the same column.
rsi = RandomSampleImputer()
rsi.fit(X)
rsi.transform(X)
The output will be:
    x1 x2
0  1.0  a
1  1.0  b
2  1.0  b
3  0.0  a
4  1.0  a
where the missing data were replaced by random samples extracted from the original variables (where values were available).
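If reproducibility matters, the imputer can be seeded; as far as I recall, RandomSampleImputer accepts a random_state parameter (parameter name from memory, please check the Feature-engine docs):

# With a fixed seed, repeated transform() calls on the same data should
# return the same imputed values.
rsi = RandomSampleImputer(random_state=42)
rsi.fit(X)
rsi.transform(X)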
Regarding this method not being included in Scikit-learn: the developers tend to include only methods that are well documented (scientifically, where possible), and this is not one of them.
I'd also say that this is not a standard method of imputation. When talking about simple univariate methods, the most widely used are mean, median, mode, and arbitrary imputation.
This method does preserve the variable's distribution (whatever that distribution is), because a random sample of a variable shows the same distribution by definition. On the downside, it introduces an element of randomness that is difficult to account for, particularly when we want to put the model into production.
In an extreme example, say we have 2 patients with identical data in 9 out of 10 variables, and the value of the 10th variable is missing for both. With random imputation, each patient will most likely receive a different value, which will in turn lead to a different prediction. That is a no-go in terms of being fair to our customers: we would be offering different solutions to patients with identical characteristics.
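A toy illustration of that point in plain pandas/NumPy (column name and values are made up for the example):

import numpy as np
import pandas as pd

rng = np.random.default_rng()
observed = np.array([0.1, 0.5, 0.9, 1.3])            # values seen during training
patients = pd.DataFrame({"x10": [np.nan, np.nan]})   # two otherwise identical patients, x10 missing

# Each missing entry gets an independent random draw, so the two "identical"
# patients can easily end up with different imputed values and hence predictions.
patients["x10"] = rng.choice(observed, size=len(patients))
print(patients)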
In addition, if we put the model into production, we would have to store a copy of the training dataset in order to draw the random samples, which, if the dataset is big, can be quite memory-intensive.