Missing values are a common problem in data analysis. One common strategy seems to be to replace missing values with values randomly sampled from the distribution of the existing values.
Is there Python library code that conveniently performs this preprocessing step on a data frame? As far as I can see, the sklearn.preprocessing module does not offer this strategy.
To sample from a distribution of existing values you need to know the distribution. If the distribution is not known you can use kernel density estimation to fit it. This blog post has a nice overview of kernel density estimation implementations for Python: http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/.
There is an implementation in scikit-learn (see http://scikit-learn.org/stable/modules/density.html#kernel-density); sklearn's KernelDensity has a .sample() method. There is also a kernel density estimator in statsmodels (http://statsmodels.sourceforge.net/devel/generated/statsmodels.nonparametric.kernel_density.KDEMultivariate.html); it supports categorical features.
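As a rough illustration of the KernelDensity route, here is a minimal sketch for a single numeric column. The helper name impute_with_kde, the Gaussian kernel, and the bandwidth of 0.5 are my own assumptions for the example, not something sklearn provides out of the box; in practice the bandwidth should be tuned to the data.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity


def impute_with_kde(series, bandwidth=0.5, random_state=0):
    """Fill NaNs in a numeric Series with draws from a KDE fitted on the observed values.

    Hypothetical helper; bandwidth=0.5 is an arbitrary placeholder.
    """
    missing = series.isna()
    n_missing = int(missing.sum())
    # KernelDensity expects a 2-D array of shape (n_samples, n_features).
    observed = series.dropna().to_numpy(dtype=float).reshape(-1, 1)
    if n_missing == 0 or len(observed) == 0:
        return series.copy()

    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(observed)
    draws = kde.sample(n_samples=n_missing, random_state=random_state)

    filled = series.astype(float).copy()
    filled.loc[missing] = draws.ravel()
    return filled


ages = pd.Series([23.0, np.nan, 31.0, 27.0, np.nan, 45.0], name="age")
ages_filled = impute_with_kde(ages)
```

Note that the sampled values are generally not values that occur in the data, and that this treats the column in isolation, ignoring the other columns in the row.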
Another method is to choose random existing values, without trying to generate values not seen in the dataset. The problem with this solution is that a value could depend on other values in the same row, and sampling at random without taking this into account may produce unrealistic examples.
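A minimal sketch of this second approach with pandas and NumPy could look like the following. The function name impute_by_resampling is hypothetical, and each column is sampled independently, so it has exactly the weakness described above.

```python
import numpy as np
import pandas as pd


def impute_by_resampling(df, random_state=0):
    """Fill NaNs in each column with values drawn at random from that column's observed values.

    Hypothetical helper; columns are treated independently of each other.
    """
    rng = np.random.default_rng(random_state)
    result = df.copy()
    for col in result.columns:
        missing = result[col].isna()
        observed = result.loc[~missing, col].to_numpy()
        # Skip columns with nothing to fill or nothing to sample from.
        if missing.any() and len(observed) > 0:
            result.loc[missing, col] = rng.choice(
                observed, size=int(missing.sum()), replace=True
            )
    return result


df = pd.DataFrame(
    {
        "height": [1.70, np.nan, 1.82, 1.65, np.nan],
        "city": ["Oslo", "Rome", None, "Oslo", "Lima"],
    }
)
df_filled = impute_by_resampling(df)
```

Sampling with replacement keeps the marginal distribution of each column roughly intact, but a filled row may combine values that never occur together in the real data.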