
Impute missing values by sampling from the distribution of existing ones

Missing values are a common problem in data analysis. One common strategy is to replace missing values with values randomly sampled from the distribution of the existing values.

Is there Python library code that conveniently performs this preprocessing step on a data frame? As far as I can see, the sklearn.preprocessing module does not offer this strategy.

clstaudt asked Sep 07 '25 07:09


1 Answer

To sample from the distribution of existing values, you need to know that distribution. If it is not known, you can estimate it with kernel density estimation. This blog post has a nice overview of kernel density estimation implementations for Python: http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

There is an implementation in scikit-learn (see http://scikit-learn.org/stable/modules/density.html#kernel-density); sklearn's KernelDensity has a .sample() method. There is also a kernel density estimator in statsmodels (http://statsmodels.sourceforge.net/devel/generated/statsmodels.nonparametric.kernel_density.KDEMultivariate.html); it supports categorical features.
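A minimal sketch of the scikit-learn route, for a single numeric column: fit KernelDensity on the observed values and fill each gap with a draw from .sample(). The helper name impute_with_kde, the example data, and the bandwidth of 0.5 are assumptions for illustration only; the bandwidth in particular has to be tuned to the scale of your data.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity


def impute_with_kde(series, bandwidth=0.5, random_state=0):
    """Fill NaNs in a numeric Series with draws from a Gaussian KDE.

    Hypothetical helper; bandwidth=0.5 is an arbitrary placeholder.
    """
    missing = series.isna()
    # KernelDensity expects a 2-D array of shape (n_samples, n_features).
    observed = series[~missing].to_numpy(dtype=float).reshape(-1, 1)
    filled = series.astype(float).copy()
    if not missing.any() or len(observed) == 0:
        return filled

    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(observed)
    # .sample() returns shape (n_samples, n_features); flatten to 1-D.
    draws = kde.sample(n_samples=int(missing.sum()), random_state=random_state)
    filled.loc[missing] = draws.ravel()
    return filled


# Made-up example data.
ages = pd.Series([23.0, np.nan, 31.0, 35.0, np.nan, 28.0], name="age")
ages_filled = impute_with_kde(ages)
print(ages_filled)
```

Because the draws come from a smoothed density, the imputed values are generally not values that occur in the data, and they can fall outside the observed range.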

Another method is to pick existing values at random, without trying to generate values not seen in the dataset. The problem with this solution is that a value can depend on other values in the same row, and sampling (e.g. with random.sample) without taking this into account may produce unrealistic examples.
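A rough sketch of that second approach with pandas and NumPy: each column is filled independently by drawing, with replacement, from its own observed values. The helper name impute_by_resampling and the example frame are made up for illustration, and this is exactly the naive variant described above, so it ignores any dependence between columns.

```python
import numpy as np
import pandas as pd


def impute_by_resampling(df, random_state=0):
    """Fill NaNs column by column with random draws from observed values.

    Hypothetical helper; columns are treated as independent.
    """
    rng = np.random.default_rng(random_state)
    out = df.copy()
    for col in out.columns:
        missing = out[col].isna()
        observed = out.loc[~missing, col].to_numpy()
        # Skip columns with nothing to fill or nothing to draw from.
        if missing.any() and len(observed) > 0:
            out.loc[missing, col] = rng.choice(
                observed, size=int(missing.sum()), replace=True
            )
    return out


# Made-up example data with one numeric and one categorical column.
df = pd.DataFrame(
    {
        "height": [1.70, np.nan, 1.82, 1.65, np.nan],
        "city": ["Oslo", "Rome", None, "Rome", "Oslo"],
    }
)
filled = impute_by_resampling(df)
print(filled)
```

This keeps each column's marginal distribution roughly intact and works for categorical columns too, but the filled rows may combine values that never occur together in the real data.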

Mikhail Korobov answered Sep 08 '25 19:09