Impute missing values by sampling from the distribution of existing ones

Question

Missing values are a common problem in data analysis. One common strategy seems to be to replace missing values with values randomly sampled from the distribution of the existing values.

Is there Python library code that conveniently performs this preprocessing step on a data frame? As far as I can see, the sklearn.preprocessing module does not offer this strategy.

Mikhail Korobov · Accepted Answer

To sample from the distribution of existing values, you need to know that distribution. If it is not known, you can use kernel density estimation to estimate it from the data. This blog post has a nice overview of kernel density estimation implementations for Python: http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/.

There is an implementation in scikit-learn (see http://scikit-learn.org/stable/modules/density.html#kernel-density); sklearn's KernelDensity has a .sample() method. There is also a kernel density estimator in statsmodels (http://statsmodels.sourceforge.net/devel/generated/statsmodels.nonparametric.kernel_density.KDEMultivariate.html); it supports categorical features.
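As a rough illustration of the KernelDensity route, here is a minimal sketch for a single numeric column. The helper name impute_with_kde, the example column, and the bandwidth value are assumptions made for this example and not part of any library; in practice the bandwidth would need tuning for the data. It uses the Gaussian kernel, since .sample() is only implemented for the gaussian and tophat kernels.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity


def impute_with_kde(series, bandwidth=0.5, random_state=0):
    """Fill NaNs in a numeric Series with draws from a Gaussian KDE
    fitted to the observed (non-missing) values.

    Hypothetical helper for illustration; bandwidth is an assumed default.
    """
    missing = series.isna()
    n_missing = int(missing.sum())
    # KernelDensity expects a 2-D array of shape (n_samples, n_features).
    observed = series.dropna().to_numpy(dtype=float).reshape(-1, 1)
    if n_missing == 0 or len(observed) == 0:
        return series.copy()

    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(observed)
    # sample() returns an array of shape (n_missing, 1).
    draws = kde.sample(n_samples=n_missing, random_state=random_state)

    filled = series.astype(float).copy()
    filled.loc[missing] = draws.ravel()
    return filled


# Made-up example data.
df = pd.DataFrame({"height": [1.62, np.nan, 1.80, 1.75, np.nan, 1.68]})
df["height"] = impute_with_kde(df["height"], bandwidth=0.05)
```

Because the draws come from a smoothed density, the imputed values will generally not coincide with any observed value, and with a large bandwidth they can fall outside the observed range.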

Another method is to choose randomly among the existing values, without trying to generate values not seen in the dataset. The problem with this solution is that a value could depend on other values in the same row, and sampling at random (e.g. with random.sample) without taking this into account may produce unrealistic examples.
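A minimal sketch of this second approach with pandas and NumPy might look like the following. The function name impute_by_resampling and the example data are made up for illustration. Each column is resampled independently, so this is exactly the variant that ignores dependencies between values in the same row.

```python
import numpy as np
import pandas as pd


def impute_by_resampling(df, random_state=0):
    """Return a copy of df where each missing entry is replaced by a value
    drawn at random (with replacement) from the observed values of the
    same column. Hypothetical helper; columns are treated independently.
    """
    rng = np.random.default_rng(random_state)
    out = df.copy()
    for col in out.columns:
        missing = out[col].isna()
        observed = out.loc[~missing, col].to_numpy()
        # Skip columns with nothing to fill or nothing to sample from.
        if missing.any() and len(observed) > 0:
            out.loc[missing, col] = rng.choice(
                observed, size=int(missing.sum()), replace=True
            )
    return out


# Made-up example data with a numeric and a categorical column.
data = pd.DataFrame({
    "age": [23.0, np.nan, 41.0, 35.0, np.nan],
    "city": ["Oslo", "Bergen", None, "Oslo", "Oslo"],
})
imputed = impute_by_resampling(data)
```

Since values are drawn with replacement from the observed ones, frequent values are proportionally more likely to be chosen, and the approach works for categorical columns as well as numeric ones.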

Tags:

python

pandas

machine-learning

scikit-learn

data-science

clstaudt
