I'm trying to learn how to implement MICE for imputing missing values in my datasets. I've heard about fancyimpute's MICE, but I also read that sklearn's IterativeImputer class can accomplish similar results. From sklearn's docs:
Our implementation of IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations) [1], but differs from it by returning a single imputation instead of multiple imputations. However, IterativeImputer can also be used for multiple imputations by applying it repeatedly to the same dataset with different random seeds when sample_posterior=True.
I've seen "seeds" being used in different pipelines, but I've never understood them well enough to implement them in my own code. I was wondering if anyone could explain and provide an example of how to implement seeds for a MICE imputation using sklearn's IterativeImputer? Thanks!
The behavior of IterativeImputer can change depending on its random state. The random state, which is set through the random_state parameter, is also called a "seed".
As the documentation states, we can get multiple imputations by setting sample_posterior to True and varying the random seed, i.e. the random_state parameter.
Here is an example of how to use it:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required because IterativeImputer is still experimental
from sklearn.impute import IterativeImputer

X_train = [[1, 2],
           [3, 6],
           [4, 8],
           [np.nan, 3],
           [7, np.nan]]

X_test = [[np.nan, 2],
          [np.nan, np.nan],
          [np.nan, 6]]

for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    imp.fit(X_train)
    print(f"imputation {i}:")
    print(np.round(imp.transform(X_test)))
It outputs:
imputation 0:
[[ 1.  2.]
 [ 5. 10.]
 [ 3.  6.]]
imputation 1:
[[1. 2.]
 [0. 1.]
 [3. 6.]]
imputation 2:
[[1. 2.]
 [1. 2.]
 [3. 6.]]
We can observe the three different imputations.
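If you then want a single completed dataset out of these draws, one simple option is to stack the imputed matrices and average them element-wise. This is just a quick sketch of one pragmatic approach, not part of the original answer; a full MICE analysis would usually fit your model on each imputed dataset separately and pool the results instead:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X_train = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
X_test = [[np.nan, 2], [np.nan, np.nan], [np.nan, 6]]

# Shape (3, 3, 2): (n_imputations, n_rows, n_cols)
imputations = np.stack([
    IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    .fit(X_train)
    .transform(X_test)
    for i in range(3)
])

pooled = imputations.mean(axis=0)  # element-wise average over the 3 draws
print(pooled)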
A way to go about stacking the data might be to change @Stanislas' code around a bit like so:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

mvi = {}  # just my preference for a dict, you can use a list too
# mvi collects each imputed dataset into a dict of DataFrames, keyed 0 through 2
for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    mvi[i] = pd.DataFrame(np.round(imp.fit_transform(X_train)))

# Combine the imputations into a single dataset using
# a. pandas concat, or
dfAll = pd.concat(list(mvi.values()), axis=0)
# b. np.stack
dfs = np.stack(list(mvi.values()), axis=0)
pd.concat creates 2D data directly; np.stack, on the other hand, creates a 3D array that you can reshape into 2D. The breakdown of the numpy 3D array is (n_imputations, n_rows, n_cols), which here is (3, 5, 2).
To create a 2D array from the 3D one, you can use numpy reshape like so:
np.reshape(dfs, newshape=(dfs.shape[0]*dfs.shape[1], -1))
which means you essentially multiply axis 0 by axis 1 to stack the imputed datasets into one tall 2D array. The -1 at the end just tells numpy to infer whatever axis is left over, in this case the columns.
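To make the shapes concrete, here is a minimal self-contained version of the above, reusing the X_train from the first answer:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X_train = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]

# Three imputed copies of X_train, stacked along a new leading axis
dfs = np.stack([
    np.round(
        IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
        .fit_transform(X_train)
    )
    for i in range(3)
])
print(dfs.shape)      # (3, 5, 2): 3 imputations of 5 rows x 2 columns

# Collapse the first two axes: 3 * 5 = 15 rows, columns inferred by -1
stacked = np.reshape(dfs, newshape=(dfs.shape[0] * dfs.shape[1], -1))
print(stacked.shape)  # (15, 2)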