I'm trying to learn how to implement MICE for imputing missing values in my datasets. I've heard about fancyimpute's MICE, but I also read that sklearn's IterativeImputer class can accomplish similar results. From sklearn's docs:
Our implementation of IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations) [1], but differs from it by returning a single imputation instead of multiple imputations. However, IterativeImputer can also be used for multiple imputations by applying it repeatedly to the same dataset with different random seeds when sample_posterior=True.
I've seen "seeds" being used in different pipelines, but I've never understood them well enough to implement them in my own code. I was wondering if anyone could explain and provide an example of how to implement seeds for a MICE imputation using sklearn's IterativeImputer? Thanks!
The behavior of IterativeImputer can change depending on its random state. The random state, which is set through the random_state parameter, is also called a "seed".
As the documentation states, we can get multiple imputations by setting sample_posterior to True and varying the random seed, i.e. the random_state parameter.
Here is an example of how to use it:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required because IterativeImputer is still experimental
from sklearn.impute import IterativeImputer

X_train = [[1, 2],
           [3, 6],
           [4, 8],
           [np.nan, 3],
           [7, np.nan]]

X_test = [[np.nan, 2],
          [np.nan, np.nan],
          [np.nan, 6]]

for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    imp.fit(X_train)
    print(f"imputation {i}:")
    print(np.round(imp.transform(X_test)))
It outputs:
imputation 0:
[[ 1.  2.]
 [ 5. 10.]
 [ 3.  6.]]
imputation 1:
[[1. 2.]
 [0. 1.]
 [3. 6.]]
imputation 2:
[[1. 2.]
 [1. 2.]
 [3. 6.]]
We can observe the three different imputations.
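If you then want a single completed dataset out of these draws, one simple option is to stack the imputed matrices and average them element-wise. This is just a quick sketch of one pragmatic approach, not part of the original answer; a full MICE analysis would usually fit your model on each imputed dataset separately and pool the results instead:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X_train = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
X_test = [[np.nan, 2], [np.nan, np.nan], [np.nan, 6]]

# Shape (3, 3, 2): (n_imputations, n_rows, n_cols)
imputations = np.stack([
    IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    .fit(X_train)
    .transform(X_test)
    for i in range(3)
])

pooled = imputations.mean(axis=0)  # element-wise average over the 3 draws
print(pooled)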
A way to go about stacking the data might be to change @Stanislas' code around a bit like so:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

mvi = {}  # just my preference for a dict, you can use a list too
# mvi collects each imputed dataset into a dict of DataFrames, keyed 0 through 2
for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    mvi[i] = pd.DataFrame(np.round(imp.fit_transform(X_train)))

# Combine the imputations into a single dataset using
# a. pandas concat, or
dfAll = pd.concat(list(mvi.values()), axis=0)
# b. np.stack
dfs = np.stack(list(mvi.values()), axis=0)
pd.concat creates 2D data directly; np.stack, on the other hand, creates a 3D array that you can reshape into 2D. The breakdown of the numpy 3D array is (n_imputations, n_rows, n_cols), which here is (3, 5, 2).
To create a 2D array from the 3D one, you can use numpy reshape like so:
np.reshape(dfs, newshape=(dfs.shape[0]*dfs.shape[1], -1))
which means you essentially multiply axis 0 by axis 1 to stack the imputed datasets into one tall 2D array. The -1 at the end just tells numpy to infer whatever axis is left over, in this case the columns.
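To make the shapes concrete, here is a minimal self-contained version of the above, reusing the X_train from the first answer:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X_train = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]

# Three imputed copies of X_train, stacked along a new leading axis
dfs = np.stack([
    np.round(
        IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
        .fit_transform(X_train)
    )
    for i in range(3)
])
print(dfs.shape)      # (3, 5, 2): 3 imputations of 5 rows x 2 columns

# Collapse the first two axes: 3 * 5 = 15 rows, columns inferred by -1
stacked = np.reshape(dfs, newshape=(dfs.shape[0] * dfs.shape[1], -1))
print(stacked.shape)  # (15, 2)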