
How to preserve datatype in DataFrame from an sklearn Transform (Imputer)

I have the following data:

+----+-------------+----------+--------+------+-------+-------+---------+
| ID | PassengerId | Survived | Pclass | Age  | SibSp | Parch |  Fare   |
+----+-------------+----------+--------+------+-------+-------+---------+
|  0 |           1 |        0 |      3 | 22.0 |     1 |     0 | 7.2500  |
|  1 |           2 |        1 |      1 | 38.0 |     1 |     0 | 71.2833 |
|  2 |           3 |        1 |      3 | 26.0 |     0 |     0 | 7.9250  |
|  3 |           4 |        1 |      1 | 35.0 |     1 |     0 | 53.1000 |
|  4 |           5 |        0 |      3 | 35.0 |     0 |     0 | 8.0500  |
|  5 |           6 |        0 |      3 | NaN  |     0 |     0 | 8.4583  |
+----+-------------+----------+--------+------+-------+-------+---------+

After the transformation (imputation), the columns that were presumably int/bool are converted to floats:

+----+-------------+----------+--------+-----------+-------+-------+---------+
| ID | PassengerId | Survived | Pclass |    Age    | SibSp | Parch |  Fare   |
+----+-------------+----------+--------+-----------+-------+-------+---------+
|  0 | 1.0         | 0.0      | 3.0    | 22.000000 | 1.0   | 0.0   | 7.2500  |
|  1 | 2.0         | 1.0      | 1.0    | 38.000000 | 1.0   | 0.0   | 71.2833 |
|  2 | 3.0         | 1.0      | 3.0    | 26.000000 | 0.0   | 0.0   | 7.9250  |
|  3 | 4.0         | 1.0      | 1.0    | 35.000000 | 1.0   | 0.0   | 53.1000 |
|  4 | 5.0         | 0.0      | 3.0    | 35.000000 | 0.0   | 0.0   | 8.0500  |
|  5 | 6.0         | 0.0      | 3.0    | 28.000000 | 0.0   | 0.0   | 8.4583  |
+----+-------------+----------+--------+-----------+-------+-------+---------+

My code is below:

import pandas as pd
import numpy as np

#https://www.kaggle.com/shivamp629/traincsv/downloads/traincsv.zip/1
data = pd.read_csv("train.csv")

data2 = data[['PassengerId', 'Survived','Pclass','Age','SibSp','Parch','Fare']].copy()

from sklearn.preprocessing import Imputer

fill_NaN = Imputer(missing_values=np.nan, strategy='median', axis=0)
data2_im = pd.DataFrame(fill_NaN.fit_transform(data2), columns = data2.columns)

data2_im

Is there a way to preserve the datatypes? Thanks for any help.

asked Oct 28 '25 11:10 by Earl


1 Answer

The dtypes cannot be preserved, because sklearn extracts the underlying data from data2 before transforming and homogenises the dtypes to float for performance reasons.
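
The upcast can be reproduced with pandas and NumPy alone, without sklearn. A minimal sketch, assuming a small hypothetical frame named `sample` (not part of the question) with the same mix of integer and float columns:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data2: two integer columns and one float column with a NaN.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, np.nan, 26.0],
})

# A NumPy array has a single dtype, so the mixed int/float columns
# are upcast to a common float dtype when the frame is converted.
arr = sample.to_numpy()

print(sample.dtypes.to_dict())  # per-column dtypes of the DataFrame
print(arr.dtype)                # single dtype of the extracted array
```

Once the data is a single float array, the per-column dtype information is gone, which is why it has to be restored from the original DataFrame afterwards.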

You can always reinstate the initial dtypes using astype:

v = fill_NaN.fit_transform(data2)
df = pd.DataFrame(v, columns=data2.columns).astype(data2.dtypes.to_dict())
df

   PassengerId  Survived  Pclass   Age  SibSp  Parch     Fare
0            1         0       3  22.0      1      0   7.2500
1            2         1       1  38.0      1      0  71.2833
2            3         1       3  26.0      0      0   7.9250
3            4         1       1  35.0      1      0  53.1000
4            5         0       3  35.0      0      0   8.0500
5            6         0       3  35.0      0      0   8.4583
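
Note that `Imputer` was removed from `sklearn.preprocessing` in later scikit-learn releases. `sklearn.impute.SimpleImputer` is its replacement; it has no `axis` argument and always imputes column-wise. A self-contained sketch of the same round trip, assuming a recent scikit-learn and using the six rows shown in the question in place of `train.csv`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The six rows from the question, standing in for the full train.csv.
data2 = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})

# SimpleImputer replaces the old Imputer; the median is computed per column.
fill_NaN = SimpleImputer(missing_values=np.nan, strategy="median")
v = fill_NaN.fit_transform(data2)  # all-float result

# Rebuild the DataFrame and cast every column back to its original dtype.
df = pd.DataFrame(v, columns=data2.columns, index=data2.index)
df = df.astype(data2.dtypes.to_dict())

print(df.dtypes)
```

With the default NumPy-backed dtypes, a column that contains NaN is already float in pandas, so the integer columns here are never actually imputed and casting them back loses nothing.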

answered Oct 31 '25 00:10 by cs95


