I use Z-score from scipy to normalize my dataset as the following:
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import zscore
df = pd.DataFrame(pd.read_csv('dataset.csv', sep=','))
df = df.dropna(how='any') # drop nan entries
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] # remove outliers
print(df.describe())
df = df.apply(zscore) # Normalization
print(df.describe())
However, I get some columns changed to be NaN, particularly mta_tax and trip_type as you see below, yet they were numerical before applying Z-score normalization. Is this a bug in my code or Z-score can generate NaN?
VendorID RatecodeID PULocationID DOLocationID \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean 1.794324e+00 1.000000e+00 1.106734e+02 1.285285e+02
std 4.041947e-01 4.353414e-04 7.541486e+01 7.729142e+01
min 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
25% 2.000000e+00 1.000000e+00 4.900000e+01 6.100000e+01
50% 2.000000e+00 1.000000e+00 8.200000e+01 1.290000e+02
75% 2.000000e+00 1.000000e+00 1.660000e+02 1.930000e+02
max 2.000000e+00 2.000000e+00 2.650000e+02 2.650000e+02
passenger_count trip_distance fare_amount extra mta_tax \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07 10552857.0
mean 1.140647e+00 2.399851e+00 1.082419e+01 3.607218e-01 0.5
std 4.436568e-01 2.014673e+00 6.464638e+00 3.797668e-01 0.0
min 0.000000e+00 0.000000e+00 0.000000e+00 -6.700000e-01 0.5
25% 1.000000e+00 1.000000e+00 6.000000e+00 0.000000e+00 0.5
50% 1.000000e+00 1.700000e+00 9.000000e+00 5.000000e-01 0.5
75% 1.000000e+00 3.120000e+00 1.350000e+01 5.000000e-01 0.5
max 4.000000e+00 1.117000e+01 4.100000e+01 1.000000e+00 0.5
tip_amount tolls_amount improvement_surcharge total_amount \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean 1.028691e+00 5.512108e-02 3.000000e-01 1.312541e+01
std 1.510206e+00 5.524008e-01 4.110357e-11 7.370554e+00
min 0.000000e+00 0.000000e+00 3.000000e-01 0.000000e+00
25% 0.000000e+00 0.000000e+00 3.000000e-01 7.800000e+00
50% 0.000000e+00 0.000000e+00 3.000000e-01 1.080000e+01
75% 1.860000e+00 0.000000e+00 3.000000e-01 1.630000e+01
max 7.660000e+00 8.000000e+00 3.000000e-01 4.877000e+01
payment_type trip_type
count 1.055286e+07 10552857.0
mean 1.501672e+00 1.0
std 5.061254e-01 0.0
min 1.000000e+00 1.0
25% 1.000000e+00 1.0
50% 1.000000e+00 1.0
75% 2.000000e+00 1.0
max 3.000000e+00 1.0
VendorID RatecodeID PULocationID DOLocationID \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean -1.235870e-12 1.006184e-13 -3.819625e-14 -1.004818e-14
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -1.965201e+00 -4.353414e-04 -1.454268e+00 -1.649970e+00
25% 5.088537e-01 -4.353414e-04 -8.177886e-01 -8.736870e-01
50% 5.088537e-01 -4.353414e-04 -3.802090e-01 6.100220e-03
75% 5.088537e-01 -4.353414e-04 7.336298e-01 8.341353e-01
max 5.088537e-01 2.297048e+03 2.046369e+00 1.765675e+00
passenger_count trip_distance fare_amount extra mta_tax \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07 0.0
mean -3.942620e-14 -3.206434e-14 -4.744100e-13 -1.042732e-12 NaN
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 NaN
min -2.571013e+00 -1.191187e+00 -1.674370e+00 -2.714092e+00 NaN
25% -3.170185e-01 -6.948283e-01 -7.462435e-01 -9.498508e-01 NaN
50% -3.170185e-01 -3.473773e-01 -2.821804e-01 3.667467e-01 NaN
75% -3.170185e-01 3.574519e-01 4.139144e-01 3.667467e-01 NaN
max 6.444966e+00 4.353138e+00 4.667827e+00 1.683344e+00 NaN
tip_amount tolls_amount improvement_surcharge total_amount \
count 1.055286e+07 1.055286e+07 10552857.0 1.055286e+07
mean 3.152945e-13 -2.877092e-14 -1.0 2.081611e-14
std 1.000000e+00 1.000000e+00 0.0 1.000000e+00
min -6.811593e-01 -9.978459e-02 -1.0 -1.780791e+00
25% -6.811593e-01 -9.978459e-02 -1.0 -7.225258e-01
50% -6.811593e-01 -9.978459e-02 -1.0 -3.155007e-01
75% 5.504607e-01 -9.978459e-02 -1.0 4.307119e-01
max 4.390996e+00 1.438246e+01 -1.0 4.836080e+00
payment_type trip_type
count 1.055286e+07 0.0
mean 1.387184e-12 NaN
std 1.000000e+00 NaN
min -9.912012e-01 NaN
25% -9.912012e-01 NaN
50% -9.912012e-01 NaN
75% 9.845937e-01 NaN
max 2.960389e+00 NaN
Thank you
To follow up on the comments, have a look at the following:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,2,2], 'c': [5,6,7], 'd':[8,8,8] })
Note how 'b' and 'd' are constant, which implies a standard deviation of 0. Applying the zscore-function means subtracting the mean and dividing by the standard error. If you divide a number by 0, the result is not defined, and Pandas will show NaN.
df.apply(zscore)
Out[8]:
a b c d
0 -1.224745 NaN -1.224745 NaN
1 0.000000 NaN 0.000000 NaN
2 1.224745 NaN 1.224745 NaN
You either need to apply the zscore function to selected columns, or change the function so that it omits constant columns. To apply the function only to selected columns, you can do the following:
df[['a','c']] = df[['a','c']].apply(zscore)
df
Out[9]:
a b c d
0 -1.224745 2 -1.224745 8
1 0.000000 2 0.000000 8
2 1.224745 2 1.224745 8
To let the function check if the column is constant, let's use a lambda function that returns the column unchanged if the column has a standard deviation of 0, or the standardized column otherwise.
df.apply(lambda x: x if np.std(x) == 0 else zscore(x))
Out[15]:
a b c d
0 -1.224745 2 -1.224745 8
1 0.000000 2 0.000000 8
2 1.224745 2 1.224745 8
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With