Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why normalization using Z-score in Pandas dataframe generates NaN columns?

I use Z-score from scipy to normalize my dataset as the following:

import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import zscore

df = pd.DataFrame(pd.read_csv('dataset.csv', sep=','))
df = df.dropna(how='any') # drop nan entries
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] # remove outliers

print(df.describe())
df = df.apply(zscore) # Normalization
print(df.describe())

However, I get some columns changed to be NaN, particularly mta_tax and trip_type as you see below, yet they were numerical before applying Z-score normalization. Is this a bug in my code or Z-score can generate NaN?

Before Normalization:

           VendorID    RatecodeID  PULocationID  DOLocationID  \
count  1.055286e+07  1.055286e+07  1.055286e+07  1.055286e+07   
mean   1.794324e+00  1.000000e+00  1.106734e+02  1.285285e+02   
std    4.041947e-01  4.353414e-04  7.541486e+01  7.729142e+01   
min    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
25%    2.000000e+00  1.000000e+00  4.900000e+01  6.100000e+01   
50%    2.000000e+00  1.000000e+00  8.200000e+01  1.290000e+02   
75%    2.000000e+00  1.000000e+00  1.660000e+02  1.930000e+02   
max    2.000000e+00  2.000000e+00  2.650000e+02  2.650000e+02   

       passenger_count  trip_distance   fare_amount         extra     mta_tax  \
count     1.055286e+07   1.055286e+07  1.055286e+07  1.055286e+07  10552857.0   
mean      1.140647e+00   2.399851e+00  1.082419e+01  3.607218e-01         0.5   
std       4.436568e-01   2.014673e+00  6.464638e+00  3.797668e-01         0.0   
min       0.000000e+00   0.000000e+00  0.000000e+00 -6.700000e-01         0.5   
25%       1.000000e+00   1.000000e+00  6.000000e+00  0.000000e+00         0.5   
50%       1.000000e+00   1.700000e+00  9.000000e+00  5.000000e-01         0.5   
75%       1.000000e+00   3.120000e+00  1.350000e+01  5.000000e-01         0.5   
max       4.000000e+00   1.117000e+01  4.100000e+01  1.000000e+00         0.5   

         tip_amount  tolls_amount  improvement_surcharge  total_amount  \
count  1.055286e+07  1.055286e+07           1.055286e+07  1.055286e+07   
mean   1.028691e+00  5.512108e-02           3.000000e-01  1.312541e+01   
std    1.510206e+00  5.524008e-01           4.110357e-11  7.370554e+00   
min    0.000000e+00  0.000000e+00           3.000000e-01  0.000000e+00   
25%    0.000000e+00  0.000000e+00           3.000000e-01  7.800000e+00   
50%    0.000000e+00  0.000000e+00           3.000000e-01  1.080000e+01   
75%    1.860000e+00  0.000000e+00           3.000000e-01  1.630000e+01   
max    7.660000e+00  8.000000e+00           3.000000e-01  4.877000e+01   

       payment_type   trip_type  
count  1.055286e+07  10552857.0  
mean   1.501672e+00         1.0  
std    5.061254e-01         0.0  
min    1.000000e+00         1.0  
25%    1.000000e+00         1.0  
50%    1.000000e+00         1.0  
75%    2.000000e+00         1.0  
max    3.000000e+00         1.0 

After Normalization:

           VendorID    RatecodeID  PULocationID  DOLocationID  \
count  1.055286e+07  1.055286e+07  1.055286e+07  1.055286e+07   
mean  -1.235870e-12  1.006184e-13 -3.819625e-14 -1.004818e-14   
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
min   -1.965201e+00 -4.353414e-04 -1.454268e+00 -1.649970e+00   
25%    5.088537e-01 -4.353414e-04 -8.177886e-01 -8.736870e-01   
50%    5.088537e-01 -4.353414e-04 -3.802090e-01  6.100220e-03   
75%    5.088537e-01 -4.353414e-04  7.336298e-01  8.341353e-01   
max    5.088537e-01  2.297048e+03  2.046369e+00  1.765675e+00   

       passenger_count  trip_distance   fare_amount         extra  mta_tax  \
count     1.055286e+07   1.055286e+07  1.055286e+07  1.055286e+07      0.0   
mean     -3.942620e-14  -3.206434e-14 -4.744100e-13 -1.042732e-12      NaN   
std       1.000000e+00   1.000000e+00  1.000000e+00  1.000000e+00      NaN   
min      -2.571013e+00  -1.191187e+00 -1.674370e+00 -2.714092e+00      NaN   
25%      -3.170185e-01  -6.948283e-01 -7.462435e-01 -9.498508e-01      NaN   
50%      -3.170185e-01  -3.473773e-01 -2.821804e-01  3.667467e-01      NaN   
75%      -3.170185e-01   3.574519e-01  4.139144e-01  3.667467e-01      NaN   
max       6.444966e+00   4.353138e+00  4.667827e+00  1.683344e+00      NaN   

         tip_amount  tolls_amount  improvement_surcharge  total_amount  \
count  1.055286e+07  1.055286e+07             10552857.0  1.055286e+07   
mean   3.152945e-13 -2.877092e-14                   -1.0  2.081611e-14   
std    1.000000e+00  1.000000e+00                    0.0  1.000000e+00   
min   -6.811593e-01 -9.978459e-02                   -1.0 -1.780791e+00   
25%   -6.811593e-01 -9.978459e-02                   -1.0 -7.225258e-01   
50%   -6.811593e-01 -9.978459e-02                   -1.0 -3.155007e-01   
75%    5.504607e-01 -9.978459e-02                   -1.0  4.307119e-01   
max    4.390996e+00  1.438246e+01                   -1.0  4.836080e+00   

       payment_type  trip_type  
count  1.055286e+07        0.0  
mean   1.387184e-12        NaN  
std    1.000000e+00        NaN  
min   -9.912012e-01        NaN  
25%   -9.912012e-01        NaN  
50%   -9.912012e-01        NaN  
75%    9.845937e-01        NaN  
max    2.960389e+00        NaN 

Thank you

like image 328
steve Avatar asked Dec 20 '25 17:12

steve


1 Answers

To follow up on the comments, have a look at the following:

df = pd.DataFrame({'a': [1,2,3], 'b': [2,2,2], 'c': [5,6,7], 'd':[8,8,8] })

Note how 'b' and 'd' are constant, which implies a standard deviation of 0. Applying the zscore-function means subtracting the mean and dividing by the standard error. If you divide a number by 0, the result is not defined, and Pandas will show NaN.

df.apply(zscore)
Out[8]: 
          a   b         c   d
0 -1.224745 NaN -1.224745 NaN
1  0.000000 NaN  0.000000 NaN
2  1.224745 NaN  1.224745 NaN

You either need to apply the zscore function to selected columns, or change the function so that it omits constant columns. To apply the function only to selected columns, you can do the following:

df[['a','c']] = df[['a','c']].apply(zscore)

df
Out[9]: 
          a  b         c  d
0 -1.224745  2 -1.224745  8
1  0.000000  2  0.000000  8
2  1.224745  2  1.224745  8

To let the function check if the column is constant, let's use a lambda function that returns the column unchanged if the column has a standard deviation of 0, or the standardized column otherwise.

df.apply(lambda x: x if np.std(x) == 0 else zscore(x))
Out[15]: 
          a  b         c  d
0 -1.224745  2 -1.224745  8
1  0.000000  2  0.000000  8
2  1.224745  2  1.224745  8
like image 100
coffeinjunky Avatar answered Dec 23 '25 05:12

coffeinjunky



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!