Pandas - How to identify `nan` values in a Series

Tags: python, pandas

I am currently playing with the Kaggle Titanic dataset (train.csv).

  1. I can load the data fine.
  2. I understand that some entries in the Embarked column are nan. But when I try to filter them with the following code, I get an empty result:
    import pandas as pd
    df = pd.read_csv(<file_loc>, header=0)
    df[df.Embarked == 'nan']

I also tried importing numpy.nan to replace the string 'nan' above, but that doesn't work either.

What I am trying to find is all the cells that are not 'S', 'C', or 'Q'.

I also realised later that the nan is a float, using type(df.Embarked.unique()[-1]). Could someone help me understand how to identify those nan cells?

ha9u63ar asked Oct 20 '25 16:10


1 Answer

NaN is used to represent missing values.

  • To find them, use .isna()

    Detect missing values.

  • To replace them, use .fillna(value)

    Fill NA/NaN values

Some examples on a series called col:

>>> col
0    1.0
1    NaN
2    2.0
dtype: float64
>>> col[col.isna()]
1   NaN
dtype: float64
>>> col.index[col.isna()]
Int64Index([1], dtype='int64')
>>> col.fillna(-1)
0    1.0
1   -1.0
2    2.0
dtype: float64
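Applied to the question's case, the same methods work on a DataFrame column. The following is a minimal sketch: the small DataFrame is a made-up stand-in for a few rows of train.csv (the values are assumptions, not real Titanic data), and the variable names `missing` and `unexpected` are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of train.csv; the values are made up.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Embarked": ["S", np.nan, "C", "Q"],
})

# Rows where Embarked is missing.
missing = df[df["Embarked"].isna()]

# Rows where Embarked is anything other than 'S', 'C' or 'Q'.
# NaN is not in the list, so isin() returns False for it and ~ selects it.
unexpected = df[~df["Embarked"].isin(["S", "C", "Q"])]

print(missing)
print(unexpected)
```

The `isin` variant is closer to "all the cells which are not 'S', 'C', 'Q'", since it would also catch unexpected non-missing values such as a stray 'X'.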

Note that you can’t test for equality with nan: by definition it’s not equal to anything, not even itself:

>>> import numpy as np
>>> np.nan == np.nan
False

This is likely the property that is used to identify nan under the hood:

>>> col != col
0    False
1     True
2    False
dtype: bool

But it’s better (more readable) to use the pandas functions than to test for inequality yourself.
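This also explains why both attempts in the question come back empty. As a small sketch (the series `col` here is a made-up example, not the real Embarked column): comparing against the string 'nan' matches nothing because the missing entries are float NaN rather than text, and comparing against np.nan matches nothing because NaN never equals anything.

```python
import numpy as np
import pandas as pd

# Made-up example series with one missing value.
col = pd.Series(["S", np.nan, "C"])

by_string = col == "nan"    # the missing entry is a float NaN, not the text 'nan'
by_nan = col == np.nan      # NaN != NaN, so this is False everywhere
by_isna = col.isna()        # the supported way to detect missing values

print(by_string.any())                 # no match
print(by_nan.any())                    # no match
print(by_isna.tolist())                # only position 1 is flagged
print(isinstance(col.iloc[1], float))  # the missing entry is a float
print(int(by_isna.sum()))              # number of missing values
```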

Cimbali answered Oct 23 '25 10:10
