Pandas

Question

I am currently playing with the Kaggle Titanic dataset (train.csv).

I can load the data fine.
I understand that some entries in the Embarked column are nan. But when I try to filter for them with the following code, I get an empty result:

    import pandas as pd
    df = pd.read_csv(<file_loc>, header=0)
    df[df.Embarked == 'nan']

I tried importing numpy.nan and using it in place of the string 'nan' above, but that doesn't work either.

What I am trying to find is all the cells which are not 'S', 'C' or 'Q'.

I also realised later, using type(df.Embarked.unique()[-1]), that the nan is a float. Could someone help me understand how to identify those nan cells?

Cimbali · Accepted Answer

NaN is used to represent missing values.

To find them, use .isna(), documented as "Detect missing values."

To replace them, use .fillna(value), documented as "Fill NA/NaN values."

Some examples on a series called col:

>>> col
0    1.0
1    NaN
2    2.0
dtype: float64
>>> col[col.isna()]
1   NaN
dtype: float64
>>> col.index[col.isna()]
Int64Index([1], dtype='int64')
>>> col.fillna(-1)
0    1.0
1   -1.0
2    2.0
dtype: float64
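
The same calls work on a DataFrame column such as Embarked. A minimal sketch, assuming a small hand-made frame standing in for train.csv (the rows and the PassengerId values are invented for illustration; only the Embarked column name comes from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Embarked": ["S", np.nan, "C", np.nan],
})

# Boolean mask: True where Embarked is missing.
missing = df["Embarked"].isna()

# Rows whose Embarked value is missing.
missing_rows = df[missing]
print(missing_rows)

# Count of missing values in the column.
n_missing = int(missing.sum())
print(n_missing)
```

Indexing the frame with the mask keeps whole rows, which is usually what you want when inspecting the affected records.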

Note that you can’t test for nan with ==, because by definition it’s not equal to anything, not even itself:

>>> import numpy as np
>>> np.nan == np.nan
False
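
This would also explain why the filter in the question comes back empty, whether it compares against the string 'nan' or against numpy.nan. A short sketch, assuming a hypothetical series s with one missing entry:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry.
s = pd.Series(["S", np.nan, "Q"])

# The missing entry is not the string 'nan', so nothing matches.
eq_string = s == "nan"

# NaN never compares equal to anything, so nothing matches here either.
eq_nan = s == np.nan

# isna() is the check that actually flags the missing entry.
is_missing = s.isna()

print(eq_string.any(), eq_nan.any(), is_missing.tolist())
```

pd.isna also accepts a scalar, so pd.isna(np.nan) can be used to check a single value.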

This is likely the property that is used to identify nan under the hood:

>>> col != col
0    False
1     True
2    False
dtype: bool

But it’s better (more readable) to use the pandas functions than to test for inequality yourself.
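
For the broader goal in the question (every cell that is not 'S', 'C' or 'Q'), one option is to negate Series.isin. A missing value is not in the list, so it gets selected along with any other unexpected code. A minimal sketch on invented data, where "X" is a made-up unexpected value:

```python
import numpy as np
import pandas as pd

# Hypothetical data; "X" is an invented unexpected code.
df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q", "X"]})

expected = ["S", "C", "Q"]

# ~ inverts the mask: keep rows whose value is NOT one of the expected codes.
unexpected = df[~df["Embarked"].isin(expected)]
print(unexpected)
```

If only the missing values matter, df[df["Embarked"].isna()] is the more direct check.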

Pandas - How to identify `nan` values in a Series

Tags:

python

ha9u63ar
