pd.NA vs np.nan in pandas: which one should I use, and why? What are the main advantages and disadvantages of each?
Some sample code that uses them both:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'object': ['a', 'b', 'c', pd.NA],
    'numeric': [1, 2, np.nan, 4],
    'categorical': pd.Categorical(['d', np.nan, 'f', 'g']),
})
Output:
|    | object   |   numeric | categorical   |
|---:|:---------|----------:|:--------------|
|  0 | a        |         1 | d             |
|  1 | b        |         2 | nan           |
|  2 | c        |       nan | f             |
|  3 | <NA>     |         4 | g             |
Note that pandas/NumPy uses the fact that np.nan != np.nan.
The pandas documentation refers to what most developers call null values as missing values or missing data. Within pandas, a missing value is denoted by NaN.
The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.
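As a minimal sketch of that detection API, the snippet below rebuilds the sample DataFrame from the question and applies isnull and notnull to it. The variable names missing, counts and present are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question: one missing value per column,
# stored as pd.NA (object column) or np.nan (numeric and categorical).
df = pd.DataFrame({
    "object": ["a", "b", "c", pd.NA],
    "numeric": [1, 2, np.nan, 4],
    "categorical": pd.Categorical(["d", np.nan, "f", "g"]),
})

# isnull flags missing entries regardless of the column dtype.
missing = df.isnull()

# Summing the boolean mask counts the missing values per column.
counts = missing.sum()

# notnull is the element-wise negation of isnull.
present = df.notnull()

print(counts)
```

Both pd.NA and np.nan are reported as missing here, so detection code does not need to know which marker a column uses.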
np.nan is a single object that always has the same id, no matter which variable you assign it to. So np.nan is np.nan is True, and for two variables one and two that both hold np.nan, one is two is also True.
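A short sketch of the identity and equality behaviour of np.nan. The variable names one and two follow the text; the small Series at the end is an illustrative assumption.

```python
import numpy as np
import pandas as pd

one = np.nan
two = np.nan

# Identity: both names refer to the same np.nan object.
same_object = one is two

# Equality: NaN never compares equal to anything, including itself.
equal = one == two

# This is why missing values are detected with isna/isnull, not with ==.
detected = pd.isna(one)

# In a float Series, None is treated like np.nan.
s = pd.Series([1.0, np.nan, None])
mask = s.isna()

print(same_object, equal, detected, mask.tolist())
```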
As of now (the pandas 1.0.0 release), I would recommend using pd.NA carefully.
First, it's still an experimental feature:
Experimental: the behaviour of pd.NA can still change without warning.
Second, the behaviour differs from np.nan:
Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations.
Both quotes are from the release notes.
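To make the quoted difference concrete, here is a minimal sketch comparing the two markers in an arithmetic operation and in a comparison. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Arithmetic: both markers propagate.
nan_sum = np.nan + 1   # still NaN
na_sum = pd.NA + 1     # still pd.NA

# Comparison: np.nan yields a plain False, pd.NA yields pd.NA ("unknown").
nan_eq = np.nan == 1
na_eq = pd.NA == 1

# Because the result is "unknown", pd.NA cannot be used as a truth value.
try:
    bool(pd.NA)
    na_bool_raises = False
except TypeError:
    na_bool_raises = True

print(nan_sum, na_sum, nan_eq, na_eq, na_bool_raises)
```

In practice this means an expression such as if value == 1: works with np.nan (the branch is simply skipped) but raises a TypeError when value is pd.NA.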
As an additional example, I was surprised by the interpolation behaviour.
Create a simple DataFrame:
df = pd.DataFrame({"a": [0, pd.NA, 2], "b": [0, np.nan, 2]})
df
#       a    b
# 0     0  0.0
# 1  <NA>  NaN
# 2     2  2.0
and try to interpolate:
df.interpolate()
#       a    b
# 0     0  0.0
# 1  <NA>  1.0
# 2     2  2.0
There are reasons for this (I am still discovering them). I just want to highlight the differences: pd.NA is an experimental feature, and it behaves differently from np.nan in some cases.
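One plausible reason, offered here as an assumption and not something the release notes confirm: a column that mixes plain integers with pd.NA is stored with the generic object dtype, while the np.nan column is float64, and interpolation needs numeric data. The sketch below checks the dtypes and then converts column a to float64 by way of the nullable Int64 dtype before interpolating. The conversion route is only one option, and how interpolate treats object columns varies between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, pd.NA, 2], "b": [0, np.nan, 2]})

# Column "a" holds Python ints plus pd.NA, so it falls back to object dtype;
# column "b" is an ordinary float64 column with NaN.
dtype_a = df["a"].dtype
dtype_b = df["b"].dtype
print(dtype_a, dtype_b)

# Go through the nullable Int64 dtype, then to float64 so that the
# missing entry becomes NaN and can be interpolated.
a_as_float = df["a"].astype("Int64").astype("float64")
filled = a_as_float.interpolate()
print(filled.tolist())
```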
I think pd.NA will be a very useful feature, but I would be careful with statements like "It should be completely fine to use it instead of np.nan". That may be true in most cases, but it can cause trouble if you are not aware of the differences.