It seems that checking if np.nan is in a list after pulling the list from a pandas dataframe does not correctly return True as expected. I have an example below to demonstrate:
from numpy import nan
import pandas as pd
basic_list = [0.0, nan, 1.0, 2.0]
nan_in_list = (nan in basic_list)
print(f"Is nan in {basic_list}? {nan_in_list}")
df = pd.DataFrame({'test_list': basic_list})
pandas_list = df['test_list'].to_list()
nan_in_pandas_list = (nan in pandas_list)
print(f"Is nan in {pandas_list}? {nan_in_pandas_list}")
I would expect the output of this program to be:
Is nan in [0.0, nan, 1.0, 2.0]? True
Is nan in [0.0, nan, 1.0, 2.0]? True
But instead it is
Is nan in [0.0, nan, 1.0, 2.0]? True
Is nan in [0.0, nan, 1.0, 2.0]? False
What is the cause of this odd behavior or am I missing something?
Edit: Adding on to this, if I run the code:
for item in pandas_list:
    print(type(item))
    print(item)
it has the exact same output as if I were to swap pandas_list with basic_list. However pandas_list == basic_list evaluates to False.
pandas returns a different nan object than np.nan, and the in operator for lists first checks whether the two objects are identical.
The in operator invokes the list's __contains__ magic method; here is the CPython source code:
static int
list_contains(PyListObject *a, PyObject *el)
{
    PyObject *item;
    Py_ssize_t i;
    int cmp;

    for (i = 0, cmp = 0 ; cmp == 0 && i < Py_SIZE(a); ++i) {
        item = PyList_GET_ITEM(a, i);
        Py_INCREF(item);
        cmp = PyObject_RichCompareBool(item, el, Py_EQ);
        Py_DECREF(item);
    }
    return cmp;
}
As you can see, it calls PyObject_RichCompareBool, whose documentation states:
If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.
So:
basic_list = [0.0, nan, 1.0, 2.0]
for v in basic_list:
    print(v == nan, v is nan)
print(nan in basic_list)
Prints:
False False
False True
False False
False False
True
And:
df = pd.DataFrame({"test_list": basic_list})
pandas_list = df["test_list"].to_list()
for v in pandas_list:
    print(v == nan, v is nan)
print(nan in pandas_list)
Prints:
False False
False False
False False
False False
False
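Evidently, pandas is returning a different nan object: the column is stored as a float64 array, and to_list() creates new Python float objects from it.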
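The same effect can be reproduced without pandas. The sketch below uses the standard-library array module as a stand-in for a float64 column. This is an assumption made purely for illustration, not a description of how pandas is implemented: the values are stored as raw C doubles, and tolist() builds new Python float objects on the way out.

```python
from array import array

nan = float("nan")
basic_list = [0.0, nan, 1.0, 2.0]

# Store the values as raw C doubles, then box them back into new float objects.
roundtrip_list = array("d", basic_list).tolist()

same_object = roundtrip_list[1] is nan      # the NaN coming back is a new object
found_in_basic = nan in basic_list          # matches by identity
found_in_roundtrip = nan in roundtrip_list  # no identity match, and nan != nan

print(same_object, found_in_basic, found_in_roundtrip)  # False True False
```

The round trip preserves the values but not the object identity, which is the only thing that made nan in basic_list succeed.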
So, for the built-in list type, in checks containment using is first (as an optimization). From the docs:
For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y).
(Note, of course, that the above isn't how this is actually implemented! Dicts and sets, for example, use hash-based approaches to check containment, so they are average-case O(1) instead of O(n).)
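As a rough illustration of that documented equivalence, here is a minimal sketch. The helper name contains_sketch is hypothetical and only mirrors the expression from the docs; it is not the real list implementation. It also shows that list equality compares elements the same way, which is why two lists that print identically can still compare unequal.

```python
def contains_sketch(container, x):
    """Mirror the documented equivalence: identity first, then equality."""
    return any(x is e or x == e for e in container)


nan = float("nan")
same_object_list = [1.0, nan, 3.0]            # holds the very same nan object
fresh_object_list = [1.0, float("nan"), 3.0]  # holds a different nan object

print(contains_sketch(same_object_list, nan))   # True
print(contains_sketch(fresh_object_list, nan))  # False

# List equality takes the same identity shortcut for each element.
print(same_object_list == [1.0, nan, 3.0])      # True
print(same_object_list == fresh_object_list)    # False
```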
The identity check is an optimization that the runtime uses because well-behaved types should always respect the logical implication "if x is y, then x == y". But this happens not to be true for the very strange value float('nan').
Since you are checking with the same object that you put into the list, the one that numpy exposes as nan (it is effectively doing something like nan = float('nan')), the in check returns True for a list constructed with that object.
We can reproduce this behavior like this:
nan = float('nan')
data = [1, nan, 3]
print(nan in data) # True
print(float('nan') in data) # False
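If the goal is simply to find out whether a list contains any NaN, a check that does not depend on object identity is safer. The following is a minimal sketch using math.isnan; the helper name contains_nan is hypothetical, and it assumes NaN values appear as float instances (which includes numpy.float64).

```python
import math


def contains_nan(values):
    """Return True if any float in `values` is NaN, regardless of object identity."""
    return any(isinstance(v, float) and math.isnan(v) for v in values)


data = [1, float("nan"), 3]
print(float("nan") in data)     # False
print(contains_nan(data))       # True
print(contains_nan([1, 2, 3]))  # False
```

When the data is still in the DataFrame, df['test_list'].isna().any() should give the same answer without converting to a list first.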