I want to list a dataframe where a specific column is either null or not null, I have it working using -
df[df.Survive.notnull()] # contains no missing values
df[df.Survive.isnull()] #---> contains missing values
This works perfectly but I want to make my code more dynamic and pass the column "Survive" as a variable but it's not working for me.
I tried:
variableToPredict = ['Survive']
df[df[variableToPredict].notnull()]
I get the error - ValueError: cannot reindex from a duplicate axis
I'm sure I'm making a silly mistake, what can I do to fix this?
So idea is always is necessary Series or list or 1d array for mask for filtering.
If want test only one column use scalar:
variableToPredict = 'Survive'
df[df[variableToPredict].notnull()]
But if add [] output is one column DataFrame, so is necessaty change function for test by any (test if at least one NaN per row, sense in multiple columns) or all (test if all NaNs per row, sense in multiple columns) functions:
variableToPredict = ['Survive']
df[df[variableToPredict].notnull().any(axis=1)]
variableToPredict = ['Survive', 'another column']
df[df[variableToPredict].notnull().any(axis=1)]
Sample:
df = pd.DataFrame({'Survive':[np.nan, 'A', 'B', 'B', np.nan],
'another column':[np.nan, np.nan, 'a','b','b']})
print (df)
Survive another column
0 NaN NaN
1 A NaN
2 B a
3 B b
4 NaN b
First if test only one column:
variableToPredict = 'Survive'
df1 = df[df[variableToPredict].notnull()]
print (df1)
Survive another column
1 A NaN
2 B a
3 B b
print (type(df[variableToPredict]))
<class 'pandas.core.series.Series'>
#Series
print (df[variableToPredict])
0 NaN
1 A
2 B
3 B
4 NaN
Name: Survive, dtype: object
print (df[variableToPredict].isnull())
0 True
1 False
2 False
3 False
4 True
Name: Survive, dtype: bool
If use list - here one element list:
variableToPredict = ['Survive']
print (type(df[variableToPredict]))
<class 'pandas.core.frame.DataFrame'>
#one element DataFrame
print (df[variableToPredict])
Survive
0 NaN
1 A
2 B
3 B
4 NaN
If testing per rows it is same output for any or all:
print (df[variableToPredict].notnull().any(axis=1))
0 False
1 True
2 True
3 True
4 False
dtype: bool
print (df[variableToPredict].notnull().all(axis=1))
0 False
1 True
2 True
3 True
4 False
dtype: bool
If test one or more columns in list:
variableToPredict = ['Survive', 'another column']
print (type(df[variableToPredict]))
<class 'pandas.core.frame.DataFrame'>
print (df[variableToPredict])
Survive another column
0 NaN NaN
1 A NaN
2 B a
3 B b
4 NaN b
print (df[variableToPredict].notnull())
Survive another column
0 False False
1 True False
2 True True
3 True True
4 False True
#at least one NaN per row, at least one True
print (df[variableToPredict].notnull().any(axis=1))
0 False
1 True
2 True
3 True
4 True
dtype: bool
#all NaNs per row, all Trues
print (df[variableToPredict].notnull().all(axis=1))
0 False
1 False
2 True
3 True
4 False
dtype: bool
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With