In a Pandas dataframe, I would like to filter out all the rows that have more than 2 NaNs.
Essentially, I have 4 columns and I would like to keep only those rows where at least 2 columns have finite values.
Can somebody advise on how to achieve this?
You can filter out NaN rows with the DataFrame.dropna() method. For example, df.dropna(thresh=2) keeps only the rows that have at least two non-NaN values and drops the rest.
In Spark, the DataFrame filter() or where() functions can filter rows with NULL values by checking IS NULL or isNull(). Filtering a state column for non-null values, for instance, removes all rows where state is null and returns a new DataFrame.
You have phrased 2 slightly different questions here. In the general case, they have different answers.
I would like to keep only those rows where at least 2 columns have finite values.
df = df.dropna(thresh=2)
This keeps rows with 2 or more non-null values.
I would like to filter out all the rows that have more than 2 NaNs
df = df.dropna(thresh=df.shape[1]-2)
This filters out rows with more than 2 null values, i.e. it keeps rows with at most 2 nulls.
In your example dataframe of 4 columns, these operations are equivalent, since df.shape[1] - 2 == 2. However, you will notice discrepancies with dataframes which do not have exactly 4 columns.
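To make the discrepancy concrete, here is a minimal sketch using a hypothetical 5-column dataframe. The name df5, the column names, and the values are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-column frame; values chosen only to show the difference.
df5 = pd.DataFrame({
    'a': [1, 1, 1, np.nan],
    'b': [2, 2, np.nan, np.nan],
    'c': [3, np.nan, np.nan, np.nan],
    'd': [4, np.nan, np.nan, np.nan],
    'e': [5, 5, 5, 5],
})

# Keep rows with at least 2 non-null values.
at_least_two_values = df5.dropna(thresh=2)

# Keep rows with at most 2 nulls, i.e. at least (ncols - 2) non-null values.
at_most_two_nans = df5.dropna(thresh=df5.shape[1] - 2)

print(list(at_least_two_values.index))  # [0, 1, 2]
print(list(at_most_two_nans.index))     # [0, 1]
```

Row 2 has 2 values and 3 NaNs, so it survives the first filter but not the second.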
Note dropna also has a subset argument should you wish to include only specified columns when applying a threshold. For example:
df = df.dropna(subset=['col1', 'col2', 'col3'], thresh=2)
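As a rough illustration of how subset changes the result, the sketch below compares the two calls on a small made-up dataframe. The column col4 and all values are assumptions added for the example.

```python
import numpy as np
import pandas as pd

# Made-up data: col4 is always populated but is left out of the subset.
df = pd.DataFrame({
    'col1': [1, np.nan, np.nan],
    'col2': [2, 2, np.nan],
    'col3': [3, np.nan, np.nan],
    'col4': [4, 4, 4],
})

# Threshold counted across all four columns.
all_columns = df.dropna(thresh=2)

# Threshold counted only across col1, col2 and col3.
subset_only = df.dropna(subset=['col1', 'col2', 'col3'], thresh=2)

print(list(all_columns.index))  # [0, 1]
print(list(subset_only.index))  # [0]
```

Row 1 has two non-null values overall, but only one of them falls inside the subset, so it is dropped when subset is given. The returned rows still contain every column, including col4.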
The following should work
df.dropna(thresh=2)
See the online docs
What we are doing here is keeping only the rows that have 2 or more non-NaN values; any row with fewer than 2 non-NaN values is dropped.
Example:
In [25]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,2,np.nan,4,5], 'b':[np.nan,2,np.nan,4,5], 'c':[1,2,np.nan,np.nan,np.nan], 'd':[1,2,3,np.nan,5]})
df
Out[25]:
     a    b    c    d
0    1  NaN    1    1
1    2    2    2    2
2  NaN  NaN  NaN    3
3    4    4  NaN  NaN
4    5    5  NaN    5
[5 rows x 4 columns]
In [26]:
df.dropna(thresh=2)
Out[26]:
     a    b    c    d
0    1  NaN    1    1
1    2    2    2    2
3    4    4  NaN  NaN
4    5    5  NaN    5
[4 rows x 4 columns]
EDIT
For the above example it works, but note that you have to know the number of columns and set the thresh value accordingly. I originally thought thresh meant the number of NaN values, but it actually means the number of non-NaN values.
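If you would rather express the condition in terms of NaN counts directly, one possible alternative (not part of the original answer) is to count the NaNs per row with isna().sum(axis=1) and filter with a boolean mask. A minimal sketch on the same example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan, 4, 5],
    'b': [np.nan, 2, np.nan, 4, 5],
    'c': [1, 2, np.nan, np.nan, np.nan],
    'd': [1, 2, 3, np.nan, 5],
})

# Number of NaN values in each row.
nan_counts = df.isna().sum(axis=1)

# Keep rows with at most 2 NaNs, regardless of how many columns there are.
result = df[nan_counts <= 2]

print(list(nan_counts))    # [1, 0, 3, 2, 1]
print(list(result.index))  # [0, 1, 3, 4]
```

On this 4-column frame the result matches df.dropna(thresh=2), but the mask does not depend on the column count.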