Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python pandas remove duplicate rows that have a column value "NaN"

The need to rows that have NaN values in them but are also duplicates. For example this table:

    A   B   C
0   foo 2   3
1   foo nan nan
2   foo 1   4
3   bar nan nan
4   foo nan nan

Should become this:

    A   B   C
0   foo 2   3
2   foo 1   4
3   bar nan nan

How can i do that?

like image 642
Lame Fanello Avatar asked Sep 02 '25 13:09

Lame Fanello


2 Answers

Use boolean indexing:

df = df[~df['A'].duplicated(keep=False) | df[['B','C']].notnull().any(axis=1)]
print (df)
     A    B    C
0  foo  2.0  3.0
2  foo  1.0  4.0
3  bar  NaN  NaN

Explanation:

Test column A for not duplicates - duplicated with ~ for invert boolean mask:

print (~df['A'].duplicated(keep=False))
0    False
1    False
2    False
3     True
4    False
Name: A, dtype: bool

Check non missing values in B,C columns:

print (df[['B','C']].notnull())
       B      C
0   True   True
1  False  False
2   True   True
3  False  False
4  False  False

And then at least one True per row with DataFrame.any:

print (df[['B','C']].notnull().any(axis=1))
0     True
1    False
2     True
3    False
4    False
dtype: bool

Chain together by | for bitwise OR:

print (~df['A'].duplicated(keep=False) | df[['B','C']].notnull().any(axis=1))
0     True
1    False
2     True
3     True
4    False
dtype: bool
like image 85
jezrael Avatar answered Sep 05 '25 05:09

jezrael


Slightly different to jezrael's solution:

>>> df                                                                                                     
     A    B    C
0  foo  2.0  3.0
1  foo  NaN  NaN
2  foo  1.0  4.0
3  bar  NaN  NaN
4  foo  NaN  NaN
>>>
>>> df.drop(index=df[df.duplicated(keep=False)].isnull().any(1).index)                                  
     A    B    C
0  foo  2.0  3.0
2  foo  1.0  4.0
3  bar  NaN  NaN

Steps:

>>> df.duplicated(keep=False)                                                                            
0    False
1     True
2    False
3    False
4     True
dtype: bool
>>>
>>> df[df.duplicated(keep=False)]                                                                       
      A   B   C
1  foo NaN NaN
4  foo NaN NaN
>>>
>>> df[df.duplicated(keep=False)].isnull()                                                                 
       A     B     C
1  False  True  True
4  False  True  True
>>>
>>> df[df.duplicated(keep=False)].isnull().any(1).index                                                          
Int64Index([1, 4], dtype='int64')
like image 22
timgeb Avatar answered Sep 05 '25 06:09

timgeb