The need to rows that have NaN values in them but are also duplicates. For example this table:
A B C
0 foo 2 3
1 foo nan nan
2 foo 1 4
3 bar nan nan
4 foo nan nan
Should become this:
A B C
0 foo 2 3
2 foo 1 4
3 bar nan nan
How can i do that?
Use boolean indexing
:
df = df[~df['A'].duplicated(keep=False) | df[['B','C']].notnull().any(axis=1)]
print (df)
A B C
0 foo 2.0 3.0
2 foo 1.0 4.0
3 bar NaN NaN
Explanation:
Test column A
for not duplicates - duplicated
with ~
for invert boolean mask:
print (~df['A'].duplicated(keep=False))
0 False
1 False
2 False
3 True
4 False
Name: A, dtype: bool
Check non missing values in B,C
columns:
print (df[['B','C']].notnull())
B C
0 True True
1 False False
2 True True
3 False False
4 False False
And then at least one True per row with DataFrame.any
:
print (df[['B','C']].notnull().any(axis=1))
0 True
1 False
2 True
3 False
4 False
dtype: bool
Chain together by |
for bitwise OR
:
print (~df['A'].duplicated(keep=False) | df[['B','C']].notnull().any(axis=1))
0 True
1 False
2 True
3 True
4 False
dtype: bool
Slightly different to jezrael's solution:
>>> df
A B C
0 foo 2.0 3.0
1 foo NaN NaN
2 foo 1.0 4.0
3 bar NaN NaN
4 foo NaN NaN
>>>
>>> df.drop(index=df[df.duplicated(keep=False)].isnull().any(1).index)
A B C
0 foo 2.0 3.0
2 foo 1.0 4.0
3 bar NaN NaN
Steps:
>>> df.duplicated(keep=False)
0 False
1 True
2 False
3 False
4 True
dtype: bool
>>>
>>> df[df.duplicated(keep=False)]
A B C
1 foo NaN NaN
4 foo NaN NaN
>>>
>>> df[df.duplicated(keep=False)].isnull()
A B C
1 False True True
4 False True True
>>>
>>> df[df.duplicated(keep=False)].isnull().any(1).index
Int64Index([1, 4], dtype='int64')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With