I need to drop all duplicates using Pandas, except for the ones where the cell contains a certain string.
Given that DF is:
NAME ID
Joe 110
Joe 123
Joe PENDING
Mary PENDING
Mary 110
Justin 123
I need to keep every row where 'ID' is PENDING, and at the same time drop the remaining duplicates in the 'ID' column, keeping only the first occurrence of each value.
Desired output looks like this:
NAME ID
Joe 110
Joe 123
Joe PENDING
Mary PENDING
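You could use DataFrame.duplicated and keep each row that is either the first occurrence of its ID or has an ID equal to PENDING: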
import pandas as pd

data = [['Joe', 110],
        ['Joe', 123],
        ['Joe', 'PENDING'],
        ['Mary', 'PENDING'],
        ['Mary', 110],
        ['Justin', 123]]
df = pd.DataFrame(data=data, columns=['NAME', 'ID'])

# Keep the first occurrence of each ID, plus every row whose ID is 'PENDING'
print(df[~df.duplicated('ID') | (df['ID'] == 'PENDING')])
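To see why this works, it can help to look at the two boolean masks separately before they are combined with |. This is only a minimal sketch on the same data; the names first_of_id, is_pending and keep are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'NAME': ['Joe', 'Joe', 'Joe', 'Mary', 'Mary', 'Justin'],
    'ID': [110, 123, 'PENDING', 'PENDING', 110, 123],
})

# True for the first row carrying each ID value, False for later repeats
first_of_id = ~df.duplicated('ID')

# True wherever the ID is the protected string
is_pending = df['ID'] == 'PENDING'

# A row survives if either condition holds
keep = first_of_id | is_pending

print(pd.DataFrame({'first_of_id': first_of_id,
                    'is_pending': is_pending,
                    'keep': keep}))
```

The second PENDING row (Mary) is a repeat as far as duplicated is concerned, so it is only kept because of the is_pending mask.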
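As an alternative, on this particular data you could do the following. Note that duplicated(keep='last') is True only for rows that have a later duplicate, so an ID that occurs just once (and is not PENDING) would be dropped: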
print(df[df.ID.duplicated(keep='last') | df.ID.eq('PENDING')])
Output
   NAME       ID
0   Joe      110
1   Joe      123
2   Joe  PENDING
3  Mary  PENDING
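If you need this in more than one place, the first approach can be wrapped in a small helper. The sketch below is only an illustration: the function name drop_duplicates_except and the extra 'Ann' row are assumptions added here, not part of the question. The extra row has an ID that occurs only once, which shows where the two snippets above stop agreeing.

```python
import pandas as pd

def drop_duplicates_except(df, column, keep_value):
    """Drop repeated values in `column`, but never drop rows equal to `keep_value`.

    Hypothetical helper, shown for illustration only.
    """
    first_occurrence = ~df.duplicated(column)  # keep='first' is the default
    protected = df[column].eq(keep_value)
    return df[first_occurrence | protected]

data = [['Joe', 110],
        ['Joe', 123],
        ['Joe', 'PENDING'],
        ['Mary', 'PENDING'],
        ['Mary', 110],
        ['Justin', 123],
        ['Ann', 999]]  # extra row (assumption): an ID that occurs only once
df = pd.DataFrame(data, columns=['NAME', 'ID'])

result = drop_duplicates_except(df, 'ID', 'PENDING')
print(result)

# The keep='last' variant, for comparison on the same data
alt = df[df.ID.duplicated(keep='last') | df.ID.eq('PENDING')]
print(alt)
```

Here result keeps Ann's row, while alt does not, because 999 has no later duplicate.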