I have a dataframe df:
columnA columnB columnC columnD columnE
A B 10 C C
A B 10 D A
B C 20 A A
B A 20 D A
B A 20 D C
I want to drop rows that have duplicate entries in columnA, columnB and columnC. In my case the duplicates are:
columnA columnB columnC columnD columnE
A B 10 C C
A B 10 D A
B A 20 D A
B A 20 D C
How can I keep, from each group of duplicate rows, the one where columnE is equal to C?
The expected output for the full dataframe is:
columnA columnB columnC columnD columnE
A B 10 C C
B C 20 A A
B A 20 D C
You can use DataFrame.sort_values to move the rows where columnE is C to the top, then DataFrame.drop_duplicates to keep the first row of each group. To restore the original order, add DataFrame.sort_index. The key argument of sort_values requires pandas 1.1 or later:
out = (df.sort_values('columnE', key=lambda x: x.ne('C'))
         .drop_duplicates(['columnA','columnB','columnC'])
         .sort_index())
print(out)
  columnA columnB  columnC columnD columnE
0       A       B       10       C       C
2       B       C       20       A       A
4       B       A       20       D       C
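For anyone who wants to reproduce this, here is a minimal self-contained sketch of the sort-based approach. The DataFrame construction is reconstructed from the table in the question, the `keys` name is only for illustration, and `kind="mergesort"` is my own addition (an assumption, not part of the answer above) so that rows without C keep their original relative order when they tie:

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "columnA": ["A", "A", "B", "B", "B"],
    "columnB": ["B", "B", "C", "A", "A"],
    "columnC": [10, 10, 20, 20, 20],
    "columnD": ["C", "D", "A", "D", "D"],
    "columnE": ["C", "A", "A", "A", "C"],
})

# Columns that define a duplicate (illustrative name).
keys = ["columnA", "columnB", "columnC"]

out = (
    # ne("C") is False for the preferred rows, so they sort first.
    # mergesort is stable, so tied rows keep their original order.
    df.sort_values("columnE", key=lambda s: s.ne("C"), kind="mergesort")
      # keep="first" is the default, so the preferred row survives.
      .drop_duplicates(keys)
      # Put the surviving rows back in their original order.
      .sort_index()
)
print(out)
```

If a group has no row with C, this keeps the first row of that group instead of dropping the group.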
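Or use SeriesGroupBy.idxmax to get, for each group, the index of the first row where columnE is C (or the first row of the group if no row has C), then DataFrame.loc to select those rows and Series.sort_values to restore the original order: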
idx = df['columnE'].eq('C').groupby([df['columnA'],df['columnB'],df['columnC']]).idxmax()
out = df.loc[idx.sort_values()]
print(out)
  columnA columnB  columnC columnD columnE
0       A       B       10       C       C
2       B       C       20       A       A
4       B       A       20       D       C
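To see what the groupby version is doing, it may help to look at the intermediate result. The sketch below rebuilds the sample frame from the question and splits the one-liner into steps; the names `keys`, `is_c` and `idx` are only illustrative:

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "columnA": ["A", "A", "B", "B", "B"],
    "columnB": ["B", "B", "C", "A", "A"],
    "columnC": [10, 10, 20, 20, 20],
    "columnD": ["C", "D", "A", "D", "D"],
    "columnE": ["C", "A", "A", "A", "C"],
})

keys = ["columnA", "columnB", "columnC"]

# Boolean mask: True where the row is the preferred one.
is_c = df["columnE"].eq("C")

# For each group, idxmax returns the index label of the first True;
# if the group has no True at all, it returns the group's first row.
idx = is_c.groupby([df[k] for k in keys]).idxmax()
print(idx)

# Sorting the index labels restores the original row order.
out = df.loc[idx.sort_values()]
print(out)
```

Here `idx` holds one index label per group: 0 for (A, B, 10), 4 for (B, A, 20) and 2 for (B, C, 20), the last one being the fallback case because that group has no C row.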