I am trying to remove duplicates from a dataset. Before using df.drop_duplicates(), I run df[df.duplicated()] to check which rows are treated as duplicates. Rows that I don't consider duplicates are returned; see the example below. All columns are checked.
How can I get accurate duplicate results and drop only the real duplicates?
city | price | year | manufacturer | cylinders | fuel | odometer |
---|---|---|---|---|---|---|
whistler | 26880 | 2016.0 | chrysler | NaN | gas | 49000.0 |
whistler | 17990 | 2010.0 | toyota | NaN | hybrid | 117000.0 |
whistler | 15890 | 2010.0 | audi | NaN | gas | 188000.0 |
whistler | 8800 | 2007.0 | nissan | NaN | gas | 163000.0 |
I was having the same issue and couldn't find a clear solution online, so here's what worked for me. After reading what @Mihaela mentioned about the default value for .duplicated(), I figured out what I was doing wrong. What I wanted (and I assume you want) is to return all duplicate rows so you can manually verify that they are, in fact, duplicates. By default, .duplicated() flags every occurrence after the first, so for a pair of duplicates you only ever see one row. If you want to return all rows with duplicate values, you must set keep=False.
df[df.duplicated()]
Example Output:
ID | date | value |
---|---|---|
1 | 3/1 | 35 |
2 | 3/1 | 27 |
3 | 3/1 | 89 |
df[df.duplicated(keep=False)]
Example Output:
ID | date | value |
---|---|---|
1 | 3/1 | 35 |
1 | 3/1 | 35 |
2 | 3/1 | 27 |
2 | 3/1 | 27 |
3 | 3/1 | 89 |
3 | 3/1 | 89 |
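To make this reproducible, here's a minimal sketch using toy data matching the tables above (the ID/date/value column names are just from the example; substitute your own):
import pandas as pd

# Toy data: each row appears exactly twice, mirroring the example above.
df = pd.DataFrame({
    "ID":    [1, 1, 2, 2, 3, 3],
    "date":  ["3/1"] * 6,
    "value": [35, 35, 27, 27, 89, 89],
})

# Default keep='first': only occurrences after the first are flagged,
# so each duplicate pair shows up as a single row.
print(df[df.duplicated()])

# keep=False: every member of a duplicate set is flagged,
# so both rows of each pair are returned.
print(df[df.duplicated(keep=False)])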
If your dataframe is not already sorted, the results come back interleaved in a way that makes it hard to see that the rows really are duplicates (which happened to me). Chaining sort_values() to the end groups each set of duplicate rows together, making it much easier to manually verify that they are identical. Just set the by=['col'] parameter to whichever column(s) make sense for your use case.
Example Output (unsorted):
ID | date | value |
---|---|---|
1 | 3/1 | 35 |
1 | 3/2 | 37 |
2 | 3/2 | 40 |
3 | 3/1 | 89 |
2 | 3/1 | 27 |
3 | 3/1 | 89 |
2 | 3/2 | 40 |
1 | 3/1 | 35 |
2 | 3/1 | 27 |
1 | 3/2 | 37 |
Code:
df[df.duplicated(keep=False)].sort_values(by=['ID', 'date'])
Example Output (sorted):
ID | date | value |
---|---|---|
1 | 3/1 | 35 |
1 | 3/1 | 35 |
1 | 3/2 | 37 |
1 | 3/2 | 37 |
2 | 3/1 | 27 |
2 | 3/1 | 27 |
2 | 3/2 | 40 |
2 | 3/2 | 40 |
3 | 3/1 | 89 |
3 | 3/1 | 89 |
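And here's a sketch of the full chain on the shuffled toy data (again, the ID/date/value names are just from the example):
import pandas as pd

# Toy data in shuffled order, mirroring the unsorted example above.
df = pd.DataFrame({
    "ID":    [1, 1, 2, 3, 2, 3, 2, 1, 2, 1],
    "date":  ["3/1", "3/2", "3/2", "3/1", "3/1", "3/1", "3/2", "3/1", "3/1", "3/2"],
    "value": [35, 37, 40, 89, 27, 89, 40, 35, 27, 37],
})

# Flag every member of each duplicate set, then sort so the
# duplicate rows sit next to each other for manual inspection.
dupes = df[df.duplicated(keep=False)].sort_values(by=["ID", "date"])
print(dupes)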
I encountered the same problem.
At first, it looks like
df.duplicated(subset='my_column_of_interest')
returns results which actually have unique values in the my_column_of_interest field.
This is not the case, though. The documentation shows that duplicated uses the keep parameter to decide which occurrences are flagged: 'first' (the default) flags every occurrence except the first, 'last' flags every occurrence except the last, and False flags them all.
This means that if a value is present twice in this column, running df.duplicated(subset='my_column_of_interest') flags only the second occurrence, so filtering with it returns that value just once (the first occurrence is not flagged as a duplicate, so it is left out of the results).
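A quick sketch to illustrate (my_column_of_interest is just the placeholder name from above):
import pandas as pd

# 'b' appears twice in the column of interest.
df = pd.DataFrame({"my_column_of_interest": ["a", "b", "b", "c"]})

# keep='first' (default): only the second 'b' is flagged -> one row returned.
print(df[df.duplicated(subset="my_column_of_interest")])

# keep='last': only the first 'b' is flagged -> one row returned.
print(df[df.duplicated(subset="my_column_of_interest", keep="last")])

# keep=False: both 'b' rows are flagged -> two rows returned.
print(df[df.duplicated(subset="my_column_of_interest", keep=False)])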