Pandas duplicated() returns some non-duplicate values?

I am trying to remove duplicates from a dataset.

Before using df.drop_duplicates(), I run df[df.duplicated()] to check which rows are treated as duplicates. Rows that I don't consider duplicates are returned; see the example below. All columns are checked.

How to get accurate duplicate results and drop real duplicates?

city      price  year    manufacturer  cylinders  fuel    odometer
whistler  26880  2016.0  chrysler      NaN        gas     49000.0
whistler  17990  2010.0  toyota        NaN        hybrid  117000.0
whistler  15890  2010.0  audi          NaN        gas     188000.0
whistler   8800  2007.0  nissan        NaN        gas     163000.0

asked Sep 08 '25 by ValdemarT

2 Answers

I was having the same issue and couldn't find a clear solution online, so here's what worked for me. After reading what @Mihaela mentioned about the default value for .duplicated(), I figured out what I was doing wrong. What I wanted (and I assume you want) was to return all duplicate rows so I could manually verify that they were, in fact, duplicates. By default (keep='first'), .duplicated() flags every occurrence except the first, so filtering on it hides one row from each set of duplicates. If you want to return all rows with duplicate values, set keep=False.

Returns only one row for each set of duplicate rows:

df[df.duplicated()]

Example Output:

ID date value
1 3/1 35
2 3/1 27
3 3/1 89

Returns ALL duplicated rows:

df[df.duplicated(keep=False)]

Example Output:

ID date value
1 3/1 35
1 3/1 35
2 3/1 27
2 3/1 27
3 3/1 89
3 3/1 89
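To make the difference concrete, here is a minimal, self-contained sketch using made-up data shaped like the example tables above:

```python
import pandas as pd

# Hypothetical data mirroring the example output above: three duplicate pairs
df = pd.DataFrame({
    "ID":    [1, 1, 2, 2, 3, 3],
    "date":  ["3/1", "3/1", "3/1", "3/1", "3/1", "3/1"],
    "value": [35, 35, 27, 27, 89, 89],
})

# Default (keep='first'): flags only the second occurrence of each pair
print(df[df.duplicated()])             # 3 rows, one per duplicate pair

# keep=False: flags every row that has a duplicate anywhere in the frame
print(df[df.duplicated(keep=False)])   # all 6 rows
```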

Additional Tip

If your dataframe is not already sorted, your results will return mixed in such a way that it can be hard to see that the rows are really duplicates (which happened to me). Chaining sort_values() to the end of the code made it so I could view the sets of duplicate rows together, making it much easier to manually verify that they were identical. Just set the by=['col'] param to be whichever column(s) make sense for your use-case.

Example Output (unsorted):

ID date value
1 3/1 35
1 3/2 37
2 3/2 40
3 3/1 89
2 3/1 27
3 3/1 89
2 3/2 40
1 3/1 35
2 3/1 27
1 3/2 37

Code:

df[df.duplicated(keep=False)].sort_values(by=['ID', 'date'])

Example Output (sorted):

ID date value
1 3/1 35
1 3/1 35
1 3/2 37
1 3/2 37
2 3/1 27
2 3/1 27
2 3/2 40
2 3/2 40
3 3/1 89
3 3/1 89
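The sorting tip can be reproduced end to end with hypothetical data matching the unsorted example above:

```python
import pandas as pd

# Hypothetical unsorted frame: every row has a duplicate somewhere
df = pd.DataFrame({
    "ID":    [1, 1, 2, 3, 2, 3, 2, 1, 2, 1],
    "date":  ["3/1", "3/2", "3/2", "3/1", "3/1", "3/1", "3/2", "3/1", "3/1", "3/2"],
    "value": [35, 37, 40, 89, 27, 89, 40, 35, 27, 37],
})

# Flag all duplicate rows, then sort so each set of duplicates is adjacent
dupes = df[df.duplicated(keep=False)].sort_values(by=["ID", "date"])
print(dupes)
```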
answered Sep 11 '25 by Alyssa Bonillas

Encountered the same problem.

At first, it looks like

df.duplicated(subset='my_column_of_interest') 

returns results which actually have unique values in my_column_of_interest field.

This is not the case, though. The documentation shows that duplicated uses the keep parameter: keep='first' (the default) flags every occurrence except the first, keep='last' flags every occurrence except the last, and keep=False flags all occurrences.

Which means that if you have a value present twice in this column, running df.duplicated(subset='my_column_of_interest') will return results that contain this value only once (the first occurrence is not flagged, so only the later one shows up).
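A small sketch with made-up data (the column name is just illustrative) shows this behaviour:

```python
import pandas as pd

# Hypothetical frame: the value 'b' appears twice in the column of interest
df = pd.DataFrame({
    "my_column_of_interest": ["a", "b", "b", "c"],
    "other": [1, 2, 3, 4],
})

# Default keep='first' flags only the second 'b',
# so the filtered result shows the duplicated value just once
print(df[df.duplicated(subset="my_column_of_interest")])

# keep=False flags both 'b' rows
print(df[df.duplicated(subset="my_column_of_interest", keep=False)])
```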

answered Sep 11 '25 by Mihaela Grigore