I am trying to remove duplicates from a dataset. Before using df.drop_duplicates(), I run df[df.duplicated()] to check which rows are treated as duplicates. Rows that I don't consider duplicates are returned; see the example below. All columns are checked.
How can I get accurate duplicate results and drop only the real duplicates?
city | price | year | manufacturer | cylinders | fuel | odometer |
---|---|---|---|---|---|---|
whistler | 26880 | 2016.0 | chrysler | NaN | gas | 49000.0 |
whistler | 17990 | 2010.0 | toyota | NaN | hybrid | 117000.0 |
whistler | 15890 | 2010.0 | audi | NaN | gas | 188000.0 |
whistler | 8800 | 2007.0 | nissan | NaN | gas | 163000.0 |
I was having the same issue and couldn't find a clear solution online, so here's what worked for me. After reading what @Mihaela mentioned about the default value for .duplicated(), I figured out what I was doing wrong. What I wanted (and I assume you want) is to return all duplicate rows so you can manually verify that they are, in fact, duplicates. By default, .duplicated() flags every occurrence after the first, so for a pair of duplicates you only ever see one row. If you want to return all rows with duplicate values, you must set keep=False.
df[df.duplicated()]
Example Output:
ID | date | value |
---|---|---|
1 | 3/1 | 35 |
2 | 3/1 | 27 |
3 | 3/1 | 89 |
df[df.duplicated(keep=False)]
Example Output:
ID | date | value |
---|---|---|
1 | 3/1 | 35 |
1 | 3/1 | 35 |
2 | 3/1 | 27 |
2 | 3/1 | 27 |
3 | 3/1 | 89 |
3 | 3/1 | 89 |
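To make this reproducible, here's a minimal sketch using toy data matching the tables above (the ID/date/value column names are just from the example; substitute your own):
import pandas as pd

# Toy data: each row appears exactly twice, mirroring the example above.
df = pd.DataFrame({
    "ID":    [1, 1, 2, 2, 3, 3],
    "date":  ["3/1"] * 6,
    "value": [35, 35, 27, 27, 89, 89],
})

# Default keep='first': only occurrences after the first are flagged,
# so each duplicate pair shows up as a single row.
print(df[df.duplicated()])

# keep=False: every member of a duplicate set is flagged,
# so both rows of each pair are returned.
print(df[df.duplicated(keep=False)])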
If your dataframe is not already sorted, the results come back interleaved in a way that makes it hard to see that the rows really are duplicates (which happened to me). Chaining sort_values() to the end groups each set of duplicate rows together, making it much easier to manually verify that they are identical. Just set the by=['col'] parameter to whichever column(s) make sense for your use case.
Example Output (unsorted):
ID | date | value |
---|---|---|
1 | 3/1 | 35 |
1 | 3/2 | 37 |
2 | 3/2 | 40 |
3 | 3/1 | 89 |
2 | 3/1 | 27 |
3 | 3/1 | 89 |
2 | 3/2 | 40 |
1 | 3/1 | 35 |
2 | 3/1 | 27 |
1 | 3/2 | 37 |
Code:
df[df.duplicated(keep=False)].sort_values(by=['ID', 'date'])
Example Output (sorted):
ID | date | value |
---|---|---|
1 | 3/1 | 35 |
1 | 3/1 | 35 |
1 | 3/2 | 37 |
1 | 3/2 | 37 |
2 | 3/1 | 27 |
2 | 3/1 | 27 |
2 | 3/2 | 40 |
2 | 3/2 | 40 |
3 | 3/1 | 89 |
3 | 3/1 | 89 |
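And here's a sketch of the full chain on the shuffled toy data (again, the ID/date/value names are just from the example):
import pandas as pd

# Toy data in shuffled order, mirroring the unsorted example above.
df = pd.DataFrame({
    "ID":    [1, 1, 2, 3, 2, 3, 2, 1, 2, 1],
    "date":  ["3/1", "3/2", "3/2", "3/1", "3/1", "3/1", "3/2", "3/1", "3/1", "3/2"],
    "value": [35, 37, 40, 89, 27, 89, 40, 35, 27, 37],
})

# Flag every member of each duplicate set, then sort so the
# duplicate rows sit next to each other for manual inspection.
dupes = df[df.duplicated(keep=False)].sort_values(by=["ID", "date"])
print(dupes)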
I encountered the same problem.
At first, it looks like
df.duplicated(subset='my_column_of_interest')
returns results which actually have unique values in the my_column_of_interest field.
This is not the case, though. The documentation shows that duplicated uses the keep parameter to decide which occurrences are flagged: 'first' (the default) flags every occurrence except the first, 'last' flags every occurrence except the last, and False flags them all.
This means that if a value is present twice in this column, running df.duplicated(subset='my_column_of_interest') flags only the second occurrence, so filtering with it returns that value just once (the first occurrence is not flagged as a duplicate, so it is left out of the results).
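A quick sketch to illustrate (my_column_of_interest is just the placeholder name from above):
import pandas as pd

# 'b' appears twice in the column of interest.
df = pd.DataFrame({"my_column_of_interest": ["a", "b", "b", "c"]})

# keep='first' (default): only the second 'b' is flagged -> one row returned.
print(df[df.duplicated(subset="my_column_of_interest")])

# keep='last': only the first 'b' is flagged -> one row returned.
print(df[df.duplicated(subset="my_column_of_interest", keep="last")])

# keep=False: both 'b' rows are flagged -> two rows returned.
print(df[df.duplicated(subset="my_column_of_interest", keep=False)])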