Subsetting duplicate rows in Python

Question

I have a data frame that looks like:

Name    State    Gender    OtherVariables
Sam     CO       M
Sam     CO       F
Sam     CO       M
Jim     CO       M
Jim     WY       M

The following code gives me all of the duplicate names (Sam and Jim):

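def list_duplicates(seq):
  # Return the values that occur more than once in seq.
  seen = set()
  seen_add = seen.add
  # seen_add(x) returns None, so x is kept only if it was already in seen.
  seen_twice = set(x for x in seq if x in seen or seen_add(x))
  return list(seen_twice)

dups = list_duplicates(df.Name)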

But what I want is:

Name    State    Gender
Sam     CO       M

I only want those rows with the same Name, State and Gender. So Jim shouldn't be there. The "OtherVariables" are different for each row.

jezrael · Accepted Answer

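You can use boolean indexing with a mask created by duplicated. By default it marks every repeat of a row except the first occurrence, so only the second Sam/CO/M row is returned: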

df = df[df.duplicated(['Name', 'State', 'Gender'])]
print(df)

  Name State Gender
2  Sam    CO      M
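
To make the behaviour of the mask easier to follow, here is a minimal, self-contained sketch that rebuilds the sample data from the question. The values in OtherVariables are made-up placeholders, and the names key_cols, mask and result are only illustrative:

```python
import pandas as pd

# Sample data reconstructed from the question.
# The OtherVariables values are placeholders, not from the original post.
df = pd.DataFrame({
    "Name": ["Sam", "Sam", "Sam", "Jim", "Jim"],
    "State": ["CO", "CO", "CO", "CO", "WY"],
    "Gender": ["M", "F", "M", "M", "M"],
    "OtherVariables": [1, 2, 3, 4, 5],
})

key_cols = ["Name", "State", "Gender"]

# True for each row whose key columns repeat an earlier row.
mask = df.duplicated(subset=key_cols)

# Keep only the flagged rows and only the key columns.
result = df.loc[mask, key_cols]

print(mask.tolist())  # [False, False, True, False, False]
print(result)
```

Only row 2 is flagged, because it is the only row whose Name, State and Gender all match an earlier row. Jim is excluded because his two rows differ in State.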

Jivan · Answer

Use pandas.DataFrame.duplicated with the subset argument.

Example:

duplicates = df.duplicated(subset=['Name', 'State', 'Gender'])
df[duplicates]

See the pandas.DataFrame.duplicated documentation.
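
The keep argument of duplicated controls which members of a repeated group are flagged. The sketch below is one possible variation rather than part of either answer: it uses keep=False to select every row of a repeated group, then drop_duplicates to reduce that to one row per combination. The sample data is rebuilt from the question, with placeholder values for OtherVariables:

```python
import pandas as pd

# Sample data reconstructed from the question.
# The OtherVariables values are placeholders, not from the original post.
df = pd.DataFrame({
    "Name": ["Sam", "Sam", "Sam", "Jim", "Jim"],
    "State": ["CO", "CO", "CO", "CO", "WY"],
    "Gender": ["M", "F", "M", "M", "M"],
    "OtherVariables": [1, 2, 3, 4, 5],
})

key_cols = ["Name", "State", "Gender"]

# keep=False flags every row of a repeated group, including the first one.
all_members = df[df.duplicated(subset=key_cols, keep=False)]

# Collapse to a single row per repeated Name/State/Gender combination.
unique_combos = all_members[key_cols].drop_duplicates()

print(all_members)
print(unique_combos)
```

With this data, all_members holds rows 0 and 2 (both Sam/CO/M rows, with their differing OtherVariables), while unique_combos holds the single Sam/CO/M combination.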

Tags: python-3.x, pandas

J Sedai
