
Subsetting duplicate rows in Python

I have a data frame that looks like:

Name    State    Gender    OtherVariables
Sam     CO       M
Sam     CO       F
Sam     CO       M
Jim     CO       M
Jim     WY       M

The following code gives me all of the duplicated names (Sam and Jim):

def list_duplicates(seq):
  seen = set()
  seen_add = seen.add
  seen_twice = set(x for x in seq if x in seen or seen_add(x))
  return list(seen_twice)

dups = list_duplicates(df.Name)  # the column is "Name"; df.name would raise AttributeError

But what I want is:

Name    State    Gender
Sam     CO       M

I only want rows that share the same Name, State and Gender, so Jim shouldn't be included. The "OtherVariables" values are different for each row.

J Sedai asked Feb 24 '26 05:02


2 Answers

You can use boolean indexing with a mask created by duplicated:

df = df[df.duplicated(['Name','State','Gender'])]
print(df)

  Name State Gender
2  Sam    CO      M
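
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame from the question. The OtherVariables values are made up for illustration, and the names key, mask and repeats are my own. By default duplicated uses keep='first', so only the second and later occurrences of a Name/State/Gender combination are flagged.

```python
import pandas as pd

# Sample data from the question; the OtherVariables values are invented.
df = pd.DataFrame({
    "Name": ["Sam", "Sam", "Sam", "Jim", "Jim"],
    "State": ["CO", "CO", "CO", "CO", "WY"],
    "Gender": ["M", "F", "M", "M", "M"],
    "OtherVariables": [1, 2, 3, 4, 5],
})

key = ["Name", "State", "Gender"]

# True for each row whose key columns already appeared in an earlier row
# (keep="first" is the default, so the first occurrence stays False).
mask = df.duplicated(key)

# Keep only the key columns of the flagged rows: row 2 (Sam, CO, M).
repeats = df.loc[mask, key]
print(repeats)
```

Jim is not flagged because his two rows differ in State, so no full Name/State/Gender combination repeats.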
jezrael answered Feb 27 '26 02:02


Use pandas.DataFrame.duplicated with the subset argument.

Example:

duplicates = df.duplicated(subset=['Name', 'State', 'Gender'])
df[duplicates]

See the documentation
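
If you need every row that belongs to a repeated combination, or exactly one row per repeated combination, the keep argument and drop_duplicates can help. This is only a sketch: the OtherVariables values are invented, and the names key, all_members and combos are mine, not part of the question.

```python
import pandas as pd

# Sample data from the question; the OtherVariables values are invented.
df = pd.DataFrame({
    "Name": ["Sam", "Sam", "Sam", "Jim", "Jim"],
    "State": ["CO", "CO", "CO", "CO", "WY"],
    "Gender": ["M", "F", "M", "M", "M"],
    "OtherVariables": [1, 2, 3, 4, 5],
})

key = ["Name", "State", "Gender"]

# keep=False flags every member of a repeated group, including the first one.
all_members = df[df.duplicated(subset=key, keep=False)]

# Collapse to one row per repeated combination, dropping OtherVariables.
combos = all_members[key].drop_duplicates().reset_index(drop=True)
print(combos)
```

Here all_members holds rows 0 and 2 (both Sam, CO, M), while combos holds the single Sam/CO/M row no matter how many times that combination repeats.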

Jivan answered Feb 27 '26 02:02


