I have a data frame that looks like:
Name State Gender OtherVariables
Sam CO M
Sam CO F
Sam CO M
Jim CO M
Jim WY M
The following code gives me all of the duplicate names (Sam and Jim):
def list_duplicates(seq):
    seen = set()
    seen_add = seen.add
    # seen_add(x) returns None, so x is collected only if it was already in seen
    seen_twice = set(x for x in seq if x in seen or seen_add(x))
    return list(seen_twice)

dups = list_duplicates(df.Name)
But what I want is:
Name State Gender
Sam CO M
I only want those rows with the same Name, State and Gender. So Jim shouldn't be there. The "OtherVariables" are different for each row.
You can use boolean indexing with a mask created by duplicated:
df = df[df.duplicated(['Name','State','Gender'])]
print(df)
  Name State Gender
2  Sam    CO      M
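Note that only the row with index 2 comes back: by default, duplicated leaves the first occurrence unflagged and marks only the later repeats. The sketch below rebuilds the frame from the question to show the difference between the default and keep=False. The OtherVariables values are placeholders invented for illustration, since the question does not show them.

```python
import pandas as pd

# Sample frame from the question; the OtherVariables values are made up.
df = pd.DataFrame({
    "Name": ["Sam", "Sam", "Sam", "Jim", "Jim"],
    "State": ["CO", "CO", "CO", "CO", "WY"],
    "Gender": ["M", "F", "M", "M", "M"],
    "OtherVariables": [1, 2, 3, 4, 5],
})

cols = ["Name", "State", "Gender"]

# Default keep="first": the first occurrence is not flagged, only later repeats.
later_repeats = df[df.duplicated(subset=cols)]

# keep=False: every row that belongs to a duplicated group is flagged.
all_members = df[df.duplicated(subset=cols, keep=False)]

print(later_repeats)
print(all_members)
```

Use keep=False if you want to see every row that shares its Name, State and Gender with another row, including the first one.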
Use pandas.DataFrame.duplicated with the subset argument.
Example:
duplicates = df.duplicated(subset=['Name', 'State', 'Gender'])
df[duplicates]
See the documentation for pandas.DataFrame.duplicated.
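If a combination occurs more than twice, the mask above returns one row per extra occurrence. A possible way to get each duplicated combination exactly once, as in the desired output from the question, is to follow it with drop_duplicates. This is only a sketch: the frame below is a hypothetical variant of the question's data with one additional Sam/CO/M row, and the OtherVariables values are invented.

```python
import pandas as pd

# Hypothetical variant of the question's data: Sam/CO/M now appears three times.
df = pd.DataFrame({
    "Name": ["Sam", "Sam", "Sam", "Sam", "Jim", "Jim"],
    "State": ["CO", "CO", "CO", "CO", "CO", "WY"],
    "Gender": ["M", "F", "M", "M", "M", "M"],
    "OtherVariables": [1, 2, 3, 4, 5, 6],
})

cols = ["Name", "State", "Gender"]

# Rows that repeat an earlier Name/State/Gender combination (two rows here).
dup_rows = df[df.duplicated(subset=cols)]

# Collapse them so each duplicated combination is listed once.
unique_dups = dup_rows[cols].drop_duplicates().reset_index(drop=True)

# Alternative view: how many times each duplicated combination occurs.
counts = df.groupby(cols).size()
dup_counts = counts[counts > 1]

print(unique_dups)
print(dup_counts)
```

The groupby variant is handy when you also need the number of occurrences, while the drop_duplicates variant keeps the result as a plain frame with the three key columns.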