I have a dataframe with 3 columns in Python:
Name1 Name2 Value
Juan  Ale   1
Ale   Juan  1
I would like to eliminate duplicates based on the combination of values in columns Name1 and Name2, regardless of their order.
In my example both rows contain the same pair of names (only in a different order), so I would like to delete the second row and keep just the first one. The end result should be:
Name1 Name2 Value
Juan  Ale   1
Any idea will be really appreciated!
Using the pandas.DataFrame.drop_duplicates() method you can remove duplicate rows from a DataFrame, either across all columns or on a selected subset of columns. Note that it compares values column by column, so on its own it does not treat (Juan, Ale) and (Ale, Juan) as duplicates.
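As a quick illustration of that behaviour, here is a minimal sketch on the question's data. The frame is rebuilt by hand from the example, and `plain` is just an illustrative name.

```python
import pandas as pd

# Example frame rebuilt from the question.
df = pd.DataFrame({"Name1": ["Juan", "Ale"],
                   "Name2": ["Ale", "Juan"],
                   "Value": [1, 1]})

# subset= restricts the comparison to the two name columns,
# keep="first" (the default) retains the first row of each duplicate group.
plain = df.drop_duplicates(subset=["Name1", "Name2"], keep="first")

# Both rows survive: the pair is compared column by column, not as a combination.
print(len(plain))
```

So the names have to be put into a common order (or an unordered container) before duplicates can be detected.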
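Using np.sort with duplicated: sort the two names within each row so that both orderings look the same, then flag the repeats. The expression below selects the duplicated rows; put ~ in front of the mask to keep the first occurrence instead.
df[pd.DataFrame(np.sort(df[['Name1', 'Name2']].values, axis=1)).duplicated()]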
Out[614]: 
  Name1 Name2  Value
1   Ale  Juan      1
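To get the result the question asks for (keep the first row, drop the reversed one), the same idea can be wrapped up as follows. This is only a sketch: `drop_unordered_duplicates` is a hypothetical helper name, not a pandas function, and the frame is rebuilt from the question's example.

```python
import numpy as np
import pandas as pd


def drop_unordered_duplicates(frame, cols):
    """Keep the first row for each unordered combination of values in `cols`."""
    # Sort the values within each row so (Juan, Ale) and (Ale, Juan) match.
    sorted_pairs = np.sort(frame[cols].to_numpy(), axis=1)
    # duplicated() marks every repeat after the first occurrence.
    is_repeat = pd.DataFrame(sorted_pairs).duplicated().to_numpy()
    # A plain NumPy mask is applied by position, so the frame's index does not matter.
    return frame[~is_repeat]


df = pd.DataFrame({"Name1": ["Juan", "Ale"],
                   "Name2": ["Ale", "Juan"],
                   "Value": [1, 1]})
result = drop_unordered_duplicates(df, ["Name1", "Name2"])
print(result)
```

Converting the mask to a NumPy array avoids relying on the helper frame's default index lining up with the index of df.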
Performance
df = pd.concat([df] * 100000, ignore_index=True)  # fresh 0..n-1 index so the boolean mask lines up row by row
%timeit df[pd.DataFrame(np.sort(df[['Name1','Name2']].values,1)).duplicated()]
10 loops, best of 3: 69.3 ms per loop
%timeit df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]
1 loop, best of 3: 3.72 s per loop
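%timeit only works inside IPython or Jupyter. Outside of it, a comparable measurement could be sketched with the standard timeit module. The helper names `sort_mask` and `frozenset_mask` are made up for this example, and the frame is kept much smaller than the 100,000 repeats above so it finishes quickly; absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"Name1": ["Juan", "Ale"],
                   "Name2": ["Ale", "Juan"],
                   "Value": [1, 1]})
# Smaller than the benchmark above so the sketch runs in well under a second.
big = pd.concat([df] * 1000, ignore_index=True)
cols = ["Name1", "Name2"]


def sort_mask(frame):
    # Row-wise sort of the names, then flag repeats.
    return pd.DataFrame(np.sort(frame[cols].to_numpy(), axis=1)).duplicated().to_numpy()


def frozenset_mask(frame):
    # One hashable, unordered set per row, then flag repeats.
    return frame[cols].apply(frozenset, axis=1).duplicated().to_numpy()


t_sort = timeit.timeit(lambda: sort_mask(big), number=5)
t_frozen = timeit.timeit(lambda: frozenset_mask(big), number=5)
print(f"sort: {t_sort:.4f}s  frozenset: {t_frozen:.4f}s")
```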
You can convert each row's pair of names to a frozenset and use pd.Series.duplicated.
res = df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]
print(res)
  Name1 Name2  Value
0  Juan   Ale      1
frozenset is necessary instead of set because duplicated uses hashing to check for duplicates, and a mutable set is not hashable.
This approach scales better with the number of columns than with the number of rows. For a large number of rows, use @Wen's sort-based algorithm.
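The hashing point can be checked in plain Python, without pandas. A small sketch (variable names are illustrative):

```python
pair = {"Juan", "Ale"}

try:
    hash(pair)  # set is mutable, so it cannot be hashed
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

# frozenset is immutable; equality and hash depend only on the elements, not their order.
same_value = frozenset(["Juan", "Ale"]) == frozenset(["Ale", "Juan"])
same_hash = hash(frozenset(["Juan", "Ale"])) == hash(frozenset(["Ale", "Juan"]))

print(set_is_hashable, same_value, same_hash)
```

This is why the two rows of the example collapse to the same key once each pair of names is turned into a frozenset.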