
pandas drop row if it is a subset of another row

Tags:

python

pandas

I've searched around for quite a bit, but nothing seems to solve the issue.

Suppose df looks like this:

import pandas as pd
import numpy as np

df = pd.DataFrame([['a','b','c'], ['a',np.nan,'b'], [np.nan, 'b', 'a'], ['a', 'd', 'b']])
df
     0    1  2
0    a    b  c
1    a  NaN  b
2  NaN    b  a
3    a    d  b

Desired output is:

     0    1  2
0    a    b  c
3    a    d  b

Rows 1 and 2 are subsets of row 0, and hence I'd like to drop them. When checking whether a row is a subset of any other, NaN is not considered. Thus row 1 becomes {'a', 'b'}, and thereby a subset of row 0's {'a', 'b', 'c'}.
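
Concretely, the check I have in mind (with NaN values dropped first) is plain set containment:

row0 = {'a', 'b', 'c'}
row1 = {'a', 'b'}      # NaN ignored
row1 <= row0           # True, so row 1 should be dropped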

What I've tried so far is to make sets:

df.ffill(axis=1).bfill(axis=1).apply(set, axis=1)

which yields:

0    {c, a, b}
1       {a, b}
2       {a, b}
3    {d, a, b}

But I'm stuck here. pd.DataFrame.drop_duplicates doesn't seem to help me here.
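
For instance, converting the rows to frozensets (so they're hashable) and calling drop_duplicates only removes row 2, the exact duplicate of row 1; row 1 itself survives, even though it's a strict subset of row 0:

sets = df.ffill(axis=1).bfill(axis=1).apply(frozenset, axis=1)
sets.drop_duplicates()  # keeps rows 0, 1, 3 - row 1 is a subset, not a duplicate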

Any help is greatly appreciated :)

Chris asked Oct 29 '25

1 Answer

This is tough. Ideally you want to:

  • Stick with Pandas vectorized operations rather than looping through rows. (At first I thought of pd.Index, which behaves like a set and has some set-op-like methods; a quick sketch of that idea follows below.)
  • Use hashtable-like data structures for set membership testing wherever possible.

Both of those are tricky to do here because of the particular conditions, and the time complexity may get hairy as a result. (There may well be a slicker answer than this one.) But generally, once you move from exact-duplicate testing to subset testing, things get harder.
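
For reference, here's roughly what that pd.Index idea looks like for a single pairwise check. It works, but you'd still need an explicit pairwise loop over rows, which is why I didn't pursue it (the names here are just for illustration):

>>> a = pd.Index(['a', 'b', 'c'])
>>> b = pd.Index(['a', 'b'])
>>> b.difference(a).empty  # nothing in b is missing from a -> b is a subset
True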

All that said, you can:

  1. Convert the DataFrame to a nested list, cutting down as much as possible on the unnecessary overhead of iterating over a Pandas data structure.
  2. Use set.issuperset in a greedy any() call to find indices of duplicates, taking advantage of the fact that frozenset is hashable (thanks to the other answer here).

The complexity is still O(N²), or something close to it, but for moderately sized data this might be sufficient.

>>> df = pd.DataFrame([['a','b','c'], ['a',np.nan,'b'], [np.nan, 'b', 'a'], ['a', 'd', 'b']])
>>> 
>>> seen = set()
>>> add = seen.add  # bind the method once rather than on every iteration
>>> dupes = []
>>> 
>>> for pos, row in enumerate(df.values.tolist()):
...     # keep only the strings; this discards NaN, which is a float
...     vals = frozenset(i for i in row if isinstance(i, str))
...     # the row is a dupe if any earlier row's set contains all of its values
...     if any(i.issuperset(vals) for i in seen):
...         dupes.append(pos)
...     add(vals)
... 
>>> dupes
[1, 2]

That gets you the positional indices to drop; with the default RangeIndex here they coincide with the labels, so they can go straight to DataFrame.drop(). (One caveat: this greedy single pass only flags a subset that appears after its superset, so if a superset can show up later in the frame, you'd want to process rows in decreasing order of set size first.)
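
>>> df.drop(dupes)
   0  1  2
0  a  b  c
3  a  d  b

If your frame doesn't have the default RangeIndex, drop by position with df.drop(df.index[dupes]) instead.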

Brad Solomon answered Oct 30 '25


