
pandas drop row if it is a subset of another row

Tags:

python

pandas

I've searched around for quite a bit, but nothing seems to solve the issue.

Suppose df looks like this:

import pandas as pd
import numpy as np

df = pd.DataFrame([['a','b','c'], ['a',np.nan,'b'], [np.nan, 'b', 'a'], ['a', 'd', 'b']])
df
     0    1  2
0    a    b  c
1    a  NaN  b
2  NaN    b  a
3    a    d  b

Desired output is:

     0    1  2
0    a    b  c
3    a    d  b

Rows 1 and 2 are subsets of row 0, and hence I'd like to drop them. When checking whether a row is a subset of any other, NaN is not considered. Thus row 1 becomes {'a', 'b'}, and thereby a subset of row 0's {'a', 'b', 'c'}.
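
Concretely, the check I have in mind (with NaN values dropped first) is plain set containment:

row0 = {'a', 'b', 'c'}
row1 = {'a', 'b'}      # NaN ignored
row1 <= row0           # True, so row 1 should be dropped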

What I've tried so far is to make sets:

df.ffill(axis=1).bfill(axis=1).apply(set, axis=1)

which yields:

0    {c, a, b}
1       {a, b}
2       {a, b}
3    {d, a, b}

But I'm stuck here. pd.DataFrame.drop_duplicates doesn't seem to help me here.
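
For instance, converting the rows to frozensets (so they're hashable) and calling drop_duplicates only removes row 2, the exact duplicate of row 1; row 1 itself survives, even though it's a strict subset of row 0:

sets = df.ffill(axis=1).bfill(axis=1).apply(frozenset, axis=1)
sets.drop_duplicates()  # keeps rows 0, 1, 3 - row 1 is a subset, not a duplicate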

Any help is greatly appreciated :)

Chris asked Oct 29 '25

1 Answer

This is tough. Ideally you want to:

  • Stick with Pandas vectorized operations rather than looping through rows. (At first I thought of pd.Index, which behaves like a set and has some set-op-like methods; a quick sketch of that idea follows below.)
  • Use hashtable-like data structures for set membership testing wherever possible.

Both of those are tricky to do here because of the particular conditions, and the time complexity may get hairy as a result. (There may well be a slicker answer than this one.) But generally, once you move from exact-duplicate testing to subset testing, things get harder.
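
For reference, here's roughly what that pd.Index idea looks like for a single pairwise check. It works, but you'd still need an explicit pairwise loop over rows, which is why I didn't pursue it (the names here are just for illustration):

>>> a = pd.Index(['a', 'b', 'c'])
>>> b = pd.Index(['a', 'b'])
>>> b.difference(a).empty  # nothing in b is missing from a -> b is a subset
True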

All that said, you can:

  1. Convert the DataFrame to a nested list, cutting down as much as possible on the unnecessary overhead of iterating over a Pandas data structure.
  2. Use set.issuperset in a greedy any() call to find indices of duplicates, taking advantage of the fact that frozenset is hashable (thanks to the other answer here).

The complexity is still O(N²), or something close to it, but for moderately sized data this might be sufficient.

>>> df = pd.DataFrame([['a','b','c'], ['a',np.nan,'b'], [np.nan, 'b', 'a'], ['a', 'd', 'b']])
>>> 
>>> seen = set()
>>> add = seen.add  # bind the method once rather than on every iteration
>>> dupes = []
>>> 
>>> for pos, row in enumerate(df.values.tolist()):
...     # keep only the strings; this discards NaN, which is a float
...     vals = frozenset(i for i in row if isinstance(i, str))
...     # the row is a dupe if any earlier row's set contains all of its values
...     if any(i.issuperset(vals) for i in seen):
...         dupes.append(pos)
...     add(vals)
... 
>>> dupes
[1, 2]

That gets you the positional indices to drop; with the default RangeIndex here they coincide with the labels, so they can go straight to DataFrame.drop(). (One caveat: this greedy single pass only flags a subset that appears after its superset, so if a superset can show up later in the frame, you'd want to process rows in decreasing order of set size first.)
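
>>> df.drop(dupes)
   0  1  2
0  a  b  c
3  a  d  b

If your frame doesn't have the default RangeIndex, drop by position with df.drop(df.index[dupes]) instead.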

Brad Solomon answered Oct 30 '25


