I've searched around for quite a bit, but nothing seems to solve the issue.
Suppose df looks like this:
import pandas as pd
import numpy as np
df = pd.DataFrame([['a','b','c'], ['a',np.nan,'b'], [np.nan, 'b', 'a'], ['a', 'd', 'b']])
df
     0    1  2
0    a    b  c
1    a  NaN  b
2  NaN    b  a
3    a    d  b
Desired output is:
     0    1  2
0    a    b  c
3    a    d  b
Rows 1 and 2 are subsets of row 0, and hence I'd like to drop them. When checking whether a row is a subset of any other, NaN is not considered. Thus row 1 becomes {'a','b'}, and is thereby a subset of row 0.
What I've tried so far is to make sets:
df.ffill(axis=1).bfill(axis=1).apply(set, axis=1)
which yields:
0    {c, a, b}
1       {a, b}
2       {a, b}
3    {d, a, b}
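Equivalently, the same sets can be built straight from the non-NaN values, without the fill step:

df.apply(lambda row: set(row.dropna()), axis=1)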
But I'm stuck at this point; pd.DataFrame.drop_duplicates doesn't help, since it only removes exact duplicates.
Any help is greatly appreciated :)
This is tough. Ideally you want to:

1. Hash each row down to something set-like (such as a frozenset, or perhaps a pd.Index, which behaves like a set and has some set-op-like methods).
2. Check each of those sets against the others for subset relationships.

Both of those things are tricky to do here because of the particular conditions, and the time complexity may get hairy as a result. (I'm not ruling out that there's a much slicker answer than this one.) But generally, once you go from exact-duplicate testing to subset testing, things become more difficult.
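For instance, pd.Index supports set operations like intersection() and difference():

>>> pd.Index(['a', 'b', 'c']).intersection(['a', 'b'])
Index(['a', 'b'], dtype='object')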
All that said, you can:

1. Build a frozenset from the non-NaN (string) values of each row.
2. Use set.issuperset in a greedy any() call to find indices of duplicates, taking advantage of the fact that frozenset is hashable (thanks to the other answer here).

The complexity is still roughly N^2, but for moderate-sized data this might be sufficient.
>>> import numpy as np
>>> import pandas as pd
>>> 
>>> df = pd.DataFrame([['a','b','c'], ['a',np.nan,'b'], [np.nan, 'b', 'a'], ['a', 'd', 'b']])
>>> 
>>> seen = set()
>>> add = seen.add  # look up the bound method once for use in the loop
>>> dupes = []
>>> 
>>> for pos, row in enumerate(df.values.tolist()):
...     # Keep only string values; NaN is a float, so it gets dropped here
...     vals = frozenset(i for i in row if isinstance(i, str))
...     # Flag the row if any previously seen set contains it
...     if any(i.issuperset(vals) for i in seen):
...         dupes.append(pos)
...     add(vals)
... 
>>> dupes
[1, 2]
That gets you the indices to drop via DataFrame.drop(). One caveat: because the scan is greedy, a row is only flagged once one of its supersets has already been seen, so a subset that appears before its superset won't be caught; processing rows in descending order of set size would cover that case.
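For example, dropping those labels from the original frame gives the desired output:

>>> df.drop(dupes)
   0  1  2
0  a  b  c
3  a  d  b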