I've searched around quite a bit, but nothing seems to solve the issue.
Suppose df looks like this:
import pandas as pd
import numpy as np
df = pd.DataFrame([['a','b','c'], ['a',np.nan,'b'], [np.nan, 'b', 'a'], ['a', 'd', 'b']])
df
0 1 2
0 a b c
1 a NaN b
2 NaN b a
3 a d b
Desired output is:
0 1 2
0 a b c
3 a d b
Rows 1 and 2 are subsets of row 0, so I'd like to drop them. When checking whether a row is a subset of any other, NaN values are ignored. Row 1 therefore reduces to {'a','b'}, which is a subset of row 0's {'a','b','c'}.
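In plain Python terms, the test I have in mind is an ordinary set subset check (the names here are just for illustration):
row0 = {'a', 'b', 'c'}  # row 0's values
row1 = {'a', 'b'}       # row 1's values with NaN ignored
row1.issubset(row0)     # True -> row 1 should be dropped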
What I've tried so far is to make sets:
df.ffill(axis=1).bfill(axis=1).apply(set, axis=1)
which yields:
0 {c, a, b}
1 {a, b}
2 {a, b}
3 {d, a, b}
But I'm stuck here, and pd.DataFrame.drop_duplicates doesn't seem to help.
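To illustrate (a small sketch using frozenset, since plain sets aren't hashable): converting each row to a frozenset of its non-NaN values and dropping duplicates removes only row 2, the exact duplicate of row 1, and leaves the subset row 1 in place.
sets = df.apply(lambda row: frozenset(row.dropna()), axis=1)
sets.drop_duplicates()  # keeps rows 0, 1, 3 -- row 1 survives because it equals no other set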
Any help is greatly appreciated :)
This is tough. Ideally you want to:

- treat each row as a set of its non-NaN values, and
- test each of those sets for being a subset of any other row's set.

(One candidate for the set representation is pd.Index, which behaves like a set and has some set-op-like methods.) Both of those things are tricky to do here because of the particular conditions, and the time complexity may get hairy as a result. (I'm not ruling out that there's a much slicker answer than this one.) But generally, once you go from exact-duplicate testing to subset testing, things become more difficult.
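For instance, a tiny sketch of the pd.Index idea (not used in the solution below), where a subset test can be expressed with Index.difference:
i_big = pd.Index(['a', 'b', 'c'])
i_small = pd.Index(['a', 'b'])
i_small.difference(i_big).empty  # True -> i_small is a "subset" of i_big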
All that said, you can:

- build a frozenset of the non-NaN values in each row, and
- use set.issuperset in a greedy any() call to find the indices of duplicate rows, taking advantage of the fact that frozenset is hashable (thanks to the other answer here).

The complexity is still O(N^2) or something close to it, but for moderate-sized data this might be sufficient.
>>> df = pd.DataFrame([['a','b','c'], ['a',np.nan,'b'], [np.nan, 'b', 'a'], ['a', 'd', 'b']])
>>>
>>> seen = set()
>>> add = seen.add
>>> dupes = []
>>>
>>> for pos, row in enumerate(df.values.tolist()):
...     # Keep only the strings; NaN is a float and is filtered out here
...     vals = frozenset(i for i in row if isinstance(i, str))
...     # Flag this row if an earlier row's set is a superset of it
...     if any(s.issuperset(vals) for s in seen):
...         dupes.append(pos)
...     add(vals)
...
>>> dupes
[1, 2]
That gets you the indices to drop via DataFrame.drop(). One caveat: this greedy single pass only flags a row if a superset of it has already been seen, so it relies on supersets appearing earlier in the frame (as row 0 does here); for arbitrary row order, you could sort rows by their count of non-NaN values, descending, before scanning.
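For completeness, the final step might look like this (continuing the session above):
>>> df.drop(dupes)
   0  1  2
0  a  b  c
3  a  d  b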