Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Intersecting two sets, retaining all (up to) three parts efficiently

Tags:

python

set

If you have two sets a and b and intersect them, there are three interesting parts (which may be empty): h(ead) elements of a not in b, i(ntersection) elements in both a and b, and t(ail) elements of b not in a.

For example: {1, 2, 3} & {2, 3, 4} -> h:{1}, i:{2, 3}, t:{4} (not actual Python code, clearly)

One very clean way to code that in Python:

h, i, t = a - b, a & b, b - a

I figure that this can be slightly more efficient though:

h, t = a - (i := a & b), b - i

Since it first computes the intersection and then subtracts only that from a and then b, which would help if i is small and a and b are large - although I suppose it depends on the implementation of the subtraction whether it's truly faster. It's not likely to be worse, as far as I can tell.

I was unable to find such an operator or function, but since I can imagine efficient implementations that would perform the three-way split of a and b into h, i, and t in fewer iterations, am I missing something like this, which may already exist?

from magical_set_stuff import hit

h, i, t = hit(a, b)
like image 836
Grismar Avatar asked Oct 25 '25 17:10

Grismar


2 Answers

It's not in Python, and I haven't seen such a thing in a 3rd-party library either.

Here's a perhaps unexpected approach that's largely insensitive to which sets are bigger than others, and to how much overlap among inputs there may be. I dreamed it up when facing a related problem: suppose you had 3 input sets, and wanted to derive the 7 interesting sets of overlaps (in A only, B only, C only, both A and B, both A and C, both B and C, or in all 3). This version strips that down to the 2-input case. In general, assign a unique power of 2 to each input, and use those as bit flags:

def hit(a, b):
    x2flags = defaultdict(int)
    for x in a:
        x2flags[x] = 1
    for x in b:
        x2flags[x] |= 2
    result = [None, set(), set(), set()]
    for x, flag in x2flags.items():
        result[flag].add(x)
    return result[1], result[3], result[2]
like image 56
Tim Peters Avatar answered Oct 27 '25 06:10

Tim Peters


I won't accept my own answer unless nobody manages to beat my own solution or any of the good and concise Python ones.

But for anyone interested in some numbers:

from random import randint
from timeit import timeit


def grismar(a: set, b: set):
    h, i, t = set(), set(), b.copy()
    for x in a:
        if x in t:
            i.add(x)
            t.remove(x)
        else:
            h.add(x)
    return h, i, t


def good(a: set, b: set):
    return a - b, a & b, b - a


def better(a: set, b: set):
    h, t = a - (i := a & b), b - i
    return h, i, t


def ok(a: set, b: set):
    return a - (a & b), a & b, b - (a & b)


from collections import defaultdict
def tim(a, b):
    x2flags = defaultdict(int)
    for x in a:
        x2flags[x] = 1
    for x in b:
        x2flags[x] |= 2
    result = [None, set(), set(), set()]
    for x, flag in x2flags.items():
        result[flag].add(x)
    return result[1], result[3], result[2]


def pychopath(a, b):
    h, t = set(), b.copy()
    h_add = h.add
    t_remove = t.remove
    i = {x for x in a
         if x in t and not t_remove(x) or h_add(x)}
    return h, i, t


def enke(a, b):
    t = b - (i := a - (h := a - b))
    return h, i, t


xs = set(randint(0, 10000) for _ in range(10000))
ys = set(randint(0, 10000) for _ in range(10000))

# validation
g = (f(xs, ys) for f in (grismar, good, better, ok, tim, enke))
l = set(tuple(tuple(sorted(s)) for s in t) for t in g)
assert len(l) == 1, 'functions are equivalent'

# warmup, not competing
timeit(lambda: grismar(xs, ys), number=500)

# competition
print('a - b, a & b, b - a ', timeit(lambda: good(xs, ys), number=10000))
print('a - (i := a & b), b - i ', timeit(lambda: better(xs, ys), number=10000))
print('a - (a & b), a & b, b - (a & b) ', timeit(lambda: ok(xs, ys), number=10000))
print('tim ', timeit(lambda: tim(xs, ys), number=10000))
print('grismar ', timeit(lambda: grismar(xs, ys), number=10000))
print('pychopath ', timeit(lambda: pychopath(xs, ys), number=10000))
print('b - (i := a - (h := a - b)) ', timeit(lambda: enke(xs, ys), number=10000))

Results:

a - b, a & b, b - a  5.6963334
a - (i := a & b), b - i  5.3934624
a - (a & b), a & b, b - (a & b)  9.7732018
tim  16.3080373
grismar  7.709292500000004
pychopath  6.76331460000074
b - (i := a - (h := a - b))  5.197220600000001

So far, the optimisation proposed by @enke in the comments appears to win out:

t = b - (i := a - (h := a - b))
return h, i, t

Edit: added @Pychopath's results, which is indeed substantially faster than my own, although @enke's result is still the one to beat (and likely won't be with just Python). If @enke posts their own answer, I'd happily accept it as the answer.

like image 33
Grismar Avatar answered Oct 27 '25 05:10

Grismar



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!