If you have two sets a and b and intersect them, there are three interesting parts (which may be empty): h(ead) elements of a not in b, i(ntersection) elements in both a and b, and t(ail) elements of b not in a.
For example: {1, 2, 3} & {2, 3, 4} -> h:{1}, i:{2, 3}, t:{4} (not actual Python code, clearly)
One very clean way to code that in Python:
h, i, t = a - b, a & b, b - a
I figure that this can be slightly more efficient though:
h, t = a - (i := a & b), b - i
Since it first computes the intersection and then subtracts only that from a and then b, which would help if i is small and a and b are large - although I suppose it depends on the implementation of the subtraction whether it's truly faster. It's not likely to be worse, as far as I can tell.
I was unable to find such an operator or function, but since I can imagine efficient implementations that would perform the three-way split of a and b into h, i, and t in fewer iterations, am I missing something like this, which may already exist?
from magical_set_stuff import hit
h, i, t = hit(a, b)
It's not in Python, and I haven't seen such a thing in a 3rd-party library either.
Here's a perhaps unexpected approach that's largely insensitive to which sets are bigger than others, and to how much overlap among inputs there may be. I dreamed it up when facing a related problem: suppose you had 3 input sets, and wanted to derive the 7 interesting sets of overlaps (in A only, B only, C only, both A and B, both A and C, both B and C, or in all 3). This version strips that down to the 2-input case. In general, assign a unique power of 2 to each input, and use those as bit flags:
def hit(a, b):
x2flags = defaultdict(int)
for x in a:
x2flags[x] = 1
for x in b:
x2flags[x] |= 2
result = [None, set(), set(), set()]
for x, flag in x2flags.items():
result[flag].add(x)
return result[1], result[3], result[2]
I won't accept my own answer unless nobody manages to beat my own solution or any of the good and concise Python ones.
But for anyone interested in some numbers:
from random import randint
from timeit import timeit
def grismar(a: set, b: set):
h, i, t = set(), set(), b.copy()
for x in a:
if x in t:
i.add(x)
t.remove(x)
else:
h.add(x)
return h, i, t
def good(a: set, b: set):
return a - b, a & b, b - a
def better(a: set, b: set):
h, t = a - (i := a & b), b - i
return h, i, t
def ok(a: set, b: set):
return a - (a & b), a & b, b - (a & b)
from collections import defaultdict
def tim(a, b):
x2flags = defaultdict(int)
for x in a:
x2flags[x] = 1
for x in b:
x2flags[x] |= 2
result = [None, set(), set(), set()]
for x, flag in x2flags.items():
result[flag].add(x)
return result[1], result[3], result[2]
def pychopath(a, b):
h, t = set(), b.copy()
h_add = h.add
t_remove = t.remove
i = {x for x in a
if x in t and not t_remove(x) or h_add(x)}
return h, i, t
def enke(a, b):
t = b - (i := a - (h := a - b))
return h, i, t
xs = set(randint(0, 10000) for _ in range(10000))
ys = set(randint(0, 10000) for _ in range(10000))
# validation
g = (f(xs, ys) for f in (grismar, good, better, ok, tim, enke))
l = set(tuple(tuple(sorted(s)) for s in t) for t in g)
assert len(l) == 1, 'functions are equivalent'
# warmup, not competing
timeit(lambda: grismar(xs, ys), number=500)
# competition
print('a - b, a & b, b - a ', timeit(lambda: good(xs, ys), number=10000))
print('a - (i := a & b), b - i ', timeit(lambda: better(xs, ys), number=10000))
print('a - (a & b), a & b, b - (a & b) ', timeit(lambda: ok(xs, ys), number=10000))
print('tim ', timeit(lambda: tim(xs, ys), number=10000))
print('grismar ', timeit(lambda: grismar(xs, ys), number=10000))
print('pychopath ', timeit(lambda: pychopath(xs, ys), number=10000))
print('b - (i := a - (h := a - b)) ', timeit(lambda: enke(xs, ys), number=10000))
Results:
a - b, a & b, b - a 5.6963334
a - (i := a & b), b - i 5.3934624
a - (a & b), a & b, b - (a & b) 9.7732018
tim 16.3080373
grismar 7.709292500000004
pychopath 6.76331460000074
b - (i := a - (h := a - b)) 5.197220600000001
So far, the optimisation proposed by @enke in the comments appears to win out:
t = b - (i := a - (h := a - b))
return h, i, t
Edit: added @Pychopath's results, which is indeed substantially faster than my own, although @enke's result is still the one to beat (and likely won't be with just Python). If @enke posts their own answer, I'd happily accept it as the answer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With