I have a set of bitstrings: {'0011', '1100', '1110'} (all bitstrings within a set are of same length).
I want to quickly find the bitstring of same length that has the smallest max-similarity to the set. Max-similarity can be computed as such:
def max_similarity(bitstring, set):
    max = 0
    for item in set:
        temp = 0
        for i in range(len(bitstring)):
            if bitstring[i] == item[i]:
                temp += 1
        if temp > max:
            max = temp
    return max
I know that I can iterate through all the possible bitstrings of that length, compute the max-similarity for each one and finally keep the smallest of these iterations. But this solves the problem in O(2^n). I would like to know if someone sees any quicker alternatives.
I've been playing around with Pythons XOR:
def int2bin(integer, digits):
    if integer >= 0:
        return bin(integer)[2:].zfill(digits)
    else:
        return bin(2**digits + integer)[2:]
def XOR(bitset):  
    intset = [int('{}'.format(bitstring), 2) for bitstring in bitset]
    digits = len(bitset.pop())
    if len(intset) == 1:
        return int2bin(~intset.pop(), digits)        
    else:
        curr = 0    
        while intset:
            curr = curr ^ intset.pop()
        return int2bin(curr, digits)
if __name__ == '__main__':
    bitset1 = {'0011', '1100', '1110'}
    bitset2 = {'01001', '11100', '10111'}
    print(XOR(bitset1))
    print(XOR(bitset2))
>>> python test.py
0001
00010
(Function int2bin stolen from here)
But I've found it works for some input, and not for others. In the test above it finds the correct solution for bitset2, but not for bitset1. Any solutions below O(2^n) available?
This question is in part algorithmic (what's the best algorithm to get to the solution) and in part a Python question (on what parts of Python to use to then efficiently implement that best algorithm).
On the algorithm: you define the max distance for a bit string to a set of (same size) bit strings to be the largest number of bits the target bit string differs on with any of the strings in the set. The goal of the algorithm is to find a new bit string with the same length as the strings in the set that has the lowest max distance.
It is assumed all the starting bit strings are different (as it is defined as a set, not a list). The distance you're computing is known as the Hamming distance, so you're looking for a new bit string with minimum Hamming distance to a starting set of strings.
Generating all possible bit strings of the right length and calculating the max distance to each starting string is brute forcing the problem, which can be optimised(*) using backtracking (discarding a result as soon as the lowest current max is exceeded for a candidate bit string).
(*: for people looking to correct my spelling, please consider the fact that I'm using UK English, not US English - feel free to propose improvements with that in mind)
However, the problem can also be viewed as follows.
For bit strings of length 1, the entire space of strings has only two options, {'0', '1'}. This can be visualised as '0' and '1' sitting at either end of a line segment of length 1, both a distance of 1 away from each other.
For bit strings of length 2, the entire space of strings has 4 options, namely the bit representations of 0 through 3 {'00', '01', '10', '11'} 0 is distance 1 away from 1 and 2, both of which are distance 1 away from 3. When visualised, they form the four corners of a square, none of them ever more than 2 steps from any other.
For bit strings of length 3, the entire space has 8 options, namely the bit representations of 0 through 7. When visualised, the form the 8 corners of a cube, none of them ever more than 3 steps from any other.
This pattern continues (into a 4D hypercube, 5D, etc.) and finding the answer to the problem effectively translates into: given a set of corners on one of these graphs, find the point with the lowest maximum distance to any of them.
An algorithm to find such a point, given a graph like that would be to:
This could probably be optimised further by keeping track of points visited as they are added to sets (for long bit strings), to avoid adding the same points over and over, quickly slowing down the given algorithm. I.e. instead of turning {'001'} into {'001', '101', '011', '000'}, you could go from [{'001'}] to [{'001'}, {'101', '011', '000'}] - the union of the sets still gets you all of the points reachable within 1 or fewer steps, but the next step in the series would be easier to compute, by finding all points that are one step farther out, but excluding points in the previous direction.
Finding points one step out is actually quite simple, if you turn the strings into the numbers the represent and calculate the exclusive bit-wise or of the numbers with all of the single '1'-bit numbers with the same bit string length, i.e. to find all points one step away from '001', you can xor 1 with 4, 2 and 1, yielding {5, 3, 0}, matching the correct points.
Putting all of that together in a tight bit of Python (without the optimisation for longer strings):
def closest(strings):
    if len(strings) == 1:
        return next(iter(strings))
    size = len(next(iter(strings)))
    points = [{int(s, 2)} for s in strings]
    powers = {1 << n for n in range(size)}
    d = 0
    while True:
        d += 1
        points = [{n ^ p for p in powers for n in nums} | nums for nums in points]
        intersection = set.intersection(*points)
        if len(intersection) > 0:
            return d, {f"{n:b}".zfill(size) for n in intersection}
print(closest({'1000', '0001', '0011'}))
Note that closest returns the actual distance and all of the optimal answers, not just one. Output:
(2, {'0000', '0010', '1001', '0001', '1011'})
Adding the discussed optimisation to closest:
def closest_optimised(strings):
    if len(strings) == 1:
        return next(iter(strings))
    size = len(next(iter(strings)))
    points = [({int(s, 2)}, {int(s, 2)}) for s in strings]
    powers = {1 << n for n in range(size)}
    d = 0
    while True:
        d += 1
        new_points = [{n ^ p for p in powers for n in rp} - ap for ap, rp in points]
        points = [(ap | np, np) for (ap, _), np in zip(points, new_points)]
        intersection = set.intersection(*[ap for ap, _ in points])
        if len(intersection) > 0:
            return d, {f"{n:b}".zfill(size) for n in intersection}
Note that running this through a profiler has the optimised code running in about half the time on average for these settings:
from random import randint
s = 10
x = 500
numbers = [randint(0, 2**s-1) for _ in range(x)]
number_strings = {f"{n:b}".zfill(s) for n in numbers}
print(number_strings)
print(closest_optimised(number_strings))
print(closest(number_strings))
Edit: Out of curiosity, I ran my example against the original given in the question and found that it frequently returns a far from optimal result. I haven't tried to find out why, but I figured it bears mentioning.
Someone pointed out that the OP may actually want the point with the maximum Hamming distance to all provided bit strings. With a similar approach:
def farthest(strings):
    s = next(iter(strings))
    size = len(s)
    if len(strings) == 1:
        return ''.join(['0' if c == '1' else '1' for c in s])
    all_visited = {int(s, 2) for s in strings}
    visited = [set(), all_visited]
    powers = {1 << n for n in range(size)}
    d = 0
    while True:
        d += 1
        visited.append({n ^ p for p in powers for n in visited[-1]} - all_visited)
        all_visited = all_visited | visited[-1]
        if len(all_visited) == 2**size:
            return d, {f"{n:b}".zfill(size) for n in visited[-1]}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With