I have a collection of four bitvectors, for example:
b1 = 00001010
b2 = 10100111
b3 = 10010010
b4 = 10111110
I would like to get the masks of those bits that are set in exactly 0, 1, 2, 3 or 4 of the given bitvectors. So m0 would be the mask of bits that are not set in any of the four bitvectors, m3 is the mask of those bits that are set in exactly three of the bitvectors, etc.:
m0 = 01000000
m1 = 00000001
m2 = 00111100
m3 = 10000000
m4 = 00000010
What is the fastest way to find these masks using bitwise operators?
I assume these have the least operations for 0 and 4 bits:
m0 = ~(b1 | b2 | b3 | b4)  // 4 ops
m4 = b1 & b2 & b3 & b4     // 3 ops
For the other options I'm not so sure my methods have the least operations:
m1 = ((b1 ^ b2) & ~(b3 | b4)) | (~(b1 | b2) & (b3 ^ b4)) // 9 operations
m2 = ((b1 ^ b2) & (b3 ^ b4)) | ((b1 ^ b3) & (b2 ^ b4)) | ((b1 ^ b4) & (b2 ^ b3)) // 11 operations
m3 = ((b1 ^ b2) & (b3 & b4)) | ((b1 & b2) & (b3 ^ b4)) // 7 operations
Is this the fastest way to calculate these masks or can I do it faster (in fewer operations)?
For most of the cases I need one or a few of these masks, not all of them at the same time.
(Note, in reality I will be doing this for 64 or 128 bit vectors. It is probably irrelevant, but I do it in C on a 32-bit x86 platform.)
14 operations for all masks.
The idea is to first sort the bits, using min = x & y and max = x | y as conditional swap. This costs 10 operations. Then simply extract the masks which costs 4 operations.
// Split in lower and upper half
var c1 = b1 & b2;
var c3 = b1 | b2;
var c2 = b3 & b4;
var c4 = b3 | b4;
// Sort within each half
var d1 = c1 & c2; // (b1 & b2) & (b3 & b4)
var d2 = c1 | c2; // (b1 & b2) | (b3 & b4)
var d3 = c3 & c4; // (b1 | b2) & (b3 | b4)
var d4 = c3 | c4; // (b1 | b2) | (b3 | b4)
// Sort middle
var e1 = d1;      // (b1 & b2) & (b3 & b4)
var e2 = d2 & d3; // ((b1 & b2) | (b3 & b4)) & ((b1 | b2) & (b3 | b4))
var e3 = d2 | d3; // ((b1 & b2) | (b3 & b4)) | ((b1 | b2) & (b3 | b4))
var e4 = d4;      // (b1 | b2) | (b3 | b4)
// Extract masks
var m4 = e1;      // (b1 & b2) & (b3 & b4)
var m3 = e2 ^ e1; // (((b1 & b2) | (b3 & b4)) & ((b1 | b2) & (b3 | b4))) ^ ((b1 & b2) & (b3 & b4))
var m2 = d3 ^ d2; // The same as e3 ^ e2, saves two operations if only m2 is required
var m1 = e4 ^ e3; // ((b1 | b2) | (b3 | b4)) ^ (((b1 & b2) | (b3 & b4)) | ((b1 | b2) & (b3 | b4)))
var m0 = ~e4;     // ~((b1 | b2) | (b3 | b4))
(The code is in C#, but it's trivial to convert this to C.)
If you use this code calculate only some masks and simply remove the lines that don't affect the result (a decent compiler should do this automatically). The performance isn't bad either:
m4: 3 operations
m3: 9 operations
m2: 7 operations
m1: 9 operations
m0: 4 operations
Start by considering the trivial case where the bit-vectors have length 1, i.e. they're just single bits. What you're really trying to do is to count the number of these bits that are set; mapping the count to the bitmask you want is then a relatively trivial exercise.
The trick is to find a way to count the bits using only bitwise operations (i.e. AND, OR, XOR and NOT), with each bit of the resulting count ending up in a separate variable. If you can do that, then you can do it simultaneously for as many bits as fit in a variable. This technique is known as bit-slicing or SIMD-within-a-register (SWAR).
So what you're basically trying to do is to implement an n-input single-bit binary adder (where, in your case, n = 4) using bitwise logic operations. Fortunately, this problem has been extensively studied by digital circuit designers.
The general solution involves maintaining an array of k bit vectors t1, t2, ..., tk (where 2k > n) storing the bits of the counts, and adding each input bit vector to the counts one at a time:
// initialize counts to zero
int count[k];
for (int j = 0; j < k; j++) count[j] = 0;
// count input bits
for (int i = 0; i < n; i++) {
    int carry = input[i];
    for (int j = 0; j < k && carry != 0; j++) {
        int temp = count[k];
        count[k] = carry ^ temp;
        carry = carry & temp;
    }
    // XXX: if carry != 0 here, some of the counts have overflowed
}
Then you can extract your bit masks from the counts:
int masks[n+1];
for (int i = 0; i <= n; i++) {
    masks[n] = ~0;  // initialize all mask bits to 1
    for (int j = 0; j < k; j++) {
        masks[n] &= (n & (1 << j) ? count[j] : ~count[j]);
    }
}
Of course, if the number of inputs is small and fixed, we can optimize the code for that specific value. For example, for n = 4, we can use:
// count input bits, store counts in bit-planes c0, c1, c2
int c0 = b0 ^ b1 ^ b2 ^ b3;
int c2 = b0 & b1 & b2 & b3;
int c1 = ((b0 & b1) | ((b0 ^ b1) & b2) | ((b0 ^ b1 ^ b2) & b3)) & ~c2;
// build masks from bit-planes
int m0 = ~c0 & ~c1 & ~c2;
int m1 =  c0 & ~c1 & ~c2;
int m2 = ~c0 &  c1 & ~c2;
int m3 =  c0 &  c1 & ~c2;
int m4 =  c2;  // XXX: count cannot be more than 4
A completely naïve compiler would generate 23 AND / OR / XOR operations and 9 NOTs from this code.  However, a decent compiler should cache the values of ~c0, ~c1 and ~c2, saving up to 6 NOTs, and probably also some of the repeated subexpressions like b0 & b1, b0 ^ b1, b0 ^ b1 ^ b2, ~c1 & ~c2 and c1 & ~c2, saving up to 6 AND/XORs, for a total of 17 AND/OR/XORs plus 3 NOTs = 20 ops, which is pretty close to Jonathan Mee's 19-op solution.
In fact, we can do a little better by realizing that we don't actually need to calculate c1, but can instead work with c12 = c1 | c2.  We can also optimize the mask building a bit further by noting that c2 cannot be set if c0 (or c1) is:
// count input bits, store counts in bit-planes c0, (c1 = c12 & ~c2), c2
int c0 = b0 ^ b1 ^ b2 ^ b3;
int c2 = b0 & b1 & b2 & b3;
int c12 = ((b0 & b1) | ((b0 ^ b1) & b2) | ((b0 ^ b1 ^ b2) & b3));
// build masks from bit-planes
int m0 = ~c0 & ~c12;
int m1 =  c0 & ~c12;       // c0 implies ~c2
int m2 = ~c0 &  c12 & ~c2;
int m3 =  c0 &  c12;       // c0 implies ~c2
int m4 =  c2;              // c2 implies ~c0 & ~c1
That's 19 AND/ORs and 5 NOTs for a naïve compiler, but trivial common subexpression optimization should reduce that to 15 AND/ORs and 3 NOTs, for a total of 18 ops.
(Of course, the real performance gain on modern processors will come from instruction reordering to reduce pipeline stalls. I suspect this code should do fairly well in that respect: while the masks obviously depend on the counts, and the counts on the inputs, there are no internal dependencies within either the counts or the masks, so there should be plenty of room for reordering.)
Update 23 Nov 2014: The code I had above earlier was untested, and contained a bug in the expression for c1 / c12.  I've fixed it now, and even managed to make it slightly more optimizable, saving one op after common subexpression elimination (but costing one extra op for a naïve compiler).  It still uses more ops than CodesInChaos' sort-based solution, though.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With