Bitcounting can be done in several ways, eg. with set bit iterator, unset bit iterator, pre-computed bits with lookup tables or parallel counting. As I have figured out by searching the web, unset bit iterator is fast when there are less unset bits, and set bit iterator the opposite. But when should you use parallel counting, MIT HAKMEM (seen below) in particular? It seems quite fast, although probably slower then lookup tables. Is it always better compared to set/unset bit in terms of speed? Are there some other conserns regarding which one to choose than speed and memory?
 int BitCount(unsigned int u) {
     unsigned int uCount;
     uCount = u - ((u >> 1) & 033333333333) - ((u >> 2) & 011111111111);
     return ((uCount + (uCount >> 3)) & 030707070707) % 63;
 }
Why choose one bit counting method over another? Well it really depends on your machine and the problem you're trying to solve. Note that all the instruction counts I give below are for a basic RISC processor and might not translate well to a more complicated beast like x86.
The HAKMEM algorithm you quoted will execute in 13 instructions but is unlikely to be very fast due to the modulus operator. By eye-balling it, it does look like it has some pretty good instruction level parallelism which should help if your processor is capable of exploiting that.
The algorithm Bo Persson presented is quite fast (2 + 5*pop(x) instructions) but only if the word is sparsely populated. It can also be modified to work on densely populated words. It also contain branches and doesn't have any significant instruction level parallelism.
EDIT: The table lookup method can also be very fast but does make memory accesses. If the entire table is in the L1 cache then it's probably one of the fastest algorithms. If the table isn't in cache then it's almost certainly one of the slowest.
The algorithm below is a variation of one of the HAKMEM algorithm and is presented in the book Hacker's Delight (I highly recommend this book if you like this sort of things). It executes in 19 instructions and is branch-free. It also doesn't use a division but does have a multiplication. It's also very economical in the way it uses registers by re-using the same mask as much as possible. Still no significant instruction level parallelism here that I can see.
int pop(unsigned x) {
  unsigned n;
  n = (x >> 1) & 0x77777777;
  x = x - n;
  n = (n >> 1) & 0x77777777;
  x = x - n;
  n = (n >> 1) & 0x77777777;
  x = x - n;
  x = (x + (x >> 4)) & 0x0F0F0F0F;
  x = x * 0x01010101;
  return x >> 24;
}
The Hacker's Delight book also presents a couple of even more specialised algorithms for 9-8-7 bit wide fields or using floating point operators. Note that most of the analysis I presented above were also partially taken from that book as well.
The fact is that there's a truck load of methods and the only way to be sure which works best in your particular situation is to measure and compare. I do realise that this is a pretty canned answer but the alternative is to know your processor and compiler inside out.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With