
Speeding up a bit test operation by using assembly

I have a performance problem - I can't beat the release version speed of the compiler generated code at all. It is 25% slower. The function I wrote is being called around 20 million times in my test, so making it run faster will pay off.

The code in C++ is pretty simple:

static inline char GetBit(char *data, size_t bit)
{ 
    return 0 != (data[bit / 8] & (1 << (bit % 8))); 
}
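For reference, here is a minimal self-contained copy of that function (the buffer contents below are arbitrary examples, not from the original post):

```cpp
#include <cstddef>

// The question's function, repeated here so the snippet is self-contained.
static inline char GetBit(const char *data, size_t bit)
{
    return 0 != (data[bit / 8] & (1 << (bit % 8)));
}
```

Called 20 million times, even one extra instruction per call adds up, which is why the inlining question below matters.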

And this is the version I wrote for 64-bit MASM:

    mov   rax, rdx  
    mov   r10, 8h
    xor   rdx, rdx  
    div   r10
    mov   al, byte ptr [rax+rcx]  
    mov   bl, 1h
    mov   cl, dl  
    shl   bl, cl  
    and   al, bl  
    shr   al, cl  
    ret 

Well, I'm not much of an assembly guy, but I don't think the compiler can make 25% faster code just by creating better assembly. So the trick is probably in the function call. The compiler respects the inline keyword for the C++ code and generates no call, but I just can't make it work for the asm code:

extern "C" inline char GetBitAsm(char *data, size_t bit);

I've disassembled the code using dumpbin and I can clearly see my code + function call, while no call is generated for the compiler's version:

    mov   rdx, qword ptr [bit]  
    mov   rcx, qword ptr [data]  
    call  GetBitAsm (013F588EFDh)  
    mov   byte ptr [isbit], al  

There are additionally two reads from and one write to memory, while the code the compiler generates probably does only one read. I read somewhere that the div opcode takes around 20 cycles, while a single memory access costs at least 100 cycles. So removing mov rdx and mov rcx, and passing the values in registers from the parent function instead, will, I think, do the trick.

Questions:

  1. Is that really the reason it runs so slowly?

  2. How can I make a function written in asm inline in the release version?

  3. How can I further improve my assembly code to make it even faster?

asked Oct 21 '25 by user3834213

2 Answers

The function call overhead and the DIV instruction in your assembly code are going to kill performance relative to any compiler's inlined code. The function call overhead alone may be greater than the number of cycles the compiler's code takes on average; the DIV instruction could cost several times more.

Memory accesses are often effectively free on modern processors because they can be satisfied from the processor's caches. In your assembly version, the memory accesses cost close to 0 cycles on average, because your code is probably slow enough that the processor can easily prefetch memory into its caches before it needs it. The compiler's code, on the other hand, is potentially fast enough that it consumes values from memory faster than the processor can fetch them, so it has to stall periodically waiting for the fetches to complete. So while memory access times will on average be higher in the compiler's code, this is only because it is much better optimized.

The best solution to your problem is to let the compiler do the optimization. To be quite frank, it seems to know how to generate better code than you do. Even an assembly expert would have a hard time improving on the compiler, and it would require looking at the problem at a wider scope than the instruction selection of this one function.

If you would still rather use your own assembly code, then use your compiler's inline assembly functionality and get rid of the DIV instruction. It still won't perform as well as the compiler's version, but that should get it a lot closer.
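As a sketch of what that could look like: GCC and Clang support extended inline asm, which lets the compiler inline the code with no call overhead (note that MSVC has no inline assembler in x64 mode, but it offers the `_bittest64` intrinsic in `<intrin.h>` for exactly this operation). The BT instruction does the whole job with no DIV at all, since it accepts the raw bit offset directly:

```cpp
#include <cstddef>

// Sketch, assuming GCC/Clang on x86-64: BT with a register bit offset and a
// memory operand selects the addressed bit directly, and SETC copies the
// carry flag (the tested bit) into the result byte.
static inline char GetBitAsm(const char *data, size_t bit)
{
    unsigned char result;
    __asm__("btq %2, (%1)\n\t"
            "setc %0"
            : "=r"(result)
            : "r"(data), "r"(bit)
            : "cc", "memory");
    return (char)result;
}
```

Because the function is a `static inline` C++ function containing inline asm, the compiler can inline it at each call site, eliminating the call, the argument stores, and the result store you saw in the dumpbin output.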

answered Oct 22 '25 by Ross Ridge


I'll take a long shot here and speculate a bit about what you are trying to do, so bear with me:

A few things strike me about your code (both the C++ and the assembly). The first, as others have mentioned, is that you use division and modulo. These operations are rather slow, and one of the reasons you will not be able to compete with your compiler is that it will most likely optimize them away.

You are working with powers of 2, and the computer was made for working with powers of 2. That means this is equivalent to your code:

static inline char GetBit(char *data, size_t bit)
{ 
    return 0 != (data[bit >> 3] & (1 << (bit & 0x07))); 
}

You can use this to improve your assembly, but it is not likely to give you much of a performance improvement.
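To convince yourself the two forms really are equivalent, a quick (hypothetical) side-by-side check:

```cpp
#include <cstddef>

// Division/modulo form, as in the question.
static inline char GetBitDiv(const char *data, size_t bit)
{
    return 0 != (data[bit / 8] & (1 << (bit % 8)));
}

// Shift/mask form; for unsigned operands, compilers typically perform this
// transformation themselves.
static inline char GetBitShift(const char *data, size_t bit)
{
    return 0 != (data[bit >> 3] & (1 << (bit & 0x07)));
}
```

The two agree for every bit index, which is why an optimizing compiler is free to emit shifts and masks instead of DIV.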

If, on the other hand, you are aiming to speed up your code, I suggest the following changes:

  1. In your large bitmask, change the base type to your processor's native word size: uint32_t on 32-bit machines and uint64_t on 64-bit machines.

  2. Also, divide your getBit() function into two functions, getWord() and getBit().

    getWord() should be something along the lines of:

    static inline uint32_t getWord(const uint32_t *data, size_t bit) { 
        return data[ bit / (sizeof(*data)*8) ]; // Again, the compiler will
                                                // most likely pick up that
                                                // this is a division by a
                                                // power of 2 and optimize
                                                // accordingly.
                                                // Check to be certain.
    }
    
    static inline uint32_t getBit(const uint32_t *data, size_t bit) { 
         return getWord(data, bit) & ((uint32_t)1 << (bit & (sizeof(*data)*8 - 1))); 
         // Or just % like above; check which is faster. 
    }
    
  3. The real speedup should come if you rewrite the code that uses this bitmask:

    1. If you are jumping around a lot in your buffer, you will probably only get a slight improvement from the suggestions above.

    2. If, however, you are iterating through the data in a linear fashion, I suggest changing your code to this:

       uint32_t mask = 1;
       uint32_t word;
       for (int bit = 0; bit < 2048; bit++) {
           word = getWord(buffer, bit); // You could also move this outside a smaller loop, but I'm not sure it's worth it.
           if (word & mask) {
               cout << "Bit " << bit << " is set." << endl;
           }
      
           // Most modern compilers will recognize the following as a ROL 
           // (ROtate Left) and replace it with one instruction.
           mask = (mask << 1 | mask >> (sizeof(mask)*8-1)); 
       }
      
      

    The reason this is a good idea is that the processor is optimized to work with native-size integers: you avoid problems with alignment, widening values in registers, and so on. You may also notice that by keeping the mask throughout the loop you avoid an extra shift/division, since the mask simply rolls around when it overflows.
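Putting the pieces together, here is a self-contained sketch of the linear scan (names like `countSetBits` are hypothetical examples, and the driver counts bits rather than printing them so it is easy to check):

```cpp
#include <cstddef>
#include <cstdint>

static inline uint32_t getWord(const uint32_t *data, size_t bit) {
    return data[bit / (sizeof(*data) * 8)];
}

// Example driver: walk nbits bits linearly, letting the mask roll around
// instead of recomputing a shift from the bit index on every iteration.
static int countSetBits(const uint32_t *buffer, size_t nbits) {
    uint32_t mask = 1;
    int count = 0;
    for (size_t bit = 0; bit < nbits; ++bit) {
        if (getWord(buffer, bit) & mask)
            ++count;
        // Compilers typically recognize this rotate-by-one idiom and emit a
        // single ROL instruction.
        mask = (mask << 1) | (mask >> (sizeof(mask) * 8 - 1));
    }
    return count;
}
```

Note that the mask naturally returns to 1 every 32 iterations, exactly when getWord() moves on to the next word, so the two stay in sync without any per-iteration modulo.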

answered Oct 22 '25 by Stian Svedenborg