I realize I've got a great lack of knowledge in that area (fancy way of saying I don't know jack about it).
Is there some documentation as to how and when to use them?
Branchless programming is a programming technique that eliminates the branches (if, switch, and other conditional statements) from the program. Although this is not much relevant these days with extremely powerful systems and usage of interpreted languages( especially dynamic typed ones).
It is just a collection of techniques to reduce the number of branches. Branches are things like “if” statements. Anything that causes the CPU to execute a conditional jump. Branches are often slow, and so these techniques were developed as performance programming tools. Related questions (More answers below)
This is not branchless. It's if-less. But the ternary operator (?:) is a branching instruction. In case anybody was wondering, the x86 assembly is doing the multiply trick.
Branchless code can be faster in low-level languages because it prevents slowdowns caused by branch prediction misses. (Note: this means branchless code can be faster, but it will not necessarily be.) Python is a high-level language.
Apart from all the twiddling based branchless code (which won't cover everything, such as FP), you get instructions specifically meant for branchless code creation, these would be SETcc, FCMOVcc and CMOVcc under x86, which perform operations based on the condition flags from a comparison.
a really simple example would be (yes, the example is so simple that one would probably never write something like this, its only to demonstrated a point clearly):
bool CheckZero(int x)
{
    if(x == 0)
       return true;
    return false;
    //OR we can use: return (x == 0) ? true : false; 
}
now a simple x86 compiler might compile that down to:
    MOV EAX,[ESP + 4]
    TEXT EAX,EAX
    JE _TRUE
    XOR EAX,EAX
    RETN 4
_TRUE:
    MOV EAX,1
    RETN 4
an optimizing x86 compiler would take it down into the following branchless code (or similar):
MOV ECX,[ESP + 4]
XOR EAX,EAX
TEST ECX,ECX
SETE AL
RETN 4
A little more complex example can be found here.
However, this is something that a compiler will perform, and not some you should worry about yourself (at least not without analyzing the output of your compiler). but if the code is required to be branchless without fail, C++ doesn't provide enough control to do so, so you need to use (inline) assembly.
I had written ternary logic simulator not so long ago, and this question was viable to me, as it directly affects my interpretator execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.
In a binary-coded-ternary system one trit is packed in two bits. Most significant bit means negative and least significant means positive one. Case "11" should not occur, but it must be handled properly and threated as 0.
Consider inline int bct_decoder( unsigned bctData ) function, which should return our formatted trit as regular integer -1, 0 or 1; As i observed there are 4 approaches: i called them "cond", "mod", "math" and "lut"; Lets investigate them
First is based on jz|jnz and jl|jb conditional jumps, thus cond. Its performance is not good at all, because relies on a branch predictor. And even worse - it varies, because it is unknown if there will be one branch or two a priori. And here is an example:
inline int bct_decoder_cond( unsigned bctData ) {
   unsigned lsB = bctData & 1;
   unsigned msB = bctData >> 1;
   return
       ( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
       ( lsB > msB ) ? 1 : -1;
}
This is slowest version, it could involve 2 branches in worst case and this is something where binary logic fails. On my 3770k it prodices around 200MIPS on average on random data. (here and after - each test is average from 1000 tries on randomly filled 2mb dataset)
Next one relies on modulo operator and its speed is somewhere in between first and third, but is definetely faster - 600 MIPS:
inline int bct_decoder_mod( unsigned bctData ) {
    return ( int )( ( bctData + 1 ) % 3 ) - 1;
}
Next one is branchless approach, which involves only maths, thus math; it does not assume jump instrunctions at all:
inline int bct_decoder_math( unsigned bctData ) {
    return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}
This does what is should, and behaves really great. To compare, performance estimate is 1000 MIPS, and it is 5x faster than branched version. Probably branched version is slowed down due to lack of native 2-bit signed int support. But in my application it is quite good version in itself.
If this is not enough then we can go futher, having something special. Next is called lookup table approach:
inline int bct_decoder_lut( unsigned bctData ) {
    static const int decoderLUT[] = { 0, 1, -1, 0 };
    return decoderLUT[ bctData & 0x3 ];
}
In my case one trit occupied only 2 bits, so lut table was only 2b*4 = 8 bytes, and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS, here is where my measurement accuracy is going down. And that is is 1.5x speedup from fast math approach. That's because you just have precalculated result and single AND instruction. Sadly caches are small and (if your index length is greater than several bits) you simply cannot use it.
So i think i answered your question, on what what could branched/branchless code be like. Answer is much better and with detailed samples, real world application and real performance measurements results.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With