When considering a conditional function call in a critical section of code I found that both gcc and clang will branch around the call. For example, for the following (admittedly trivial) code:
int32_t __attribute__((noinline)) negate(int32_t num) {
    return -num;
}
int32_t f(int32_t num) {
    int32_t x = num < 0 ? negate(num) : num;
    return 2*x + 1;
}
Both GCC and clang compile to essentially the following:
.global _f
_f:
    cmp     edi, 0
    jg      after_call
    call    _negate
after_call:
    lea     rax, [rax*2+1]
    ret
This got me thinking: what if x86 had a conditional call instruction like ARM? Imagine if there was such an instruction "ccallcc" with semantics like cmovcc. Then you could do something like:
.global _f
_f:
    cmp     edi, 0
    ccalll  _negate
    lea     rax, [rax*2+1]
    ret
Although we can't avoid the branch prediction, we do eliminate a branch. Namely, in the actual GCC/clang output, we are forced to branch regardless of whether num < 0 or not. And if num < 0 we have to branch twice. This seems wasteful.
Now such an instruction doesn't exist in amd64, but I devised a way to simulate such an instruction. I did this by breaking call func into its component parts: push rip (well technically [rip+label_after_call_instruction]) and then jmp func. We can make the jmp conditional, but there is no conditional push. We can simulate this by computing [rip+label_after_call_instruction] and writing it to the appropriate location on the stack, then conditionally updating rsp if we plan to call the function (which actually "pushes" the [rip+label_after_call_instruction]). It looks something like this:
.global _f
_f:
    cmp     edi, 0
    # ccalll _negate
    lea     rax, [rip+after_ccall]  # Compute return address
    mov     [rsp-8], rax            # Prepare to "push" return address
    lea     rax, [rsp-8]            # Compute rsp (after push)
    cmovl   rsp, rax                # Conditionally push (by actually changing rsp)
    jl      _negate                 # "Conditional call"
after_ccall:
    lea     rax, [rax*2+1]
    ret
There are a few potential downsides to this approach:
leas and mov even if the call isn't made (but my understanding is this doesn't matter as cmovcc takes the same number of cycles as mov, for example)To examine the properties of each of these approaches, I ran the critical sections through iaca. If you have it installed (and you clone my benchmark gist below), you can run make iaca to see for yourself. Pass IACAFLAGS='-arch=...' to specify a different arch.
The output for the branch over approach:
Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File -  ./branch_over_call_iaca.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 0.82 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  36
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  0.5     0.0  |  0.0  |  0.3     0.0  |  0.3     0.0  |  1.0  |  0.0  |  0.5  |  0.3  |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      | 0.5         |      |             |             |      |      | 0.5  |      | jnle 0x6
|   4^#    |             |      | 0.3         | 0.3         | 1.0  |      |      | 0.3  | call 0x5
Total Num Of Uops: 5
And the output for the conditional call approach:
Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File -  ./conditional_call_iaca.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.94 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  35
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  1.0  |  0.5     0.0  |  0.5     0.0  |  1.0  |  1.0  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             | 1.0  |             |             |      |      |      |      | lea rax, ptr [rip]
|   2^     |             |      | 0.5         | 0.5         | 1.0  |      |      |      | mov qword ptr [rsp-0x8], rax
|   1      |             |      |             |             |      | 1.0  |      |      | lea rax, ptr [rsp-0x8]
|   1      | 1.0         |      |             |             |      |      |      |      | cmovl rsp, rax
|   1      |             |      |             |             |      |      | 1.0  |      | jl 0x6
Total Num Of Uops: 6
I looks like the conditional call approach appears to use more of the hardware. But I found it interesting that the conditional approach only has 1 more uop (the branch over approach had 5 uops). I guess this makes sense given that under the hood the call turns into a push and jmp (and the push turns into rsp math and a memory mov). This would suggest to me that the conditional call approach is approximately equivalent (although maybe my simplistic analysis is flawed here?).
At the least, my overarching suspicion that was by introducing several instructions between the cmp and jl, I'd make it possible that the result of the cmp would be available before the jl can be speculatively executed (thus preventing the branch prediction at all). Although maybe the pipeline is longer than this? This treads into areas with which (despite having read and retained a medium-depth understanding of Agner Fog's optimization manuals) I am not very familiar.
My hypothesis is that for a uniform distribution of (negative and positive) nums (where branch prediction won't be able to predict the branch around the call) that my "conditional call" approach will outperform branching around the call.
I wrote a harness to benchmark the performance of these two approaches. You can git clone https://gist.github.com/baileyparker/8a13c22d0e26396921f501fe87f166a9 and make to run the benchmarks on your machine. 
Here's the runtime of 100 iterations of each approach on an array of 1,048,576 numbers (uniformly distributed between int32_t min and max).
|                    CPU                    | Conditional Call | Branch Over |
|-------------------------------------------|-----------------:|------------:|
| Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz |       10.9872 ms |   8.4602 ms |
| Intel(R) Xeon(R) CPU E3-1240 v6 @ 3.70GHz |        8.8132 ms |   7.0704 ms |
These results are consistent across runs and although magnified by increasing the array size (or number of iterations), branching over always wins.
I also tried reordering the conditional call steps (computing and conditionaly updating rsp first, then writing to the stack) but this performed similarly.
What hardware detail that am I missing (or misunderstanding) explains this? From my calculations the extra instructions add somewhere around 6-7 cycles, but a branch mispredict costs 15. So, on average half the numbers are predicted wrong so each iteration costs 15/2 cycles (for the branch over approach) and always 6-7 cycles for the conditional call. The uops from iaca suggest the approaches are even closer in this regard. So, shouldn't the performance be closer? Is my example code too contrived/short? Is my benchmarking technique not appropriate for this kind of low level critical section testing? Is there a way to reorder/change the conditional call to make it more performant (better or comparable to the branch over approach, maybe)?
tl;dr Why does my conditional call code (4th code snippet) perform worse than what gcc/clang produces (conditional jump over the call) (2nd code snippet) (for the code in the 1st snippet) on my benchmark?
You can exactly determine why the conditional_call approach is slower than branch_over_call. You've done your experiments on two KBL processors, but the blog post you were referred to doesn't discuss how the RAS works on KBL. So the first step of the analysis is to determine whether the ret in the negate function is being mispredicted or not (as what would happen on earlier microarchitectures). The second step is to determine what is the cost of mispredicting that ret instruction on total execution time. The closest thing that I have to KBL is CFL and my numbers turned out to be close to yours. The only relevant difference between the two is that LSD is enabled in CFL but disabled in KBL. However, the LSD is irrelevant in this case because of the call instruction in the loop which prevents the LSD from detecting any loop. You can also easily repeat the same analysis on KBL.
There are several ways to analyze the behavior of branch instructions. But in this particular case, the code is simple enough for the event counting method to reveal all the information that we need about every static branch instruction.
The BR_INST_RETIRED_* performance events can be used count the total number of dynamic branch instructions retired and the total number of specific types of retired branch instructions including conditional, calls, and returns. The BR_MISP_RETIRED_* events can be used to count total mispredictions, total conditional mispredictions, and total call mispredictions.
The complete control-glow graph of conditional_call looks like this:
           total   misp
call         1      0
    jl       1     0.5
       ret  0.5     1
    ret      1      0
jne          1      0
The first call instruction calls the conditional_call function, which contains jl and ret. The jl instruction conditionally jumps to the negate function, which contains ret. The jne instruction is used to for the loop. The numbers shown in the first and second column are normalized by the total number of iterations and total number of dynamic instructions, respectively. We know from the static structure of the program that call, jl, conditional_call's ret, and jne are each executed once in every iteration. The inner most ret is only executed when the jl branch is taken. Using the performance events, we can count the total number of executed return instructions and subtract from it the total number of iterations to get the number of times the inner most ret is executed. Because the input is randomized according to the uniform distribution, it shouldn't be surprising that the inner most ret is executed half of the time.
The call instruction is never mispredicted. The jne instruction is also never mispredicted except for the last execution of the instructions (where it exits the loop). Therefore, we can attribute the total number of conditional mispredictions to the jl instruction. That can be subtracted from the total number of mispredictions to get the number of return mispredictions which can be attributed to either or both of the return instructions. The second ret may mispredict when the misprediction of the first ret clobbers or misaligns the RAS. One way to determine whether the second ret is ever mispredicted is by using precise sampling of BR_MISP_RETIRED.ALL_BRANCHES. Another way is by using the method described in the blog post you cited. Indeed, only the inner most ret is mispredicted. The fact that jl is mispredicted half of the time suggests that the instruction is either being predicted always taken or always not taken.
The complete control-glow graph of branch_over_call looks like this:
           total   misp
call         1      0
    jg       1     0.5
    call    0.5     0
        ret 0.5     0
    ret      1      0
jne          1      0
The only instruction that is mispredicted is jg, which is mispredicted half of the time.
To measure the average cost of a single ret misprediction in the conditional_call approach, the ret instruction can be replaced with a lea/jmp sequence so that BTB rather than the RAS is used for making predictions. With this change, the only instruction that is mispredicted is jl. The difference in execution time can be considered as an estimate for the total cost of ret mispredictions. On my CFL processor, this is about 11.3 cycles per ret misprediction. In addition, conditional_call has become about 3% faster than branch_over_call. Your numbers on KBL indicate that the average cost of a ret misprediction is about 13 cycles. I'm not sure what the reason for this difference is. It may not be microarchitectural. I've used gcc 7.3 but you used gcc 8, so perhaps there are some differences in the code or the alignments of different pieces of code that are causing the discrepancy between our results.
As @fuz pointed out in the comments, the performance issue is almost certainly due to the Return Address Stack (RAS), which is a specialized branch predictor for function returns.
As an advantage of having separate call and ret instructions from jmp and manual stack modification, CPUs are clued in to the intent of the running code. In particular, when we call a function it is probably going to ret and when it does we are going to jump back to the rip pushed before the call. In other words, calls are usually paired with a ret. The CPU leverages this by keeping a fixed-length stack of just return addresses called the return address stack (RAS). call instructions in addition to pushing the return address to the actual in-memory stack will additionally push it to the RAS. This way, when a ret is encountered the CPU can pop off of the RAS (which is much faster than the memory access for the actual stack) and speculatively execute the return. If it turns out that the address popped from the RAS was the one popped from the stack, the CPU continues with no penalty. However, if the RAS predicted the wrong return address, a pipeline flush occurs, which is costly.
My original intuition was that the conditional instructions would be better because they would give time for the result of the comparison to arrive before the jump. However, whatever benefit that may have provided, having an unbalanced jmp/ret (my conditional call replaced call with jmp, but the called function still used a ret) caused the RAS to likely always predict the wrong return address (and thus my approach, despite originally trying to avoid this, causes more pipeline stalls). The speedup from the RAS is more significant than my "optimization" so the branching approach outperformed the conditional call approach.
According to some empirical results mismatching call and ret (in particular using a jmp + ret) take 5-6 times more cycles than properly pairing call and ret. Some napkin math would suggest that a penalty of +21 cycles at 3.1GHz for 1,048,576 calls add about 7.1ms to the total runtime. The slowdown observed was less than that. This is likely a combination of the conditional instructions delaying the jump until the condition was ready and the fact that the jumps were oscillating between fixed locations in memory (which the other branch predictors likely became good at predicting).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With