I'm compiling this C code:
int mode; // use aa if true, else bb int aa[2]; int bb[2];  inline int auto0() { return mode ? aa[0] : bb[0]; } inline int auto1() { return mode ? aa[1] : bb[1]; }  int slow() { return auto1() - auto0(); } int fast() { return mode ? aa[1] - aa[0] : bb[1] - bb[0]; } Both slow() and fast() functions are meant to do the same thing, though fast() does it with one branch statement instead of two.  I wanted to check if GCC would collapse the two branches into one.  I've tried this with GCC 4.4 and 4.7, with various levels of optimization such as -O2, -O3, -Os, and -Ofast.  It always gives the same strange results:
slow():
        movl    mode(%rip), %ecx         testl   %ecx, %ecx         je      .L10          movl    aa+4(%rip), %eax         movl    aa(%rip), %edx         subl    %edx, %eax         ret .L10:         movl    bb+4(%rip), %eax         movl    bb(%rip), %edx         subl    %edx, %eax         ret fast():
        movl    mode(%rip), %esi         testl   %esi, %esi         jne     .L18          movl    bb+4(%rip), %eax         subl    bb(%rip), %eax         ret .L18:         movl    aa+4(%rip), %eax         subl    aa(%rip), %eax         ret Indeed, only one branch is generated in each function.  However, slow() seems to be inferior in a surprising way: it uses one extra load in each branch, for aa[0] and bb[0].  The fast() code uses them straight from memory in the subls without loading them into a register first.  So slow() uses one extra register and one extra instruction per call.
A simple micro-benchmark shows that calling fast() one billion times takes 0.7 seconds, vs. 1.1 seconds for slow().  I'm using a Xeon E5-2690 at 2.9 GHz.
Why should this be? Can you tweak my source code somehow so that GCC does a better job?
Edit: here are the results with clang 4.2 on Mac OS:
slow():
        movq    _aa@GOTPCREL(%rip), %rax   ; rax = aa (both ints at once)         movq    _bb@GOTPCREL(%rip), %rcx   ; rcx = bb         movq    _mode@GOTPCREL(%rip), %rdx ; rdx = mode         cmpl    $0, (%rdx)                 ; mode == 0 ?         leaq    4(%rcx), %rdx              ; rdx = bb[1]         cmovneq %rax, %rcx                 ; if (mode != 0) rcx = aa         leaq    4(%rax), %rax              ; rax = aa[1]         cmoveq  %rdx, %rax                 ; if (mode == 0) rax = bb         movl    (%rax), %eax               ; eax = xx[1]         subl    (%rcx), %eax               ; eax -= xx[0] fast():
        movq    _mode@GOTPCREL(%rip), %rax ; rax = mode         cmpl    $0, (%rax)                 ; mode == 0 ?         je      LBB1_2                     ; if (mode != 0) {         movq    _aa@GOTPCREL(%rip), %rcx   ;   rcx = aa         jmp     LBB1_3                     ; } else { LBB1_2:                                    ; // (mode == 0)         movq    _bb@GOTPCREL(%rip), %rcx   ;   rcx = bb LBB1_3:                                    ; }         movl    4(%rcx), %eax              ; eax = xx[1]         subl    (%rcx), %eax               ; eax -= xx[0] Interesting: clang generates branchless conditionals for slow() but one branch for fast()!  On the other hand, slow() does three loads (two of which are speculative, one will be unnecessary) vs. two for fast().  The fast() implementation is more "obvious," and as with GCC it's shorter and uses one less register.
GCC 4.7 on Mac OS generally suffers the same issue as on Linux.  Yet it uses the same "load 8 bytes then twice extract 4 bytes" pattern as Clang on Mac OS.  That's sort of interesting, but not very relevant, as the original issue of emitting subl with two registers rather than one memory and one register is the same on either platform for GCC.
GCC has a range of optimization levels, plus individual options to enable or disable particular optimizations. The overall compiler optimization level is controlled by the command line option -On, where n is the required optimization level, as follows: -O0 . (default).
The compiler optimizes to reduce the size of the binary instead of execution speed. If you do not specify an optimization option, gcc attempts to reduce the compilation time and to make debugging always yield the result expected from reading the source code.
Turning on optimization flags makes the compiler attempt to improve the performance and/or code size at the expense of compilation time and possibly the ability to debug the program. The compiler performs optimization based on the knowledge it has of the program.
Compiler specific pragma gcc provides pragma GCC as a way to control temporarily the compiler behavior. By using pragma GCC optimize("O0") , the optimization level can be set to zero, which means absolutely no optimize for gcc.
The reason is that in the initial intermediate code, emitted for slow(), the memory load and the subtraction are in different basic blocks:
slow () {   int D.1405;   int mode.3;   int D.1402;   int D.1379;    # BLOCK 2 freq:10000   mode.3_5 = mode;   if (mode.3_5 != 0)     goto <bb 3>;   else     goto <bb 4>;    # BLOCK 3 freq:5000   D.1402_6 = aa[1];   D.1405_10 = aa[0];   goto <bb 5>;    # BLOCK 4 freq:5000   D.1402_7 = bb[1];   D.1405_11 = bb[0];    # BLOCK 5 freq:10000   D.1379_3 = D.1402_17 - D.1405_12;   return D.1379_3; } whereas in fast() they are in the same basic block:
fast () {   int D.1377;   int D.1376;   int D.1374;   int D.1373;   int mode.1;   int D.1368;    # BLOCK 2 freq:10000   mode.1_2 = mode;   if (mode.1_2 != 0)     goto <bb 3>;   else     goto <bb 4>;    # BLOCK 3 freq:3900   D.1373_3 = aa[1];   D.1374_4 = aa[0];   D.1368_5 = D.1373_3 - D.1374_4;   goto <bb 5>;    # BLOCK 4 freq:6100   D.1376_6 = bb[1];   D.1377_7 = bb[0];   D.1368_8 = D.1376_6 - D.1377_7;    # BLOCK 5 freq:10000   return D.1368_1; } GCC relies on instruction combining pass to handle cases like this (i.e. apparently not on the peephole optimization pass) and combining works on the scope of a basic block. That's why the subtraction and load are combined in a single insn in fast() and they aren't even considered for combining in slow().
Later, in the basic block reordering pass, the subtraction in slow() is duplicated and moved into the basic blocks, which contain the loads. Now there's opportunity for the combiner to, well, combine the load and the subtraction, but unfortunately, the combiner pass is not run again (and perhaps it cannot be run that late in the compilation process with hard registers already allocated and stuff).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With