While trying to get usable 128-bit operations in GCC on amd64, I implemented some inline functions, such as add_128_128_128. I wanted to let the compiler decide which registers to use as inputs and outputs, for maximum flexibility, so I used multiple-alternative constraints.
inline __uint128_t add_128_128_128(__uint128_t a, __uint128_t b) {
        uint64_t a_hi = a >> 64;
        uint64_t a_lo = a;
        uint64_t b_hi = b >> 64;
        uint64_t b_lo = b;
        uint64_t retval_hi;
        uint64_t retval_lo;
        asm (
                "\n"
                "       add     %2, %0\n"
                "       adc     %3, %1\n"
                : "=r,r,r,r" (retval_lo)
                , "=r,r,r,r" (retval_hi)
                : "r,0,r,0" (a_lo)
                , "0,r,0,r" (b_lo)
                , "r,1,1,r" (a_hi)
                , "1,r,r,1" (b_hi)
        );
        return ((__uint128_t)retval_hi) << 64 | retval_lo;
}
Now, the generated assembler output is:
_Z11add_128_128oo:
        movq    %rdx, %rax
        movq    %rcx, %rdx
        add     %rdi, %rax
        adc     %rax, %rdx
        ret
What puzzles me is how to get the adc instruction right. My tentative conclusion is that even operands with matching constraints get their own operand numbers, which would explain why %3 resolves to the same register as %0 (%rax). So, is there a way to tell GCC to count only the "r" constraints? (I know I can get this inline assembly to work by giving up on multiple-alternative constraints.)
BTW: Is there any useful documentation of GCC's inline assembly? The official manual has no examples for the interesting parts, so it is not much help here, and searching Google turned up nothing better. All the how-tos cover only the basics and omit more advanced topics such as multiple-alternative constraints.
The first thing that comes to mind is:
inline __uint128_t add_128_128_128(__uint128_t a, __uint128_t b) {
    asm("add %1, %%rax\n\t"
        "adc %2, %%rdx"
        : "+A"(a)
        : "r"((uint64_t)b), "r"((uint64_t)(b >> 64)) /* %1 = low half of b, %2 = high half */
        : "cc");
    return a;
}
That works because GCC can treat RDX:RAX as a double-width register pair with the "A" constraint. It is suboptimal, though, particularly for inlining: it doesn't take into account that the two operands are interchangeable, and always returning the result in RDX:RAX restricts the register choices.
To get that commutativity in, you can use the % constraint modifier:
inline __uint128_t add_128_128_128(__uint128_t a, __uint128_t b) {
    uint64_t a_lo = a, a_hi = a >> 64, b_lo = b, b_hi = b >> 64;
    uint64_t r_lo, r_hi;
    asm("add %3, %0\n\t"
        "adc %5, %1"
        : "=r"(r_lo), "=r"(r_hi)
        : "%0"(a_lo), "r"(b_lo), "%1"(a_hi), "r"(b_hi)
        : "cc");
    return ((__uint128_t)r_hi) << 64 | r_lo;
}
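The % tells GCC that this operand and the following one are interchangeable. (The GCC manual warns that it can only handle one commutative pair per asm, so using % on two pairs, as here, may not work with every compiler version.)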
This creates the following code (non-inlined):
Disassembly of section .text:

0000000000000000 <add_128_128_128>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   48 01 d0                add    %rdx,%rax
   6:   48 11 ce                adc    %rcx,%rsi
   9:   48 89 f2                mov    %rsi,%rdx
   c:   c3                      retq
which looks pretty much like what you wanted.
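The operand-numbering confusion from the question can also be sidestepped with symbolic operand names. The following is only a minimal sketch, not part of the answer above: the names add_128_named and check_against_builtin are hypothetical, and it assumes GCC or Clang targeting x86-64. It uses read-write "+r" operands instead of matching constraints, marks the low half as early-clobber because it is written before the high input is read, and compares the result with the compiler's own __uint128_t addition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant: the same add/adc pair, but with symbolic operand
   names so the template does not depend on operand positions. */
static inline __uint128_t add_128_named(__uint128_t a, __uint128_t b) {
    uint64_t lo = (uint64_t)a;          /* low half, updated in place  */
    uint64_t hi = (uint64_t)(a >> 64);  /* high half, updated in place */
    __asm__("add %[b_lo], %[lo]\n\t"    /* lo += b_lo, sets carry      */
            "adc %[b_hi], %[hi]"        /* hi += b_hi + carry          */
            : [lo] "+&r"(lo),           /* written before b_hi is read */
              [hi] "+r"(hi)
            : [b_lo] "r"((uint64_t)b),
              [b_hi] "r"((uint64_t)(b >> 64))
            : "cc");
    return ((__uint128_t)hi << 64) | lo;
}

/* Self-check against the compiler's built-in 128-bit addition. */
static void check_against_builtin(void) {
    __uint128_t max = ~(__uint128_t)0;
    assert(add_128_named(1, 2) == (__uint128_t)1 + 2);
    /* carry out of the low half into the high half */
    assert(add_128_named(UINT64_MAX, 1) == (__uint128_t)UINT64_MAX + 1);
    /* wrap-around at 2^128 */
    assert(add_128_named(max, 1) == max + 1);
}
```

Because the names appear directly in the template, adding or reordering operands cannot silently change which register an instruction refers to, which is the kind of mix-up that produced the wrong adc in the question.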