My colleague and I have been unable to explain why GCC, ICC, and Clang do not optimize this function
void f(std::uint64_t a, void * p) {
    std::uint8_t *x = reinterpret_cast<std::uint8_t *>(p);
    x[7] = a >> 56;
    x[6] = a >> 48;
    x[5] = a >> 40;
    x[4] = a >> 32;
    x[3] = a >> 24;
    x[2] = a >> 16;
    x[1] = a >> 8;
    x[0] = a;
}
into this single store:
mov     QWORD PTR [rsi], rdi
If we write f in terms of memcpy, the compilers emit exactly that mov. Why does the same not happen for the seemingly trivial sequence of byte writes?
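For comparison, here is a minimal sketch of what the memcpy formulation might look like next to the byte-wise version. The names f_bytes and f_memcpy are hypothetical and chosen only for this illustration. Note the assumption: the two functions store the same bytes only on a little-endian target such as x86-64, because memcpy copies the value in native byte order while the shifts always write the least significant byte first.

```cpp
#include <cstdint>
#include <cstring>

// Byte-wise version from the question: always writes the least
// significant byte of `a` at the lowest address (little-endian layout).
void f_bytes(std::uint64_t a, void *p) {
    std::uint8_t *x = reinterpret_cast<std::uint8_t *>(p);
    for (int i = 0; i < 8; ++i) {
        x[i] = static_cast<std::uint8_t>(a >> (8 * i));
    }
}

// memcpy version (hypothetical name): copies the object representation
// of `a` in native byte order, so it matches f_bytes only on
// little-endian targets.
void f_memcpy(std::uint64_t a, void *p) {
    std::memcpy(p, &a, sizeof a);
}
```

Because memcpy with a small constant size is recognized by the compilers, the second function is the one that reduces to a single 8-byte store.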
I'm not an expert, but GCC only gained the ability to merge adjacent stores of immediate constants in GCC 7:
If I had to guess, judging by the second bug, the wait might not be too long.
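To make the constant-store case concrete, here is a small sketch of the kind of code that store merging in GCC 7 targets: adjacent stores whose values are compile-time constants. The function name store_constants is hypothetical. Whether a given compiler version actually folds these four stores into one wider store is an assumption you should check in the generated assembly (GCC controls the pass with -fstore-merging, enabled at -O2).

```cpp
#include <cstdint>

// Four adjacent byte stores of immediate constants. A store-merging
// pass may combine them into a single 32-bit store; this is in contrast
// to the function in the question, where the stored values are derived
// from a runtime argument.
void store_constants(std::uint8_t *x) {
    x[0] = 1;
    x[1] = 2;
    x[2] = 3;
    x[3] = 4;
}
```

The difference from f is that here every stored value is known at compile time, so the compiler can compute the combined wide constant directly.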