C++20 std::atomic- std::atomic.specializations

Question

C++20 includes specializations for atomic<float> and atomic<double>. Can anyone here explain for what practical purpose this should be good for? The only purpose I can imagine is when I have a thread that changes an atomic double or float asynchronously at random points and other threads read this values asynchronously (but a volatile double or float should in fact do the same on most platforms). But the need for this should be extremely rare. I think this rare case couldn't justify an inclusion into the C++20 standard.

Peter Cordes · Accepted Answer

atomic<float> and atomic<double> have existed since C++11. The atomic<T> template works for arbitrary trivially-copyable T. Everything you could hack up with legacy pre-C++11 use of volatile for shared variables can be done with C++11 atomic<double> with std::memory_order_relaxed.

What doesn't exist until C++20 are atomic RMW operations like x.fetch_add(3.14); or for short x += 3.14. (Why isn't atomic double fully implemented wonders why not). Those member functions were only available in the atomic integer specializations, so you could only load, store, exchange, and CAS on float and double, like for arbitrary T like class types.

See Atomic double floating point or SSE/AVX vector load/store on x86_64 for details on how to roll your own with compare_exchange_weak, and how that (and pure load, pure store, and exchange) compiles in practice with GCC and clang for x86. (Not always optimal, gcc bouncing to integer regs unnecessarily.) Also for details on lack of atomic<__m128i> load/store because vendors won't publish real guarantees to let us take advantage (in a future-proof way) of what current HW does.

These new specializations provide maybe some efficiency (on non-x86) and convenience with fetch_add and fetch_sub (and the equivalent += and -= overloads). Only those 2 operations that are supported, not fetch_mul or anything else. See the current draft of 31.8.3 Specializations for floating-point types, and cppreference std::atomic

It's not like the committee went out of their way to introduce new FP-relevant atomic RMW member functions fetch_mul, min, max, or even absolute value or negation, which is ironically easier in asm, just bitwise AND or XOR to clear or flip the sign bit and can be done with x86 lock and if the old value isn't needed. Actually since carry-out from the MSB doesn't matter, 64-bit lock xadd can implement fetch_xor with 1ULL<<63. Assuming of course IEEE754 style sign/magnitude FP. Similarly easy on LL/SC machines that can do 4-byte or 8-byte fetch_xor, and they can easily keep the old value in a register.

So the one thing that could be done significantly more efficiently in x86 asm than in portable C++ without union hacks (atomic bitwise ops on FP bit patterns) still isn't exposed by ISO C++.

It makes sense that the integer specializations don't have fetch_mul: integer add is much cheaper, typically 1 cycle latency, the same level of complexity as atomic CAS. But for floating point, multiply and add are both quite complex and typically have similar latency. Moreover, if atomic RMW fetch_add is useful for anything, I'd assume fetch_mul would be, too. Again unlike integer where lockless algorithms commonly add/sub but very rarely need to build an atomic shift or mul out of a CAS. x86 doesn't have memory-destination multiply so has no direct HW support for lock imul.

It seems like this is more a matter of bringing atomic<double> up to the level you might naively expect (supporting .fetch_add and sub like integers), not of providing a serious library of atomic RMW FP operations. Perhaps that makes it easier to write templates that don't have to check for integral, just numeric, types?

Can anyone here explain for what practical purpose this should be good for?

For pure store / pure load, maybe some global scale factor that you want to be able to publish to all threads with a simple store? And readers load it before every work unit or something. Or just as part of a lockless queue or stack of double.

It's not a coincidence that it took until C++20 for anyone to say "we should provide fetch_add for atomic<double> in case anyone wants it."

Plausible use-case: to manually multi-thread the sum of an array (instead of using #pragma omp parallel for simd reduction(+:my_sum_variable) or a standard <algorithm> like std::accumulate with a C++17 parallel execution policy).

The parent thread might start with atomic<double> total = 0; and pass it by reference to each thread. Then threads do *totalptr += sum_region(array+TID*size, size) to accumulate the results. Instead of having a separate output variable for each thread and collecting the results in one caller. It's not bad for contention unless all threads finish at nearly the same time. (Which is not unlikely, but it's at least a plausible scenario.)

If you just want separate load and separate store atomicity like you're hoping for from volatile, you already have that with C++11.

Don't use `volatile` for threading: use `atomic<T>` with `mo_relaxed`

See When to use volatile with multi threading? for details on mo_relaxed atomic vs. legacy volatile for multithreading. volatile data races are UB, but it does work in practice as part of roll-your-own atomics on compilers that support it, with inline asm needed if you want any ordering wrt. other operations, or if you want RMW atomicity instead of separate load / ALU / separate store. All mainstream CPUs have coherent cache/shared memory. But with C++11 there's no reason to do that: std::atomic<> obsoleted hand-rolled volatile shared variables.

At least in theory. In practice some compilers (like GCC) still have missed-optimizations for atomic<double> / atomic<float> even for just simple load and store. (And the C++20 new overloads aren't implemented yet on Godbolt). atomic<integer> is fine though, and does optimize as well as volatile or plain integer + memory barriers.

In some ABIs (like 32-bit x86), alignof(double) is only 4. Compilers normally align it by 8 but inside structs they have to follow the ABI's struct packing rules so an under-aligned volatile double is possible. Tearing will be possible in practice if it splits a cache-line boundary, or on some AMD an 8-byte boundary. atomic<double> instead of volatile can plausibly matter for correctness on some real platforms, even when you don't need atomic RMW. e.g. this G++ bug which was fixed by increasing using alignas() in the std::atomic<> implementation for objects small enough to be lock_free.

(And of course there are platforms where an 8-byte store isn't naturally atomic so to avoid tearing you need a fallback to a lock. If you care about such platforms, a publish-occasionally model should use a hand-rolled SeqLock or atomic<float> if atomic<double> isn't always_lock_free.)

You can get the same efficient code-gen (without extra barrier instructions) from atomic<T> using mo_relaxed as you can with volatile. Unfortunately in practice, not all compilers have efficient atomic<double>. For example, GCC9 for x86-64 copies from XMM to general-purpose integer registers.

#include <atomic>

volatile double vx;
std::atomic<double> ax;
double px; // plain x

void FP_non_RMW_increment() {
    px += 1.0;
    vx += 1.0;     // equivalent to vx = vx + 1.0
    ax.store( ax.load(std::memory_order_relaxed) + 1.0, std::memory_order_relaxed);
}

#if __cplusplus > 201703L    // is there a number for C++2a yet?
// C++20 only, not yet supported by libstdc++ or libc++
void atomic_RMW_increment() {
    ax += 1.0;           // seq_cst
    ax.fetch_add(1.0, std::memory_order_relaxed);   
}
#endif

Godbolt GCC9 for x86-64, gcc -O3. (Also included an integer version)

FP_non_RMW_increment():
        movsd   xmm0, QWORD PTR .LC0[rip]   # xmm0 = double 1.0 

        movsd   xmm1, QWORD PTR px[rip]        # load
        addsd   xmm1, xmm0                     # plain x += 1.0
        movsd   QWORD PTR px[rip], xmm1        # store

        movsd   xmm1, QWORD PTR vx[rip]
        addsd   xmm1, xmm0                     # volatile x += 1.0
        movsd   QWORD PTR vx[rip], xmm1

        mov     rax, QWORD PTR ax[rip]      # integer load
        movq    xmm2, rax                   # copy to FP register
        addsd   xmm0, xmm2                     # atomic x += 1.0
        movq    rax, xmm0                   # copy back to integer
        mov     QWORD PTR ax[rip], rax      # store

        ret

clang compiles it efficiently, with the same move-scalar-double load and store for ax as for vx and px.

Fun fact: C++20 apparently deprecates vx += 1.0. Perhaps this is to help avoid confusion between separate load and store like vx = vx + 1.0 vs. atomic RMW? To make it clear there are 2 separate volatile accesses in that statement?

<source>: In function 'void FP_non_RMW_increment()':
<source>:9:8: warning: compound assignment with 'volatile'-qualified left operand is deprecated [-Wvolatile]
    9 |     vx += 1.0;     // equivalent to vx = vx + 1.0
      |     ~~~^~~~~~

Note that x = x + 1 is not the same thing as x += 1 for atomic<T> x: the former loads into a temporary, adds, then stores. (With sequential-consistency for both).

C++20 std::atomic<float>- std::atomic<double>.specializations

Tags:

c++

floating-point

multithreading

memory-model

stdatomic

1 Answers

Don't use `volatile` for threading: use `atomic<T>` with `mo_relaxed`

Peter Cordes

Recent Activity

Donate For Us

C++20 std::atomic<float>- std::atomic<double>.specializations

Tags:

c++

floating-point

multithreading

memory-model

stdatomic

1 Answers

Don't use volatile for threading: use atomic<T> with mo_relaxed

Peter Cordes

Related questions

Recent Activity

Donate For Us

Don't use `volatile` for threading: use `atomic<T>` with `mo_relaxed`