
What is the right std::atomic memory order for dynamic scheduling?

I am wondering what the correct memory order is for the classic "atomic counter dynamic scheduling" idiom. That is:

  1. use fetch-add to get the index i of the next element to process
  2. if i is past the end of the array, terminate
  3. process element i thread-safely, since no other thread can have the same i
  4. go to 1.

For example:

#include <atomic>

std::atomic_int counter = 0;

void foo(int *data, int size) {
    // we could also write counter++
    for (int i; (i = counter.fetch_add(1, std::memory_order::seq_cst)) < size;) {
        data[i] *= 2;
    }
}
// driver code
#include <thread>
#include <numeric>
#include <cassert>

int main() {
    int data[1'000'000];
    std::iota(std::begin(data), std::end(data), 0);

    std::thread other{foo, data, std::size(data)};
    foo(data, std::size(data));
    other.join();

    for (int i = 0; i < std::size(data); ++i) {
        assert(data[i] == i * 2);
    }
}

This code works, and it should be safe, because processing an element cannot be reordered before or after getting the next index, and all fetch-adds are observed in a consistent total order by all threads. These requirements seem overly strict to me, and I believe we could use a more relaxed ordering.

std::memory_order::relaxed and std::memory_order::acquire are too relaxed, I believe, because all threads initially observe counter == 0, and we have to ensure that data[0] *= 2 is not moved before the first fetch_add. These two memory orders would allow that.

The answer has to be one of:

  • std::memory_order::seq_cst
  • std::memory_order::acq_rel
  • std::memory_order::release

Which one is the correct order in this situation?

asked Oct 16 '25 by Jan Schultke


2 Answers

relaxed is sufficient.

Every counter.fetch_add(1, relaxed) will return a different value, and every value from 0 upwards will be returned once. Atomicity alone guarantees this since all operations are on the same atomic object.

There's no guarantee about which thread will get which i value, and the operations on data[i] aren't ordered with respect to each other, but that's fine because exactly one thread will access each data[i], and thread.join() afterwards creates the synchronization between the writer and the reader that the final checks need.
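For illustration, here is the questioner's loop with the only change being the memory order (a sketch; everything else stays as in the question):

#include <atomic>

std::atomic_int counter = 0;

void foo(int *data, int size) {
    // relaxed is enough here: atomicity alone guarantees each i is handed out exactly once,
    // and the caller's thread::join() synchronizes the writes with the later reads
    for (int i; (i = counter.fetch_add(1, std::memory_order::relaxed)) < size;) {
        data[i] *= 2;
    }
}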


The relevant wording in the C++ standard is the part saying an atomic RMW must see the "latest value" and that a consistent modification-order exists for each atomic object separately. And that within a single thread, operations on the same atomic object respect the sequenced-before program order. (i.e. one fetch_add can't "reorder" with another fetch_add.)

In hardware, the relevant behaviours are atomicity of the RMW, and cache coherency which guarantees that all other copies of the cache line must be invalidated before an atomic RMW can do its load+add+store. (That's what makes atomic RMWs on the same address serializable.)


If not relying on .join() for synchronization with the reader

Otherwise you'd need a separate atomic<bool> done flag, or release RMWs if some other thread was just looking at counter. The waiting thread would have to wait for counter == size + nthreads to be sure that every thread has done a release operation sequenced after its last data[i] *= 2. (And these form a release sequence, so the reader would synchronize-with all the threads.)

Each thread will stop doing increments after it sees fetch_add() < size evaluate to false. The first thread (in modification order of counter) to see that happen will have loaded i=size and stored i=size+1. The second thread to leave the loop will have loaded size+1 and stored size+2. So with nthreads=1 or 2, counter==size+nthreads after they've done their final store as part of the RMW.

So a load that sees counter == size+nthreads can be sure that all threads have done a fetch_add after their last data[i] *= 2;. If those were release fetch_add and this was an acquire load, the stores to the data[0..size-1] objects are guaranteed to have happened-before this load, so later code in this thread can see the modified values.

(You can only check that all threads are finished; before that it's not guaranteed that data[0..counter-1] are finished writing or anything like that. Even with seq_cst increments, you could have a thread claim an i value but stall or get scheduled out for a long time before accessing that data[i]. So any array element could still be being modified.)
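A sketch of what that could look like, assuming the number of threads is known and a hypothetical reader() runs in some thread that doesn't join() the workers (the names here are illustrative, not from the question):

#include <atomic>

std::atomic_int counter = 0;

void worker(int *data, int size) {
    // release on the RMW: each thread's final (failing) fetch_add is sequenced after
    // its last data[i] *= 2, so that store releases all of this thread's writes
    for (int i; (i = counter.fetch_add(1, std::memory_order::release)) < size;) {
        data[i] *= 2;
    }
}

void reader(int *data, int size, int nthreads) {
    // each thread's final fetch_add stores a value past size,
    // so counter == size + nthreads means every thread has left the loop
    while (counter.load(std::memory_order::acquire) < size + nthreads) {
        // spin; a real program might use counter.wait() (C++20) or a condition variable
    }
    // the acquire load synchronizes-with the workers' release RMWs (via the release
    // sequence), so all writes to data[0..size-1] are visible here
}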


Could a compiler batch this for you into fetch_add(128, relaxed)

... and loop over the i values? Hypothetically yes, but practically no. See Why don't compilers merge redundant std::atomic writes? - compilers don't optimize atomics at all, except for reordering non-atomic operations across them when allowed.

Hypothetically, a compiler could even statically decide to compile it so it always runs with one thread doing all the increments and the other thread doing none (except the final one to exit the loop), e.g. by doing a single += 1'000'000 first and then looping over those i values. This isn't a correctness problem, and it would be pretty insane for a compiler to actually do; it's not something you need to rule out.

Even if you used seq_cst, a sufficiently-smart Deathstation 9000 compiler could analyze the whole program and see that nothing synchronizes-with the values of counter, so it could still make the same transformation. If anything observed the final value of counter, it would also have to make sure that value was size + nthreads, so it couldn't just do fetch_add(1'000'000) in both threads.


You want to manually use batches larger than 1

A batch size like 4096 bytes or ints could make sense. Claiming 4096 elements to work on and looping over i + 0..4095 allows that inner loop to use SIMD, and gives each thread plenty of work in flight before the next slow atomic RMW that has to wait for the counter cache line to bounce between threads.

Having only one thread access a cache line containing a range of data[] values avoids bouncing those cache lines around. So one 64-byte cache line is the minimum batch size you might want if data[] is aligned; otherwise you need larger batches so splits only happen at the ends of batches. But you want larger batches anyway, since 64 bytes is only a couple or a few SIMD loads/stores.

If data[] is page-aligned, then page-sized blocks mean only one thread needs a dTLB miss for that page, and hardware prefetching on modern x86 CPUs stops at page boundaries (since contiguous virtual pages might not be physically contiguous.) So it's a natural place to end a batch, with HW prefetch not creating contention for the thread working on the next chunk.

Especially on x86 where every atomic RMW is effectively seq_cst and a full memory barrier in asm, you want to make your batch size plenty large to amortize those costs. As large as possible without leaving a lot of leftover work at the end if one thread stalls or something.
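A sketch of manual batching along those lines (BATCH and the std::min clamp are illustrative choices for this example, not something from the question):

#include <atomic>
#include <algorithm>

std::atomic_int counter = 0;

void foo_batched(int *data, int size) {
    constexpr int BATCH = 4096;   // e.g. 16 KiB of int; tune for the workload
    for (int start; (start = counter.fetch_add(BATCH, std::memory_order::relaxed)) < size;) {
        int end = std::min(start + BATCH, size);   // the last batch may be partial
        for (int i = start; i < end; ++i) {        // simple inner loop the compiler can vectorize
            data[i] *= 2;
        }
    }
}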

answered Oct 18 '25 by Peter Cordes


relaxed is ok, because regardless of memory order, each separate atomic variable is always consistent across all threads.

Memory orders are only necessary when you have more than one variable (atomic or not) shared across threads, because with relaxed, threads can disagree about how operations on different variables are interleaved, and stronger orders fix that.
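For example (a minimal sketch of the classic two-variable case, not code from the question): publishing plain data through an atomic flag is exactly where relaxed breaks down and release/acquire is needed:

#include <atomic>

int payload = 0;                          // plain, non-atomic shared data
std::atomic<bool> ready = false;

void producer() {
    payload = 42;                                    // write the data
    ready.store(true, std::memory_order::release);   // publish it
}

void consumer() {
    while (!ready.load(std::memory_order::acquire)) {}  // wait for the publication
    // the acquire load synchronizes-with the release store, so the write to payload
    // is visible here; with relaxed on both sides this read would be a data race
    int x = payload;    // guaranteed to be 42
    (void)x;
}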

answered Oct 18 '25 by HolyBlackCat


