Consider this typical example:
// x and y are std::atomic<int>, both initially 0
// Thread 1:
r1 = y.load(std::memory_order_relaxed); // A
x.store(r1, std::memory_order_relaxed); // B
// Thread 2:
r2 = x.load(std::memory_order_relaxed); // C
y.store(42, std::memory_order_relaxed); // D
A possible outcome is r1 == r2 == 42. Most people like to use "reordering" to interpret this example; for instance: D and A are reordered, as if D were executed before A, while the other statements execute in the order they appear in the source. This phrasing suggests that there exists a global coordinate on which we can compare the executions of multiple threads, or that thread 1 could somehow observe how D in thread 2 is executed and recognize that D executed before A. I think "reordering" can make things worse when reasoning about programs.
In this example, D is executed only by thread 2; in other words, if thread 2 makes no progress on D, no side effect on the object y can exist. Moreover, saying that D and A are reordered such that D executes before A sounds as if these two threads were interleaved on a single physical CPU; however, that is unspecified by the ISO C++ standard. The two threads may well run on two separate physical CPUs, in which case the executions in thread 1 and the executions in thread 2 are independent and do not directly interfere with each other.
A and D merely share the same memory location; they are executed by different threads, and thread 1 neither cares about nor can observe how D is executed. The only relevant question is whether the side effect produced by the execution of D is visible to A, which is governed by "happens before" (for non-atomic and atomic objects), plus [intro.races] p10 and the coherence rules for atomic objects.
Furthermore, saying that D is executed before A does not prove that the side effect produced by D is visible to A. [intro.execution] p8 says:
Sequenced before is an asymmetric, transitive, pair-wise relation between evaluations executed by a single thread ([intro.multithread]), which induces a partial order among those evaluations. Given any two evaluations A and B, if A is sequenced before B (or, equivalently, B is sequenced after A), then the execution of A shall precede the execution of B.
The proposition is: if A is sequenced before B, then the execution of A precedes the execution of B, with the precondition that A and B are executed by a single thread. It is wrong to infer the converse; that is, we cannot say that if the execution of A precedes the execution of B, then A is sequenced before B and therefore A is visible to B. From this point of view, "reordering" is meaningless. I think this is the source of the confusion: when "X is executed" implies "Y has been executed", people intuitively establish an order between X and Y because of the "has been" applied to execution, and conclude that Y happens before X.
In the typical example above, using "reordering" may match people's intuition, but it causes trouble when analyzing an example that requires proof by contradiction. For example:
#include <atomic>
#include <cassert>
#include <thread>

int main() {
    std::atomic<int> v = 0;
    std::atomic<bool> flag = false;
    std::thread t1([&] {
        while (!flag.load(std::memory_order::relaxed)) {}        // #1
        assert(v.exchange(2, std::memory_order::relaxed) == 1);  // #2
    });
    std::thread t2([&] {
        if (v.exchange(1, std::memory_order::relaxed) == 0) {    // #3
            flag.store(true, std::memory_order::relaxed);        // #4
        }
    });
    t1.join();
    t2.join();
}
To prove that #2 never fails, we need to rule out the case where the condition in the if at #3 is false. That is, the execution of #4 affects whether #2 is reachable at all: #2 is reachable only after the loop at #1 exits, and that conclusion is guaranteed by [intro.execution] p8.
If we used "reordering", or tied the orders defined by the ISO C++ standard to the execution of expressions, it would cause confusion. For example: can #2 and #3 be reordered? Answering that in turn requires considering whether #2 can be reordered across a potentially infinite loop at #1, and so on.
So, is reordering really a useful concept for multithread program reasoning?
In ISO C++, if you can manage to think only in terms of the formalism of the standard, and never about how that maps to real CPUs, you can get away without it.
When answering language-lawyer questions or trying to prove something is safe, allowed reorderings are just a shortcut for quickly spotting some possible undesired results. For instance, you can easily see that something can happen in the abstract machine if it can happen on AArch64 the way the code would compile there. (AArch64 is often a good choice: it directly provides seq_cst and acq/rel stores and loads that aren't much stronger than what's necessary to conform, except of course it's a real-world ISA with coherent cache. It's also multi-copy-atomic, so no IRIW reordering.)
But exhausting all possible reorderings doesn't prove that the abstract machine doesn't allow something. For that you do need to reason with the C++ memory model if there's anything non-trivial.
Looking at possible reorderings is a quick early-out in the mental process of checking things, if that way of thinking comes naturally to you, before diving into the formalism of the standard.
Reordering is pretty intuitive to understand at a basic level, e.g. https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ . (But keep in mind there's also store-forwarding, when a load reloads a store by the same thread and they weren't both seq_cst. That reordering article doesn't mention that effect.)
In some other languages, notably C# / .NET, the memory model is defined in terms of allowed reorderings around various types of operations: see "What memory model is implemented in .NET Core?". (C# doesn't have data-race UB; in fact, being a memory-safe environment, it doesn't have UB at all. For non-atomic / non-Volatile types, it has some no-tearing guarantees for some type sizes, but without visibility guarantees.)
Most hardware ISA memory models are defined in terms of allowed or disallowed reorderings, with various litmus tests. For example, https://arangodb.com/2021/02/cpp-memory-model-migrating-from-x86-to-arm/ summarizes x86's TSO (Total Store Order) model, which was formalized in a third-party paper (Sewell et al. 2010), and the weak ARMv7 / POWER models, which aren't multi-copy-atomic. (ARMv8 is similar but is multi-copy-atomic.) That article also talks about the C++ model, but doesn't really contrast them.
The Linux kernel memory model predates ISO C++'s, and is defined in terms of orderings and barriers. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html describes it as "currently defined very informally in the memory-barriers.txt, atomic_ops.rst, atomic_bitops.txt, atomic_t.txt, and refcount-vs-atomic.rst files in the source tree".
You can reason about an algorithm in terms of allowed reorderings and which orders are necessary to preserve the invariants you need.
I assume it's possible to make formal arguments that way about program correctness, at least for memory models defined that way. ISO C++ isn't, so you'd first need a mapping from a reordering model to ISO C++'s model if you wanted to prove things about the abstract machine, which sounds complicated. But if you're reasoning about a similar C# program, reordering would be the way to go.
"This sounds like there exists a global coordinate that we can compare the executions in multiple threads on."
On multi-copy-atomic architectures, there is a global order of all stores, even to separate locations; or at least, no threads can disagree about the order of two stores they didn't do themselves. That order is not necessarily consistent with program order (unless you're on a stronger memory model like x86-TSO).
Even x86's strong (TSO = Total Store Order) memory model allows StoreLoad reordering. It's basically program order plus a store buffer with store-forwarding, accessing coherent shared cache. (Probably not coincidentally, that was what 486 had, and 486 was the first x86 to properly support SMP without third-party chipsets. All later x86 have tried to avoid breaking multithreaded code that worked on 486, hence the strongly-ordered memory model.) https://preshing.com/20120930/weak-vs-strong-memory-models/
So anyway, yes, you can reason about the StoreLoad litmus test and find the possible results with models that do include a global timeline of stores, unlike C++'s. See "Globally Invisible load instructions" for some discussion of where loads take their value from on x86, and of the possibility of a narrow store + wider reload seeing a value that never appears in memory, which no other thread could see. (C++ can't do this without data-race UB.) So even on strongly-ordered x86, results aren't always explainable by having an atomic load take all its bytes from one point in the total order of stores.