I have a good conceptual understanding of C++11's std::memory_order types (relaxed vs acquire-release vs sequentially consistent ...), but I'd like to have a better understanding of how they are typically implemented (by a compiler) for x86 (or x86_64) targets.
Specifically, I'm looking for a comparison of the low-level details (such as the important memory-related CPU instructions for synchronizing state or cache between processors) for each of the order constraints (memory_order_consume, memory_order_acquire, memory_order_release, and memory_order_seq_cst).
Please provide as much low-level detail as possible, preferably for x86_64 or a similar architecture. Your help will be very much appreciated.
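On x86 and x86_64, loads have acquire semantics and stores have release semantics anyway, even without using atomics, so for atomic loads and stores all the memory orders except seq_cst require no special instructions at all.
To get full sequential consistency the compiler can insert an mfence instruction (typically after a seq_cst store) to prevent that store from being reordered with a later load from a distinct memory location, which is the one reordering x86 otherwise permits. I don't think any other special instructions are needed for loads and stores; atomic read-modify-write operations use lock-prefixed instructions whatever memory order is requested.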
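To make that concrete, here is a minimal sketch of the operations in question, with the instructions mainstream compilers typically emit for x86-64 noted in comments. The variable and function names are made up for illustration, and the instruction choices are what GCC and Clang commonly produce rather than anything the standard guarantees, so check your own compiler's output.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared variable used only for this illustration.
std::atomic<int> x{0};

// Release store: typically a plain `mov` on x86-64.
void store_release(int v) { x.store(v, std::memory_order_release); }

// Sequentially consistent store: typically `xchg` (implicitly locked),
// or `mov` followed by `mfence`, depending on the compiler.
void store_seq_cst(int v) { x.store(v, std::memory_order_seq_cst); }

// Acquire load: typically a plain `mov`.
int load_acquire() { return x.load(std::memory_order_acquire); }

// Sequentially consistent load: typically also a plain `mov`,
// because the cost is paid on the store side.
int load_seq_cst() { return x.load(std::memory_order_seq_cst); }

// Read-modify-write: typically `lock xadd`, even with relaxed ordering.
int add_relaxed(int d) { return x.fetch_add(d, std::memory_order_relaxed); }

// Single-threaded walk through the calls above.
void demo() {
    store_release(1);
    assert(load_acquire() == 1);
    store_seq_cst(2);
    assert(load_seq_cst() == 2);
    assert(add_relaxed(3) == 2);  // fetch_add returns the previous value
    assert(load_acquire() == 5);
}
```

Compiling this with optimisation and looking at the assembly (for example with -S) is the easiest way to see which of these your own toolchain picks.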
Compilers need to avoid moving loads and stores across atomic operations, but that's purely a limitation on the compiler optimisers and requires no CPU instructions to be issued.
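As a rough illustration of that compiler-side constraint, here is a small message-passing sketch. The names (payload, ready, run_message_passing) are hypothetical. On x86-64 both the release store and the acquire load normally compile to ordinary mov instructions, so the only thing the memory orders change here is what the optimiser is allowed to reorder.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, chosen for this illustration only.
int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the data

void producer() {
    payload = 42;
    // Typically a plain `mov` on x86-64. The release order forbids the
    // compiler from sinking the write to `payload` below this store.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Typically a plain `mov` in a loop. The acquire order forbids the
    // compiler from hoisting the read of `payload` above this load.
    while (!ready.load(std::memory_order_acquire)) {
    }
    return payload;
}

int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = 0;
    std::thread writer(producer);
    std::thread reader([&seen] { seen = consumer(); });
    writer.join();
    reader.join();
    return seen;  // 42: the acquire load synchronises with the release store
}
```

If the flag were a plain bool instead, the hardware would still behave the same way on x86, but the code would have a data race and the compiler would be free to reorder the accesses or hoist the flag read out of the loop.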
See http://www.stdthread.co.uk/forum/index.php?topic=72.0 for some good information.
Herb Sutter breaks this down for x86 and other architectures, including PowerPC and ARM, in his atomic<> Weapons talks from C++ and Beyond 2012. I think the relevant slides are in the second part, but the first part is also worth watching.