I am trying to improve my understanding of memory barriers. Suppose we have a weak memory model and we adapt Dekker's algorithm. Is it possible to make it work correctly under the weak memory model by adding memory barriers?
I think the answer is a surprising no. The reason (if I am correct) is that although a memory barrier can be used to ensure that a read is not moved past another, it cannot ensure that a read does not see stale data (such as that in a cache). Thus it could see some time in the past when the critical section was unlocked (per the CPU's cache) but at the current time other processors might see it as locked. If my understanding is correct, one must use interlocked operations such as those commonly called test-and-set or compare-and-swap to ensure synchronized agreement of a value at a memory location among multiple processors.
Thus, can we rightly expect that no weak memory model system would provide only memory barriers? The system must supply operations like test-and-set or compare-and-swap to be useful.
I realize that popular processors, including x86, provide memory models much stronger than a weak memory model. Please focus the discussion on weak memory models.
(If Dekker's algorithm is a poor choice, choose another mutual exclusion algorithm where memory barriers can successfully achieve correct synchronization, if possible.)
A memory barrier is an instruction that requires the processor to apply an ordering constraint between memory operations that occur before and after the memory barrier instruction in the program. Such instructions are also known as memory fences in other architectures.
Memory barrier is implemented by the hardware processor. CPUs with different architectures have different memory barrier instructions. Therefore, the programmer needs to explicitly call memory barrier in the code to solve the preceding problem.
There is no difference. "Fence" and "barrier" mean the same thing in this context.
Fencing is the process of isolating a node of a computer cluster or protecting shared resources when a node appears to be malfunctioning.
You are right that a memory barrier cannot ensure that a read sees up-to-date values. What it does do is enforce an ordering between operations, both on a single thread, and between threads.
For example, if thread A does a series of stores and then executes a release barrier before a final store to a flag location, and thread B reads from the flag location and then executes an acquire barrier before reading the other values then the other variables will have the values stored by thread A:
// initially x=y=z=flag=0
// thread A
x=1;
y=2;
z=3;
release_barrier();
flag=1;
// thread B
while(flag==0) ; // loop until flag is 1
acquire_barrier();
assert(x==1);  // asserts will not fire
assert(y==2);
assert(z==3);
Of course, you need to ensure that the load and store to flag is atomic (which simple load and store instructions are on most common CPUs, provided the variables are suitably aligned). Without the while loop on thread B, thread B may well read a stale value (0) for flag, and thus you cannot guarantee any of the values read for the other variables.
Fences can thus be used to enforce synchronization in Dekker's algorithm.
Here's an example implementation in C++ (using C++0x atomic variables):
std::atomic<bool> flag0(false),flag1(false);
std::atomic<int> turn(0);
void p0()
{
    flag0.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    while (flag1.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 0)
        {
            flag0.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 0)
            {
            }
            flag0.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    // critical section
    turn.store(1,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag0.store(false,std::memory_order_relaxed);
}
void p1()
{
    flag1.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    while (flag0.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 1)
        {
            flag1.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 1)
            {
            }
            flag1.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    // critical section
    turn.store(0,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag1.store(false,std::memory_order_relaxed);
}
For a full analysis see my blog entry at http://www.justsoftwaresolutions.co.uk/threading/implementing_dekkers_algorithm_with_fences.html
Say you put in a load and store barrier after every statement, and in addition you ensured that the compiler didn't reorder your stores. Wouldn't this, on any reasonable architecture, provide strict consistency? Dekker's works on sequentially consistent architectures. Sequential consistency is a weaker condition than strict consistency.
http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html
Even on a CPU that has a weak consistency model, you'd still expect cache coherence. I think that where things get derailed is the behavior of store buffers and speculated reads, and what operations are available flush stored writes and invalidate speculated reads. If there isn't a load fence that can invalidate speculated reads, or there isn't a write fence that flushes a store buffer, in addition to not being able to implement Dekker's, you won't be able to implement a mutex!
So here's my claim. If you have a write barrier available, and a read barrier, and the cache is coherent between CPUs, then you can trivially make all code sequentially consistent by flushing writes (store fence) after every instruction, and flushing speculations (read fence) before every instruction. So I claim that you don't need atomics for what you're talking about, and that you can do what you need with Dekker's only. Sure you wouldn't want to.
BTW, Corensic, the company I work for, writes cool tools for debugging concurrency issues. Check out http://www.corensic.com.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With