I have a situation in multi-threaded C++ code where I need to make some very fast operations atomic (appear serialised) so I can use a spinlock such as:
lock mutex: while (lock.test_and_set(std::memory_order_acquire))
unlock mutex: lock.clear(std::memory_order_release);
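Wrapped up as a class, that pair of operations looks roughly like this (a minimal sketch; the demo function and its parameters are just for illustration):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Minimal test-and-set spinlock built from std::atomic_flag,
// matching the lock/unlock pairs shown above.
class Spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Demo: several threads increment a shared counter under the lock.
long run_demo(int nthreads, int iters) {
    Spinlock lk;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lk.lock();
                ++counter;          // protected by the spinlock
                lk.unlock();
            }
        });
    for (auto& th : threads) th.join();
    return counter;
}
```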
However I thought to be clever, and make the locking conditional on whether the data structure is currently shared by more than one thread or not:
lock mutex: if (lockneeded) while (lock.test_and_set(std::memory_order_acquire))
unlock mutex: if (lockneeded) lock.clear(std::memory_order_release);
Initially, the data structure is owned by only one thread but access can be granted to another thread by the owner at which point it has to set the lock-needed variable (which has to be an atomic bool itself).
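Sketched as code, the scheme I have in mind looks like this (names are made up for illustration):

```cpp
#include <atomic>

// Illustrative sketch of the conditional-locking scheme.
struct MaybeLocked {
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    std::atomic<bool> lockneeded{false};

    void acquire() {
        if (lockneeded.load(std::memory_order_acquire))
            while (lock.test_and_set(std::memory_order_acquire)) {}
    }
    void release() {
        if (lockneeded.load(std::memory_order_acquire))
            lock.clear(std::memory_order_release);
    }
    // Called by the owning thread when granting access to another thread.
    void share() { lockneeded.store(true, std::memory_order_release); }
};

int demo() {
    MaybeLocked m;
    m.acquire(); m.release();   // unshared: the lock itself is never touched
    m.share();
    m.acquire(); m.release();   // shared: a real test_and_set round trip
    return 0;
}
```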
Will this work?
EDIT: some context. I have a system which schedules coroutines. A queue of suspended coroutines is run one at a time by a single thread until it suspends or completes, then the next one is run. This system was originally designed for a single thread, since coroutines by specification are sequential programming constructs. The context switching time is extremely fast because coroutines use heap allocated linked lists for stacks, not the machine stack. So a context switch is basically just a pointer swap.
Then I decided to optionally allow more than one thread to process the list, so the coroutines become processes. Now the pointer swaps must be done atomically. The swaps are extremely fast, so a spinlock seems like the right way to guard the operations.
I have a test case where I run a set of jobs serially, then do it again with extra helper threads. I had a problem, which I have now fixed, that turned out to be unrelated to scheduling. Now, 4 threads run the workload about 3.5 times faster than 1.
The performance objective is simple: I want to wipe Go-lang off the face of the planet. My system is C/C++ ABI compliant (Go isn't), it uses the correct model for stream processing (Go doesn't), and it's hands down a vastly superior language as well.
I don't know how fast Go can context switch. But the current untuned version of my test case, in which (remember) each job counts to 100K to create a delay and ensure near-zero contention on the lock, is processing 2 MILLION processes in 5 seconds, which is a context switching rate of about 400K switches per second. I expect that if I replace the slow jobs with null jobs (do-nothing coroutines), the rate will exceed 1 million switches per second. That's running 2 million processes. The real-world speed will be lower; the experiment is trying to find the upper bound on performance.
No, unfortunately this will not work.
Say Thread A sees lockneeded is false and enters the critical section without acquiring lock, then a context switch occurs in the middle of the critical section. Thread B requests access to the data structure. The data structure doesn't know Thread A is in the critical section, so Thread B is granted access. lockneeded is set to true, but Thread A is already inside its critical section. Thread B then acquires lock and enters the critical section concurrently with Thread A: a data race, which is undefined behavior.
It can't work unless you can guarantee that lockneeded won't change during the critical section. One way to guarantee that lockneeded won't change is to use a lock to protect it. But then you would need to take a lock on every access of lockneeded, defeating the purpose of the variable in the first place.
Efficient C++ spinlocks
A spinlock is conceptually simple, yet there are many flavors available. The important factors to consider are performance requirements (does it really need to be that efficient?), architecture, threading library, desired scalability, amount of expected contention (if contention is rare, you can optimize for the uncontended case), asymmetry of critical sections using the same lock (to prevent thread starvation), ratio of reading to writing... You can see that if you need it to be super efficient, there's a lot of performance testing you need to do. So if you don't really need the performance, you should just use the spinlock you have and spend your time elsewhere.
But we're computer scientists and we like the most efficient solution because we're problem solvers. For a highly contended, highly scalable spinlock, check out the MCS lock. For a generally good spinlock, I ran some tests a while ago and found that pthreads' spinlock was pretty scalable.
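For reference, the pthreads spinlock is used like this (POSIX; available on Linux but not on every platform, and the demo function is just an illustration):

```cpp
#include <pthread.h>
#include <thread>
#include <vector>

// Usage sketch of the POSIX pthread spinlock mentioned above.
long pthread_spin_demo(int nthreads, int iters) {
    pthread_spinlock_t lk;
    pthread_spin_init(&lk, PTHREAD_PROCESS_PRIVATE);
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                pthread_spin_lock(&lk);
                ++counter;          // protected by the spinlock
                pthread_spin_unlock(&lk);
            }
        });
    for (auto& th : threads) th.join();
    pthread_spin_destroy(&lk);
    return counter;
}
```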
And there is another way to guarantee that Thread A is not in the critical section without Thread A having to write anything. It's called rcu_synchronize and, to grossly oversimplify, it would involve Thread B setting lockneeded and waiting a sufficient amount of time to guarantee that any thread in a critical section will finish it.
The naive spinlock scales poorly because of the bus traffic caused by cache misses on the lock variable (each write invalidates the cache lines of the other cores that are also spinning).
A simple optimization you can do is the "spin on read" spinlock:
lock mutex: while (lock.test(std::memory_order_acquire) || lock.test_and_set(std::memory_order_acquire)) {} (note: std::atomic_flag only gained the plain-load test() in C++20; before that, use a std::atomic&lt;bool&gt; with load() and exchange(true))
unlock mutex: no change
So if another thread has the lock, this thread doesn't bother with the TSL (due to OR short-circuiting), but when the other thread releases the lock, the thread attempts the TSL, which may or may not be successful. Unfortunately, this lock performs as poorly as the naive spinlock in high-scaling scenarios, but it might save you some cycles over the naive spinlock in a low-scaling, medium-contention situation.
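A sketch of the "spin on read" (test-and-test-and-set) lock, using std::atomic&lt;bool&gt; so it compiles before C++20 (exchange(true) stands in for test_and_set; the demo function is illustrative):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// "Spin on read" spinlock: spin on cheap loads until the lock looks
// free, and only then attempt the read-modify-write.
class TTASLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        while (locked.load(std::memory_order_relaxed) ||
               locked.exchange(true, std::memory_order_acquire)) {}
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

long ttas_demo(int nthreads, int iters) {
    TTASLock lk;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lk.lock();
                ++counter;          // protected by the spinlock
                lk.unlock();
            }
        });
    for (auto& th : threads) th.join();
    return counter;
}
```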