Having been reading through Understanding the Linux kernel (Bovet & Cesati), the chapter on Kernel Synchronisation states that the spin lock acquisition code boils down to:
1: lock:
   btsl    $0, slp
   jnc     3
2: testb   $1, slp
   jne     2
   jmp     1
3: 
Now I originally thought that it seemed wasteful to have nested loops and you could implement something like:
1: lock:
   btsl    $0, slp
   jc      1
which would be a lot simpler. However, I see why they did it since the lock affects the other CPUs and the timings for the btsl are larger than those for a simple testb.
The one thing I haven't been able to get my head around is the subsequent release of the spin lock. The book states that it yields the following:
   lock:
   btrl    $0, slp
My question is basically why? It seems to me that a lock/mov-immediate combo is faster.
You don't need to get the old state to the carry flag since, following the rule that the kernel is bug-free (assumed in lots of other places inside said kernel), the old state will be 1 (you wouldn't be trying to release it if you hadn't already acquired it).
And a mov is much faster than a btrl, at least on the 386.
So what am I missing?
Have the timings changed for those instructions on later chips?
Has the kernel been updated since the book was printed?
Is the book just plain wrong (or showing simplified instructions)?
Have I missed some other aspect involving syncronisation between CPUs that the faster instruction doesn't satisfy?
Well, Understanding the Linux Kernel is old. Since it was written, the Linux kernel was updated to use the so-called ticket spinlocks. The lock is basically a 16-bit quantity split in two bytes: let's call one Next (like the next ticket in a dispenser) and the other Owner (like the 'Now Serving' number over a counter). A spinlock is initialized with both parts set to zero. Locking notes the value of the spinlock and incrementing Next, atomically. If the value of Next before incrementing is equal to Owner, the lock has been obtained. Otherwise, it spins until Owner is incremented to the right value and so forth.
The relevant code is in asm/spinlock.h (for x86). The unlock operation is indeed much faster and simpler than the book says:
static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
{
    asm volatile(UNLOCK_LOCK_PREFIX "incb %0"
         : "+m" (lock->slock)
         :
         : "memory", "cc");
}
since inc is about 8 or 9 times faster than btr.
Hope this helps; if not, I'd be happy to dig deeper.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With