I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programing).
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
  int val;
  int *pval;
  do
  {
    // fetch the pointer to the value
    pval = *ptr;
    // if its NULL, then just return NULL, the smart pointer
    // will then become NULL as well
    if(pval == NULL)
      return NULL;
    // Grab the reference count
    val = lwarx(pval);
    // make sure the pointer we grabbed the value from
    // is still the same one referred to by  'ptr'
    if(pval != *ptr)
      continue;
    // Increment the reference count via 'stwcx' if any other threads
    // have done anything that could potentially break then it should
    // fail and try again
  } while(!stwcx(pval, val + 1));
  return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).
Thanks
if you are on 64 bits and limit yourself to say 1tb of heap, you can pack the counter into the 24 unused top bits. if you have word aligned pointers the bottom 5 bits are also available.
int* IncrementAndRetrieve(int **ptr)
{
  int val;
  int *unpacked;
  do
  {   
    val = *ptr;
    unpacked = unpack(val);
    if(unpacked == NULL)
      return NULL;
    // pointer is on the bottom
  } while(!cas(unpacked, val, val + 1));
  return unpacked;
}
Don't know if LWARX and STWCX invalidate the whole cache line, CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer) you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously casrificed 64b, planed their structures around it (packing stuff that won't be subject of contention), kept everything alligned on 64b boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approx 20 to 100 cycles, making it a bigger real perf issue then just lock avoidance.
Also, you'd have to plan different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all it's memory bulk at the end) or datailed allocation management so that one structure under contention never receives memory realesed by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive but it's either that or paying the price for GC.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With