I have a basic block-based allocator that will be used to allocate memory for POD-types in a multi-threaded setting, where all threads will be requesting or releasing memory simultaneously to a shared memory pool. My current implementation is detailed here:
#include <concurrent_queue.h>
#include <atomic>
template <typename T, size_t BlockSize> class BlockAlloc {
private:
  // element_t is either the data for T or a pointer to the next free element_t
  struct element_t {
    ::byte buffer[sizeof(T) > sizeof(element_t*) ? sizeof(T) : sizeof(element_t*)];
    element_t*& next() {return reinterpret_cast<element_t*&>(buffer);};
    T*& data() {return reinterpret_cast<T*&>(buffer);};
  };
  // block_t is a contiguous block of elements
  struct block_t {
    element_t elements[BlockSize];
  };
  // Allocate one new block of contiguous elements
  void AllocBlock() {
    block_t* block = new block_t();
    // each element in the block points to the previous element...
    for (int i = 1; i < BlockSize; i++)
    block->elements[i].next() = &block->elements[i - 1];
    // free should point to the final element in the block, 
    // while the first element of the block points to free's old location.
    // Update made with compare-and-swap loop:
    while (!free.compare_exchange_strong(block->elements[0].next() = free.load(), &block->elements[BlockSize-1])) {}
    // add the block to the list of all allocated blocks
    blocks.push(block);
  };
  // Release all memory held by all blocks
  void ReleaseBlocks() {
    free = nullptr;
    block_t* block {nullptr};
    while (blocks.try_pop(block)) delete block;
  };
public:
  BlockAlloc() : blocks {}, free(nullptr) { AllocBlock(); };
  ~BlockAlloc() { ReleaseBlocks(); };
  // Acquire a new element from the free list and construct it.
  template <typename... TArgs> _type_* Alloc(TArgs &&... a) {
    _type_* data {nullptr};
    element_t* element {nullptr};
    // Grab the free element and set free to the next item in that list
    // Update made with compare-and-swap loop:
    while (true) {
      if (element = free.load()) {
        // CRASHES HERE DURING THE CMPxSWP,
        // 0xC0000005: "access violation reading location"
        if (free.compare_exchange_strong(element, element->next())) {
          data = element->data();
          new (data) _type_(std::forward<TArgs>(a)...);
          return data;
        }
      }
      else {
        // if free is empty, grow the list
        AllocBlock();
      }
    }
  };
  // Destroys the element and return its memory to the free list
  void Free(_type_* element) {
    if (element == nullptr) {return;}
    element->~_type_();
    element_t* t = (element_t*)(element);
    while (true) 
      if (free.compare_exchange_strong(t->next() = free.load(), t)) 
        return;
  };
  concurrency::concurrent_queue<block_t*> blocks;
  std::atomic<element_t*> free;
};
Running in a single thread this implementation works correctly, however when testing in a multi-threaded application, this code will quickly crash at the compare-and-swap operation of the Alloc call, with exception:
0xC0000005: "access violation reading location". 
I cannot tell by inspection what I am doing incorrectly that would cause this exception to be thrown. Any advice or input?
free attempts to be a lock-free stack of addresses, but it falls short in ABA protection. Thread can update the free pointer while another thread is in middle of storing (a stale) next pointer:
A: while (!free.compare_exchange_strong(block->elements[0].next() = free.load(), &block->elements[BlockSize-1])) {}
A: x = load(free)    B: y = load(free)
   store(next, x)       
   <thread A goes to lunch>
                     B: reorders top two items in the stack.
A: x is now stale.
   CAS(free, x, addr)
You can use a 16-byte CAS and a counter that you increment on each attempted CAS. However, due to the standard library ABI, GCC/Clang can refuse to emit the CMPXCHG16B on x86-64. So static_assert(std::atomic<bytes16>::is_always_lock_free); would error, and you would get bad performance if you ignored this.
The last (portable-and-performant, but limited count of threads) option I can think of is to align the memory addresses to say 32-byte boundaries and use the 5 low zero bits of the address as an ABA counter. Non-portable, (but common) way is to use the high 8 or 16-bits of the virtual address for the ABA-tag.
Edit: I'd like add a note about the ABA counter. It is enough that each thread stores an per-thread unique tag with the CAS. Such a per-thread constant tag is likely better than a incrementing counter since it cannot overflow, which can silently break an otherwise working lock-free algorithm.
Why is your element
bufferstoringnextpointers anddatavalues in the same memory (and why are you not using aunionfor that purpose)?
Space-compaction, not that this matters much here. The next pointer is only needed for items that are in the stack. (i.e. free). AllocBlock() pushes a block of addresses at once. Alloc() pops a single address, after which the succeeding thread owns that piece of memory. (Push/pop requires ABA protection.) Free() then pushes an address onto the stack to be reused by Alloc().
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With