Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C++17 Lock-free allocator crashing -- "access violation reading location"

Tags:

c++

allocation

I have a basic block-based allocator that will be used to allocate memory for POD-types in a multi-threaded setting, where all threads will be requesting or releasing memory simultaneously to a shared memory pool. My current implementation is detailed here:

#include <concurrent_queue.h>
#include <atomic>

template <typename T, size_t BlockSize> class BlockAlloc {
private:
  // element_t is either the data for T or a pointer to the next free element_t
  struct element_t {
    ::byte buffer[sizeof(T) > sizeof(element_t*) ? sizeof(T) : sizeof(element_t*)];
    element_t*& next() {return reinterpret_cast<element_t*&>(buffer);};
    T*& data() {return reinterpret_cast<T*&>(buffer);};
  };

  // block_t is a contiguous block of elements
  struct block_t {
    element_t elements[BlockSize];
  };

  // Allocate one new block of contiguous elements
  void AllocBlock() {
    block_t* block = new block_t();

    // each element in the block points to the previous element...
    for (int i = 1; i < BlockSize; i++)
    block->elements[i].next() = &block->elements[i - 1];

    // free should point to the final element in the block, 
    // while the first element of the block points to free's old location.
    // Update made with compare-and-swap loop:
    while (!free.compare_exchange_strong(block->elements[0].next() = free.load(), &block->elements[BlockSize-1])) {}

    // add the block to the list of all allocated blocks
    blocks.push(block);
  };

  // Release all memory held by all blocks
  void ReleaseBlocks() {
    free = nullptr;
    block_t* block {nullptr};
    while (blocks.try_pop(block)) delete block;
  };

public:
  BlockAlloc() : blocks {}, free(nullptr) { AllocBlock(); };
  ~BlockAlloc() { ReleaseBlocks(); };

  // Acquire a new element from the free list and construct it.
  template <typename... TArgs> _type_* Alloc(TArgs &&... a) {
    _type_* data {nullptr};
    element_t* element {nullptr};

    // Grab the free element and set free to the next item in that list
    // Update made with compare-and-swap loop:
    while (true) {
      if (element = free.load()) {
        // CRASHES HERE DURING THE CMPxSWP,
        // 0xC0000005: "access violation reading location"
        if (free.compare_exchange_strong(element, element->next())) {
          data = element->data();
          new (data) _type_(std::forward<TArgs>(a)...);
          return data;
        }
      }
      else {
        // if free is empty, grow the list
        AllocBlock();
      }
    }
  };

  // Destroys the element and return its memory to the free list
  void Free(_type_* element) {
    if (element == nullptr) {return;}
    element->~_type_();
    element_t* t = (element_t*)(element);

    while (true) 
      if (free.compare_exchange_strong(t->next() = free.load(), t)) 
        return;
  };

  concurrency::concurrent_queue<block_t*> blocks;
  std::atomic<element_t*> free;
};

Running in a single thread this implementation works correctly, however when testing in a multi-threaded application, this code will quickly crash at the compare-and-swap operation of the Alloc call, with exception:

0xC0000005: "access violation reading location". 

I cannot tell by inspection what I am doing incorrectly that would cause this exception to be thrown. Any advice or input?

like image 661
Robert Good Avatar asked Sep 02 '25 04:09

Robert Good


1 Answers

free attempts to be a lock-free stack of addresses, but it falls short in ABA protection. Thread can update the free pointer while another thread is in middle of storing (a stale) next pointer:

A: while (!free.compare_exchange_strong(block->elements[0].next() = free.load(), &block->elements[BlockSize-1])) {}

A: x = load(free)    B: y = load(free)
   store(next, x)       
   <thread A goes to lunch>
                     B: reorders top two items in the stack.
A: x is now stale.
   CAS(free, x, addr)

You can use a 16-byte CAS and a counter that you increment on each attempted CAS. However, due to the standard library ABI, GCC/Clang can refuse to emit the CMPXCHG16B on x86-64. So static_assert(std::atomic<bytes16>::is_always_lock_free); would error, and you would get bad performance if you ignored this.

The last (portable-and-performant, but limited count of threads) option I can think of is to align the memory addresses to say 32-byte boundaries and use the 5 low zero bits of the address as an ABA counter. Non-portable, (but common) way is to use the high 8 or 16-bits of the virtual address for the ABA-tag.

Edit: I'd like add a note about the ABA counter. It is enough that each thread stores an per-thread unique tag with the CAS. Such a per-thread constant tag is likely better than a incrementing counter since it cannot overflow, which can silently break an otherwise working lock-free algorithm.

Why is your element buffer storing next pointers and data values in the same memory (and why are you not using a union for that purpose)?

Space-compaction, not that this matters much here. The next pointer is only needed for items that are in the stack. (i.e. free). AllocBlock() pushes a block of addresses at once. Alloc() pops a single address, after which the succeeding thread owns that piece of memory. (Push/pop requires ABA protection.) Free() then pushes an address onto the stack to be reused by Alloc().

like image 167
JATothrim Avatar answered Sep 04 '25 18:09

JATothrim