I am working on a server having to read from many thousands of socket clients connected at the same time. The client requests are constituted by messages having all the same exact size of about 32 bytes.
I am reading about the slab allocator and I'd like to use this particular technique for my application when I call read to get the data out of the socket (read copies data from kernel buffer into a buffer of my choice and I'd like to use some memory dynamically allocated).
While I am reading it seems that Linux kernel is already using this technique. If this is used for implementing malloc or new is it still worth for me to do it given that allocation is already performant?
I was thinking that I might be better off by using allocation on the stack without SLAB algorithm but I am not sure which is the best approach.
If you are a C programmer, of course you should get dirty with memory management!
However, you will probably not encounter any problems with simply malloc'ing each request, unless you are really getting close to the limits of your machine, which seems unlikely. But I believe it's better to know your alternatives than to take someone's word for it. Here are some ideas to consider.
The simplest alternative is to use a single global array of request slots, and track which ones are in use. This means a static limit on how many requests, but on the other hand, no overhead and no real issue with fragmentation. Just set the limit really high.
Here's an example implementation. If you aren't familiar with bitwise operations, it may look a little confusing, but the gist is that we have an additional array containing one bit (on or off) per request slot, specifying whether that slot is in use. You could, instead, add an 'is_used' variable to the struct itself, but that would end up padding the struct out by a lot more than one bit, working against our goal of minimizing overhead.
Header file is quite minimal (this is the true beauty of C, by the way!):
typedef struct request_s {
  /* your 32 bytes of information */
  unsigned char data[32];
} request_t;
request_t *alloc_request(void);
void free_request(request_t *req);
Source file:
#include the header file
/* note: this example is not written with multithreading in mind */
/* allow a million requests (total 32 MB + 128 KB of memory) */
#define MAX_REQUESTS (1*1024*1024)
static request_t g_requests[MAX_REQUESTS];
/* use one bit per request to store whether it's in use */
/* unsigned int is 32 bits. shifting right by 5 divides by 32 */
static unsigned int g_requests_used[MAX_REQUESTS >> 5];
request_t *alloc_request(void) {
  /* note: this is a very naive method. you really don't want to search
   * from the beginning every time, but i'll leave improving that as an
   * exercise for you. */
  unsigned int word_bits;
  unsigned int word, bit;
  /* look through the bit array one word (i.e., 32 bits) at a time */
  for (word = 0; word < (MAX_REQUESTS >> 5); word++) {
    word_bits = g_requests_used[word];
    /* we can tell right away whether the entire chunk of 32 requests is
     * in use, and avoid the inner loop */
    if (word_bits == 0xFFFFFFFFU)
      continue;
    /* now we know there is a gap somewhere in this chunk, so we loop
     * through the 32 bits to find it */
    for (bit = 0; bit < 32; bit++) {
      if (word_bits & (1U << bit))
        continue; /* bit is set, slot is in use */
      /* found a free slot */
      g_requests_used[word] |= 1U << bit;
      return &g_requests[(word << 5) + bit];
    }
  }
  /* we're all out of requests! */
  return NULL;
}
void free_request(request_t *req) {
  /* make sure the request is actually within the g_requests block of
   * memory */
  if (req >= g_requests && req < g_requests + MAX_REQUESTS) {
    /* find the overall index of this request. pointer arithmetic like this
     * is somewhat peculiar to c/c++, you may want to read up on it. */
    ptrdiff_t index = req - g_requests;
    /* reducing a ptrdiff_t to an unsigned int isn't something you should
     * do without thinking about it first. but in our case, we're fine as
     * long as we don't allow more than 2 billion requests, not that our
     * computer could handle that many anyway */
    unsigned int u_index = (unsigned int)index;
    /* do some arithmetic to figure out which bit of which word we need to
     * turn off */
    unsigned int word = u_index >> 5; /* index / 32 */
    unsigned int bit  = u_index & 31; /* index % 32 */
    g_requests_used[word] &= ~(1U << bit);
  }
}
(Yes, yes, you could write index / 32 instead of index >> 5, and so on, and the compiler will optimize it for you. But it just doesn't feel right to me...)
You can go into quite some depth and get pretty creative in optimizing the allocator's search for a free slot. One simple idea is to start your search at the location of the last allocation, and wrap around when you get to the end.
This method has a lot of benefits, but one big drawback: the limit. You'll probably want to set the limit higher than the greatest amount of requests you ever expect having at once. But if you can do that, then you probably aren't bumping up against the limits of your system, so why are you here in the first place?!
If you don't like the static limit, you can batch allocations. Keep a linked list of memory "pools", each holding a certain fixed amount of request slots. Each individual pool is basically a static array as described above. If all existing pools are full, we malloc a new pool and add it to the linked list. A pool basically looks like this:
#define REQUESTS_PER_POOL 1024
typedef struct request_pool_s request_pool_t;
struct request_pool_s {
  request_t requests[REQUESTS_PER_POOL];
  unsigned int requests_used[REQUESTS_PER_POOL >> 5];
  request_pool_t *prev;
  request_pool_t *next;
};
You'll want to be able to free pools when traffic dies down, otherwise this would hardly be different from the static limit! Unfortunately, in that case, you will be dealing with fragmentation issues. You could find yourself with a lot of sparsely used pools, especially if it's possible that requests sometimes last a long amount of time. An entire pool can't be freed until every last slot on it is empty. You'd still be saving on overhead (considering the small size of individual requests), but dealing with the fragmentation could turn this from a small, elegant solution into more work than it's worth.
You could decrease the number of requests per pool to reduce the effect of fragmentation, but at this point we'd be losing the advantages of this method.
First of all, the main reasons you should be considering alternatives to individual mallocs at all are: the small size of your struct (32 bytes), the large amount of them, and the frequency at which they are created and destroyed.
Static arrays reduce overhead by a lot, but are hard to justify in this day and age. Unless your server is running on an Arduino.
Memory pools are the obvious direction for a problem like this, but they can require a fair bit of work to get working smoothly. If this is up your alley, then I say go for it.
Slab allocators are like complex memory pools that aren't limited to a certain single struct size. They'd be overkill for you since you have only the 32-byte requests, although perhaps you could find a third-party library that works for you.
Taking the easy route and simply malloc'ing each request is a slippery slope and may end up with you abandoning C entirely. ;)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With