While doing some research about lock-free/wait-free algorithms, I stumbled upon the false sharing problem. Digging a bit more led me to Folly's source code (Facebook's C++ library) and more specifically to this header file and the definition of the FOLLY_ALIGN_TO_AVOID_FALSE_SHARING macro (currently at line 130). What surprised me the most at first glance was the value: 128 (i.e.: instead of 64)...
/// An attribute that will cause a variable or field to be aligned so that
/// it doesn't have false sharing with anything at a smaller memory address.
#define FOLLY_ALIGN_TO_AVOID_FALSE_SHARING __attribute__((__aligned__(128)))
AFAIK, cache blocks on modern CPUs are 64 bytes long and actually, every resources I found so far on the matter, including this article from Intel, talk about 64 bytes aligning and padding to help work around false sharing.
Still, the folks at Facebook align and pad their class members to 128 bytes when needed. Then I found out the beginning of an explanation just above FOLLY_ALIGN_TO_AVOID_FALSE_SHARING's definition:
enum {
    /// Memory locations on the same cache line are subject to false
    /// sharing, which is very bad for performance.  Microbenchmarks
    /// indicate that pairs of cache lines also see interference under
    /// heavy use of atomic operations (observed for atomic increment on
    /// Sandy Bridge).  See FOLLY_ALIGN_TO_AVOID_FALSE_SHARING
    kFalseSharingRange = 128
};
While it gives me a bit more details, I still feel I need some insights. I'm curious about how the sync of contiguous cache lines, or any RMW operation on them could interfere with each other under heavy use of atomic operations. Can someone please enlighten me on how this can even possibly happen?
As Hans pointed out in a comment, some info about this can be found in "Intel® 64 and IA-32 architectures optimization reference manual", in section 3.7.3 "Hardware Prefetching for Second-Level Cache", about the Intel Core microarchitecture:
"Streamer — Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line."
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With