I've got a very hot instruction loop which needs to be properly aligned on 32-byte boundaries to maximize the effectiveness of Intel's instruction fetcher.
This issue is specific to the not-too-old lines of Intel CPUs (from Sandy Bridge onward). Failing to align the beginning of the loop properly results in up to 20% speed loss, which is definitely too noticeable. The issue is pretty rare: one needs a highly optimized set of instructions for the instruction fetcher to become the bottleneck. But it's not a unique case. Here is a nice article explaining in detail how such a problem can be detected.
The problem is that neither gcc nor clang takes care to align this instruction loop properly. This makes compiling the code a nightmare, producing random outcomes depending on how "well" the hot loop happens to be aligned by chance. It also means that modifying a totally unrelated function can nonetheless have a large impact on the performance of the hot loop.
I have already tried several compiler flags; none of them gives a satisfying result.
[Edit] More detailed description of the compilation flags tried:
-falign-functions=32: no impact or negative impact.
-falign-jumps=32: no impact.
-falign-loops=32: works fine when the hot loop is isolated into a tiny piece of test code. But in a normal build the flag is applied across the entire source, and in that case it is detrimental: aligning all loops on 32 bytes is bad for performance. Only the very hot ones benefit from it.
__attribute__((optimize("align-loops=32"))) in the function declaration: doesn't produce any effect (identical binary generated, as if the statement wasn't there). Later confirmed by the gcc support team to be effectively ignored. Edit: @Jester indicates in a comment that the statement works with gcc 5+. Unfortunately, my dev station primarily uses gcc 4.8.4, and this is more a problem of portability, since I don't control the final compiler used in the build process.
Only building with PGO can reliably produce the expected performance, but PGO cannot be accepted as a solution, since this piece of code will be integrated into other programs using their own build chain.
So, I'm considering inline assembly. This would be specific to the x64 instruction set, so no portability is required.
If my understanding is correct, assemblers like NASM allow the use of statements such as ALIGN 32, which would force the next instruction to be aligned on a 32-byte boundary.
Since the target source code is in C, it would be necessary to embed this statement in the C source. For example, something like asm("ALIGN 32");
(which of course doesn't work).
I hope it's mostly a matter of knowing the right instruction to write, and not something deeper such as "it's impossible".
We can write assembly code inside a C program. In that case, all the assembly code must be placed inside an asm{} block. Let's see a simple assembly program that adds two numbers in a C program.
The __asm keyword invokes the inline assembler and can appear wherever a C or C++ statement is legal. It cannot appear by itself. It must be followed by an assembly instruction, a group of instructions enclosed in braces, or, at the very least, an empty pair of braces.
The asm statement allows you to include assembly instructions directly within C code. This may help you to maximize performance in time-sensitive code or to access assembly instructions that are not readily available to C programs. Note that extended asm statements must be inside a function.
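As an illustration of the extended asm form described above, here is a minimal sketch that adds two numbers, assuming GCC or clang on x86 and AT&T operand order. The function name add_two is hypothetical, and the plain C fallback is only there so the snippet still builds on other targets.

```c
/* Hypothetical helper: adds two ints using GCC-style extended asm on x86.
 * Assumption: GCC or clang with AT&T syntax (source operand first). */
int add_two(int a, int b)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* "+r"(a): a is read and written, kept in a register.
     * "r"(b):  b is an input in a register.
     * "cc":    the add instruction modifies the flags register. */
    __asm__ ("addl %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    /* Fallback for other compilers or architectures. */
    return a + b;
#endif
}
```

Calling add_two(2, 3) returns 5. Because the statement has operands, it is an extended asm statement and therefore has to live inside a function.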
In computer programming, an inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that otherwise has been compiled from a higher-level language such as C or Ada.
Similarly to NASM, the GNU assembler supports the .align pseudo-op for alignment:
asm volatile (".align 32");
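A minimal sketch of how such a directive could be placed in front of a hot loop in C, assuming GCC or clang with the GNU assembler syntax. The function sum_bytes and its loop body are hypothetical stand-ins for the real hot loop. The sketch uses .p2align 5 (2^5 = 32 bytes) rather than .align 32, because the meaning of the .align argument varies between targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hot loop: sums n bytes from buf. */
uint64_t sum_bytes(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);

    uint64_t total = 0;
    size_t i = 0;

    /* Ask the assembler to pad with NOPs up to the next 32-byte boundary
     * at this point in the instruction stream.
     * Caveat: the compiler may still emit loop setup code between this
     * statement and the actual loop top, so the loop head is not
     * guaranteed to land exactly on the boundary. */
    __asm__ volatile (".p2align 5");

    for (; i < n; i++) {
        total += buf[i];
    }
    return total;
}
```

Whether the loop head really ends up aligned should be checked in the disassembly (for example with objdump -d), since the optimizer is free to reorder or transform the loop around the asm statement.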
For a non-assembly solution, you could try to supply -falign-loops=32 and possibly -falign-functions=32, -falign-jumps=32 as needed.