I have this assembly directive called .p2align that is being generated by gcc from the source of a C program.
As I understand it, aligned access is faster than unaligned access, and an asm program doesn't automatically align memory locations or optimize memory access, so you have to do this yourself.
I can't really read this .p2align 4,,15, especially the last part, the 15.
Skipping the fact that gcc apparently generates two commas instead of just one, as reported by many docs, what I get is that this piece of asm aligns memory in such a way that each location occupies 2^4 bits, which means 16 bits, so I think it's fair to say that a WORD is 16 bits long in this case.
Now what could the 15 possibly mean? It's a number of bits for what? Does the counting start from 0, so that the "real" quantity is 16 instead of 15?
EDIT:
I just compiled the same C source to both 32-bit and 64-bit asm code, and the memory is always aligned in exactly the same way, with the same directive .p2align 4,,15. Why is that?
The .p2align directive is documented here.
The first expression is the power-of-two byte alignment required, counted in bytes rather than bits. .p2align 4 pads to a 16-byte boundary, .p2align 5 to a 32-byte boundary, and so on.
The second expression is the value to be used as padding. In .p2align 4,,15 it is omitted, which is why two commas appear in a row. For x86, it's best to leave it out and let the assembler choose, since there is a range of instructions that are effective no-ops. In some alignment directives you'll see 0x90, which is the single-byte NOP instruction.
The final expression is the maximum number of bytes to use for padding: if reaching the alignment would require more than this, the directive is skipped and no padding is emitted. In this case, 4,,15, the limit has no effect, since 15 is the most padding that 16-byte alignment can ever require anyway.
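The three-expression rule above can be sketched as a little arithmetic in C. This is only an illustration of the behaviour described, not code from gas; the function name `p2align_padding` and its parameters are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of gas): number of padding bytes that
 * ".p2align log2_align,,max_skip" would emit when the location counter
 * is at addr. Returns 0 when addr is already aligned, and also when the
 * padding needed exceeds max_skip (the directive is then skipped). */
uint64_t p2align_padding(uint64_t addr, unsigned log2_align, uint64_t max_skip)
{
    assert(log2_align < 64);                   /* keep the shift well defined */
    uint64_t align = (uint64_t)1 << log2_align; /* 4 -> 16 bytes, 5 -> 32 bytes */
    uint64_t pad = (align - (addr & (align - 1))) & (align - 1);
    return pad <= max_skip ? pad : 0;
}
```

With an alignment of 16 bytes the padding is always between 0 and 15, so a limit of 15 never causes the directive to be skipped. A smaller limit such as 10 (chosen here purely for illustration) would skip it whenever 11 or more bytes were needed.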
The p2 part of the directive name possibly came from gas being the original implementation of the recommendation, for the Intel P-II CPU, to provide conditional alignment of loop body code. As Agner Fog explains, the original purpose was to ensure that the first instruction fetch gets enough code to begin decoding.
There is also an interaction with the Loop Stream Detector, which may fail to kick in if extra instruction cache line fragments are used at the top and bottom of the loop. Alignment is made conditional to avoid consuming more memory than necessary, and to avoid spending too much time in the case where the padding bytes are executed. gcc makes different alignment choices depending on the -mtune target setting.
There have been targets where two alignment directives are emitted, for example to get unconditional 8-byte alignment and conditional 32-byte alignment. The reason for choosing various nop patterns is to minimize the time taken when the padding stream is executed (when execution enters the loop from above). For example, a prefixed instruction which copies a register to itself can consume code bytes faster than single-byte nops. This makes no difference in the case originally alluded to in this thread. So part of the confusion may come from this alignment directive having features which aren't relevant to data alignment, although the directive is also used for that purpose.
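To make the two-directive case concrete, here is a rough sketch in C of how a conditional 32-byte alignment followed by an unconditional 8-byte alignment would move the location counter. The helper names are invented for the example, and the cap of 10 bytes on the conditional directive is an assumed value, not something stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one ".p2align log2_align,,max_skip": returns the
 * location counter after the directive. If more than max_skip padding
 * bytes would be needed, the directive is skipped and addr is unchanged. */
uint64_t apply_p2align(uint64_t addr, unsigned log2_align, uint64_t max_skip)
{
    assert(log2_align < 64);
    uint64_t align = (uint64_t)1 << log2_align;
    uint64_t pad = (align - (addr & (align - 1))) & (align - 1);
    return pad <= max_skip ? addr + pad : addr;
}

/* Two directives in a row, as in the example above. */
uint64_t align_loop_top(uint64_t addr)
{
    addr = apply_p2align(addr, 5, 10); /* conditional 32-byte; cap of 10 is assumed */
    addr = apply_p2align(addr, 3, 7);  /* 8-byte; 7 is the most it can need, so it always applies */
    return addr;
}
```

If the 32-byte boundary is close enough, the loop top lands on it; otherwise only the cheap 8-byte alignment is applied, which bounds the amount of padding that gets emitted and possibly executed.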