While running some tests with GCC's -O2 optimization level, I observed the following instruction in the disassembled code for a function:
data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
What does this instruction do?
To give more detail: I was trying to understand how the compiler optimizes useless recursion like the code below at -O2:
int foo(void)
{
   return foo();
}
int main (void)
{
   return foo();
}
The above code causes a stack overflow when compiled without optimization, but works when compiled with -O2.
I think that with -O2 the compiler completely removed the stack pushes for the call to foo, but why is the data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1) needed?
0000000000400480 <foo>:
foo():
400480:       eb fe                   jmp    400480 <foo>
400482:       66 66 66 66 66 2e 0f    data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
400489:       1f 84 00 00 00 00 00
0000000000400490 <main>:
main():
400490:       eb fe                   jmp    400490 <main>
What you are seeing is an optimization aimed at the CPU pipeline.
Although it is an empty loop, gcc tries to optimize this as well :-).
The CPU you are running on has a pipelined, superscalar architecture. This means that different phases of the execution of consecutive instructions happen in parallel. For example, if there is a
mov eax, ebx ;(#1)
mov ecx, edx ;(#2)
then the loading and decoding of instruction #2 can already happen while #1 is being executed.
Pipelining has major problems to solve when it comes to branches, even unconditional ones.
For example, while the jmp is being decoded, the next instruction has already been prefetched into the pipeline. But the jmp changes the location of the next instruction. In such cases, the pipeline needs to be emptied and refilled, and a lot of valuable CPU cycles are lost.
It appears that this empty loop runs faster if the bytes following the jmp are a no-op, even though that no-op is never executed. It is actually an optimization for a rather obscure feature of the x86 pipeline.
Early DEC Alphas could even segfault from such things, and empty loops had to have a lot of no-ops in them. x86 would only be slower. This is because x86 CPUs must remain compatible with the Intel 8086.
Here you can read a lot about the handling of branch instructions in pipelines.