This other answer [ Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs ] states:
Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.
I think it should be the opposite: serialized instructions are very good for the pipeline. For example,
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
Assembly generated by g++ main.cpp -S:
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
I think this is much better for the pipeline than:
for( int i = 0; i < 7; i++ )
{
sum = 5 * sum;
}
sum = sum + 5;
Assembly generated by g++ main.cpp -S:
movl $0, -4(%rbp)
movl $0, -8(%rbp)
.L3:
cmpl $6, -8(%rbp)
jg .L2
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
addl $1, -8(%rbp)
jmp .L3
.L2:
addl $5, -4(%rbp)
movl $0, %eax
addq $48, %rsp
popq %rbp
Because each time the loop goes:
if( i < 7 )
sum = sum + 5 will be discarded.
sum = 5 * sum,
if( i < 7 ) fails,
sum = 5 * sum will be discarded
sum = sum + 5 will be finally processed.

You confused “serialized” with “serializing.” A serializing instruction is one that guarantees an ordering: everything before the instruction completes before anything after it begins.
This is bad news for super-scalar and pipelined processors, which usually don't make this guarantee and have to make special accommodations for it, e.g. by flushing the pipeline or by waiting for all execution units to finish.
Incidentally, this is sometimes exactly what you want in a benchmark, as it forces the pipeline into a predictable state with all execution units ready to execute your code; no stale writes from before the benchmark can cause performance deviations.