Why are serializing instructions inherently pipeline-unfriendly?

Question

In this other answer [ Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs ], the following was stated:

Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.

I think it should be the opposite. Serialized instructions are very good for the pipeline. For example,

sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;

Assembly generated by g++ main.cpp -S:

addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax

This is much better for the pipeline than:

for( int i = 0; i < 7; i++ )
{
    sum = 5 * sum;
}

sum = sum + 5;

Assembly generated by g++ main.cpp -S:

    movl    $0, -4(%rbp)
    movl    $0, -8(%rbp)
.L3:
    cmpl    $6, -8(%rbp)
    jg  .L2
    movl    -4(%rbp), %edx
    movl    %edx, %eax
    sall    $2, %eax
    addl    %edx, %eax
    movl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    jmp .L3
.L2:
    addl    $5, -4(%rbp)
    movl    $0, %eax
    addq    $48, %rsp
    popq    %rbp

Because on each pass through the loop:

The check if( i < 7 ) has to be performed.
With branch prediction, we could assume that for the above loop the prediction fails the first time,
so the instruction sum = sum + 5 is discarded.
On the following iterations the pipeline executes sum = 5 * sum,
until the condition if( i < 7 ) fails.
Then the speculated sum = 5 * sum is discarded,
and sum = sum + 5 is finally processed.

fuz · Accepted Answer

You confused “serialized” with “serializing.” A serializing instruction is one that guarantees a data ordering, i.e. everything before this instruction happens before everything after this instruction.

This is bad news for superscalar and pipelined processors, which usually don't make this guarantee and have to make special accommodations for it, e.g. by flushing the pipeline or by waiting for all execution units to finish.

Incidentally, this is sometimes exactly what you want in a benchmark, as it forces the pipeline into a predictable state with all execution units ready to execute your code; no stale writes from before the benchmark can cause any performance deviations.
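To make the benchmarking point concrete, here is a minimal sketch of the CPUID / RDTSC pattern from the quoted answer. It assumes GCC or Clang on x86-64; the helper names serialize, now_ticks and measure_ticks are made up for illustration, and the non-x86 branch is only a placeholder so the sketch compiles elsewhere (it does not serialize anything).

```cpp
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <x86intrin.h>  // __rdtsc

// Hypothetical helper: CPUID is a serializing instruction, so all earlier
// instructions complete before any later instruction starts executing.
static inline void serialize() {
    unsigned eax = 0, ebx, ecx = 0, edx;
    asm volatile("cpuid"
                 : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                 :
                 : "memory");
    (void)ebx;
    (void)edx;
}

// Read the time-stamp counter with a serializing instruction on both sides.
static inline std::uint64_t now_ticks() {
    serialize();                  // wait for earlier work to finish
    std::uint64_t t = __rdtsc();  // read the time-stamp counter
    serialize();                  // keep later work from starting early
    return t;
}
#else
// Placeholder for other targets (assumption): an ordinary clock read in
// nanoseconds, with no serializing effect at all.
static inline std::uint64_t now_ticks() {
    auto t = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(t).count());
}
#endif

// Time one call of `work`, in TSC ticks on x86-64.
template <typename F>
std::uint64_t measure_ticks(F&& work) {
    std::uint64_t start = now_ticks();
    work();
    std::uint64_t end = now_ticks();
    return end - start;
}
```

Calling measure_ticks around every single iteration of a loop is what makes this pattern so pipeline-unfriendly: each call drains the pipeline several times, so the CPU can never overlap one iteration with the next.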

Tags: c++, optimization, serialization, x86, assembly