
Why are serializing instructions inherently pipeline-unfriendly?

This other answer [ Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs ] states:

Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.

I think it should be the opposite: serialized instructions are very good for the pipeline. For example,

sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;

Assembly generated by g++ main.cpp -S:

addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax

This is much better for the pipeline than:

for( int i = 0; i < 7; i++ )
{
    sum = 5 * sum;
}

sum = sum + 5;

Assembly generated by g++ main.cpp -S:

    movl    $0, -4(%rbp)
    movl    $0, -8(%rbp)
.L3:
    cmpl    $6, -8(%rbp)
    jg  .L2
    movl    -4(%rbp), %edx
    movl    %edx, %eax
    sall    $2, %eax
    addl    %edx, %eax
    movl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    jmp .L3
.L2:
    addl    $5, -4(%rbp)
    movl    $0, %eax
    addq    $48, %rsp
    popq    %rbp

Because on every pass through the loop:

  1. The condition if( i < 7 ) has to be evaluated.
  2. With branch prediction, we can assume that the first prediction for the loop above fails.
  3. The instruction sum = sum + 5 is then discarded.
  4. On the following passes the pipeline executes sum = 5 * sum,
  5. until the condition if( i < 7 ) fails.
  6. Then sum = 5 * sum is discarded,
  7. and sum = sum + 5 is finally processed.
user asked Jan 18 '26 03:01



1 Answer

You have confused “serialized” with “serializing.” A serializing instruction is one that guarantees an ordering: everything before the instruction has completed before anything after it starts.

This is bad news for superscalar and pipelined processors, which normally don't make this guarantee and have to make special accommodations for it, e.g. by flushing the pipeline or by waiting for all execution units to finish.

Incidentally, this is sometimes exactly what you want in a benchmark, as it forces the pipeline into a predictable state with all execution units ready to execute your code; no stale writes from before the benchmark can cause performance deviations.
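As a rough illustration of that benchmarking use, here is a minimal sketch that times the question's sum = 5 * sum chain between two serializing points. It assumes GCC or Clang (extended inline asm) and an x86 target, where CPUID serves as the serializing instruction and RDTSC as the timestamp; on other targets it falls back to a compiler barrier and std::chrono so that it still compiles. The names serialize_pipeline, pin_value, read_timestamp, multiply_chain and time_multiply_chain are made up for this example and are not taken from the linked answer.

```cpp
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

// Hypothetical helper: execute a serializing instruction.
// On x86 this is CPUID; elsewhere it is only a compiler barrier (assumption).
inline void serialize_pipeline() {
#if SKETCH_X86
    std::uint32_t eax = 0, ebx = 0, ecx = 0, edx = 0;
    asm volatile("cpuid"
                 : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                 :
                 : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;
#else
    asm volatile("" ::: "memory");
#endif
}

// Hypothetical helper: make a value opaque to the optimizer so the
// computation that uses or produces it cannot drift across this point.
inline void pin_value(std::uint32_t& value) {
    asm volatile("" : "+r"(value) : : "memory");
}

// Read a timestamp: RDTSC on x86, steady_clock ticks elsewhere.
inline std::uint64_t read_timestamp() {
#if SKETCH_X86
    std::uint32_t lo = 0, hi = 0;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// The workload from the question: seven dependent multiplications by 5.
inline std::uint32_t multiply_chain(std::uint32_t sum) {
    for (int i = 0; i < 7; ++i) {
        sum = 5u * sum;
    }
    return sum;
}

struct Timed {
    std::uint32_t result;  // value computed by the workload
    std::uint64_t ticks;   // timestamp difference around it
};

inline Timed time_multiply_chain(std::uint32_t sum) {
    serialize_pipeline();               // earlier work has finished
    const std::uint64_t start = read_timestamp();
    pin_value(sum);                     // workload cannot start before this
    std::uint32_t result = multiply_chain(sum);
    pin_value(result);                  // workload must be done by here
    serialize_pipeline();               // wait for it before the second read
    const std::uint64_t stop = read_timestamp();
    return Timed{result, stop - start};
}
```

Calling time_multiply_chain(1) yields a result of 78125 (5 to the 7th power) plus a tick count. Note that in this sketch the tick count also includes the cost of the second CPUID, which is one concrete way the serializing instruction shows up as pipeline-unfriendly.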

fuz answered Jan 20 '26 17:01



