I am trying to understand how vectorization with SSE instructions works.
Here is a code snippet where vectorization is achieved:
#include <stdlib.h>
#include <stdio.h>
#define SIZE 10000
void test1(double * restrict a, double * restrict b)
{
    int i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);
    for (i = 0; i < SIZE; i++)
    {
        x[i] += y[i];
    }
}
and my compilation command:
gcc -std=c99 -c example1.c -O3 -S -o example1.s
Here is the assembler output:
.file "example1.c"
.text
.p2align 4,,15
.globl test1
.type test1, @function
test1:
.LFB7:
.cfi_startproc
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movapd (%rdi,%rax), %xmm0
addpd (%rsi,%rax), %xmm0
movapd %xmm0, (%rdi,%rax)
addq $16, %rax
cmpq $80000, %rax
jne .L3
rep ret
.cfi_endproc
.LFE7:
.size test1, .-test1
.ident "GCC: (Debian 4.8.2-16) 4.8.2"
.section .note.GNU-stack,"",@progbits
I last wrote assembly many years ago, and I would like to know what the registers %rdi, %rax and %rsi represent in the code above.
I know %xmm0 is a SIMD register that can hold 2 doubles (16 bytes).
But I don't understand how the simultaneous addition is performed.
I think it all happens here:
movapd (%rdi,%rax), %xmm0
addpd (%rsi,%rax), %xmm0
movapd %xmm0, (%rdi,%rax)
addq $16, %rax
cmpq $80000, %rax
jne .L3
rep ret
Does %rax represent the "x" array?
What does %rsi represent in the C code snippet?
Is the final result (for example a[0] = a[0] + b[0]) stored into %rdi?
Thanks for your help
The first thing you need to know is the calling convention for 64-bit code on Unix systems. See Wikipedia's article on x86-64 calling conventions, and for much more detail read Agner Fog's calling conventions manual.
Integer and pointer parameters are passed in the following order: rdi, rsi, rdx, rcx, r8, r9. So you can pass up to six integer values by register (but only four on Windows). This means in your case that:
rdi = &x[0],
rsi = &y[0].
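The register rax starts at zero and is incremented by 2*sizeof(double) = 16 bytes each iteration, so it is a byte offset into both arrays, not an array itself. In each pass the first movapd loads two doubles from x (address rdi+rax) into xmm0, addpd adds the two doubles from y (address rsi+rax) to them in a single instruction, and the second movapd stores both sums back to memory at rdi+rax. The result therefore ends up in the memory that rdi points to, not in the register rdi. rax is then compared with sizeof(double)*10000 = 80000 to test whether the loop is finished.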
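To make the register roles concrete, here is a rough C sketch of the same loop written with SSE2 intrinsics. The function name add_pairs and its parameter n are hypothetical, introduced only for this illustration. The sketch assumes an x86 target with SSE2, 16-byte-aligned arrays, and an even element count. It is not a claim about what GCC does internally, only a hand-written equivalent of the three instructions in the loop body.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_load_pd, _mm_add_pd, _mm_store_pd */
#include <stddef.h>

/* Hypothetical helper mirroring the compiled loop.
   Assumes a and b are 16-byte aligned and n is even. */
void add_pairs(double *a, const double *b, size_t n)
{
    assert(n % 2 == 0);

    char *pa = (char *)a;              /* plays the role of %rdi */
    const char *pb = (const char *)b;  /* plays the role of %rsi */
    size_t end = n * sizeof(double);   /* 80000 when n == 10000 */

    /* off plays the role of %rax: a byte offset that advances by 16 */
    for (size_t off = 0; off != end; off += 16) {
        /* movapd (%rdi,%rax), %xmm0 : load two doubles from a */
        __m128d t = _mm_load_pd((const double *)(pa + off));
        /* addpd (%rsi,%rax), %xmm0 : add two doubles from b at once */
        t = _mm_add_pd(t, _mm_load_pd((const double *)(pb + off)));
        /* movapd %xmm0, (%rdi,%rax) : store both sums back into a */
        _mm_store_pd((double *)(pa + off), t);
    }
}
```

Calling add_pairs(a, b, 10000) on two aligned arrays should leave the same values in a as the scalar loop x[i] += y[i].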
The use of cmp here is actually an inefficiency in GCC's code generation. Modern Intel processors can fuse cmp and jne into one micro-operation, and they can also fuse add and jne into one micro-operation, but they cannot fuse add, cmp, and jne together. It is possible, however, to remove the cmp instruction.
What GCC should have done is set (offsets in bytes)
rdi = address of x[0] + 80000, that is &x[10000];
rsi = address of y[0] + 80000, that is &y[10000];
rax = -80000
and then the loop could be written like this:
.L3:
movapd (%rdi,%rax), %xmm0 # temp = x[i], x[i+1]
addpd (%rsi,%rax), %xmm0 # temp += y[i], y[i+1]
movapd %xmm0, (%rdi,%rax) # x[i], x[i+1] = temp
addq $16, %rax # i += 2 (16 bytes)
jnz .L3 # loop until rax reaches 0
Now the loop counts from -80000 up to 0 and does not need the cmp instruction: addq sets the zero flag when rax reaches 0, and the add and jnz can be fused into one micro-operation.
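The same idea can be sketched at the C level: point at the end of each array and run a negative index up to zero, so the loop condition is just "index is not zero". The function name add_counting_up_to_zero is hypothetical. Whether a given compiler turns this into the add/jnz form above is not guaranteed, so check the generated assembly before relying on it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: computes x[i] += y[i] for i in [0, n),
   but indexes from the end of the arrays with a negative offset. */
void add_counting_up_to_zero(double *x, const double *y, ptrdiff_t n)
{
    assert(n >= 0);

    double *x_end = x + n;        /* like rdi = &x[10000] */
    const double *y_end = y + n;  /* like rsi = &y[10000] */

    /* i runs from -n up to 0, like rax running from -80000 up to 0 */
    for (ptrdiff_t i = -n; i != 0; i++) {
        x_end[i] += y_end[i];     /* x_end[-n] is x[0] */
    }
}
```

The result is identical to the forward loop. Only the direction of the index and the termination test change.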