I need to write a dot product using SSE2 (no _mm_dp_ps or _mm_hadd_ps):
#include <xmmintrin.h>
inline __m128 sse_dot4(__m128 a, __m128 b)
{
    const __m128 mult = _mm_mul_ps(a, b);
    const __m128 shuf1 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(0, 3, 2, 1));
    const __m128 shuf2 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(1, 0, 3, 2));
    const __m128 shuf3 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(2, 1, 0, 3));
    return _mm_add_ss(_mm_add_ss(_mm_add_ss(mult, shuf1), shuf2), shuf3);
}
but when I looked at the generated assembly with gcc 4.9 (experimental) at -O3, I got:
    mulps   %xmm1, %xmm0
    movaps  %xmm0, %xmm3         //These lines
    movaps  %xmm0, %xmm2         //have no use,
    movaps  %xmm0, %xmm1         //do they?
    shufps  $57, %xmm0, %xmm3
    shufps  $78, %xmm0, %xmm2
    shufps  $147, %xmm0, %xmm1
    addss   %xmm3, %xmm0
    addss   %xmm2, %xmm0
    addss   %xmm1, %xmm0
    ret
I am wondering why gcc copies xmm0 into xmm1, xmm2 and xmm3... Here is the code I get using the flag -march=native (it looks better):
    vmulps  %xmm1, %xmm0, %xmm1
    vshufps $78, %xmm1, %xmm1, %xmm2
    vshufps $57, %xmm1, %xmm1, %xmm3
    vshufps $147, %xmm1, %xmm1, %xmm0
    vaddss  %xmm3, %xmm1, %xmm1
    vaddss  %xmm2, %xmm1, %xmm1
    vaddss  %xmm0, %xmm1, %xmm0
    ret
Here's a dot product using only original SSE instructions, that also swizzles the result across each element:
inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(2, 3, 0, 1));
    v0 = _mm_add_ps(v0, v1);
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(0, 1, 2, 3));
    v0 = _mm_add_ps(v0, v1);
    return v0;
}
It's 5 SIMD instructions (as opposed to 7), though with no real opportunity to hide latencies. Any element will hold the result, e.g., float f = _mm_cvtss_f32(sse_dot4(a, b));
The haddps instruction has pretty awful latency. With SSE3:
#include <pmmintrin.h>  // SSE3, for _mm_hadd_ps

inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);
    v0 = _mm_hadd_ps(v0, v0);
    v0 = _mm_hadd_ps(v0, v0);
    return v0;
}
This is possibly slower, though it's only 3 SIMD instructions. If you can do more than one dot product at a time, you could interleave instructions in the first case. Shuffle is very fast on more recent micro-architectures.
The first listing you pasted is for plain SSE architectures only. Most SSE instructions support only the two-operand syntax: instructions are of the form a = a OP b.
In your code, a is mult. So if no copy were made and mult (xmm0 in your example) were passed directly, its value would be overwritten and then lost for the remaining _mm_shuffle_ps instructions.
By passing -march=native for the second listing, you enabled AVX instructions. AVX allows SSE instructions to use the three-operand syntax: c = a OP b. In this case, none of the source operands has to be overwritten, so you do not need the additional copies.