I need to write a dot product using SSE2 (no _mm_dp_ps or _mm_hadd_ps):
#include <xmmintrin.h>
inline __m128 sse_dot4(__m128 a, __m128 b)
{
    const __m128 mult = _mm_mul_ps(a, b);
    const __m128 shuf1 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(0, 3, 2, 1));
    const __m128 shuf2 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(1, 0, 3, 2));
    const __m128 shuf3 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(2, 1, 0, 3));
    return _mm_add_ss(_mm_add_ss(_mm_add_ss(mult, shuf1), shuf2), shuf3);
}
but when I looked at the generated assembly with gcc 4.9 (experimental) at -O3, I got:
    mulps   %xmm1, %xmm0
    movaps  %xmm0, %xmm3         //These lines
    movaps  %xmm0, %xmm2         //have no use,
    movaps  %xmm0, %xmm1         //do they?
    shufps  $57, %xmm0, %xmm3
    shufps  $78, %xmm0, %xmm2
    shufps  $147, %xmm0, %xmm1
    addss   %xmm3, %xmm0
    addss   %xmm2, %xmm0
    addss   %xmm1, %xmm0
    ret
I am wondering why gcc copies xmm0 into xmm1, xmm2 and xmm3... Here is the code I get using the flag -march=native (it looks better):
    vmulps  %xmm1, %xmm0, %xmm1
    vshufps $78, %xmm1, %xmm1, %xmm2
    vshufps $57, %xmm1, %xmm1, %xmm3
    vshufps $147, %xmm1, %xmm1, %xmm0
    vaddss  %xmm3, %xmm1, %xmm1
    vaddss  %xmm2, %xmm1, %xmm1
    vaddss  %xmm0, %xmm1, %xmm0
    ret
Here's a dot product using only original SSE instructions, that also swizzles the result across each element:
inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(2, 3, 0, 1));
    v0 = _mm_add_ps(v0, v1);
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(0, 1, 2, 3));
    v0 = _mm_add_ps(v0, v1);
    return v0;
}
It's 5 SIMD instructions (as opposed to 7), though with no real opportunity to hide latencies. Any element will hold the result, e.g., float f = _mm_cvtss_f32(sse_dot4(a, b));
The haddps instruction has pretty awful latency. With SSE3:
#include <pmmintrin.h>  // SSE3, for _mm_hadd_ps

inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);
    v0 = _mm_hadd_ps(v0, v0);
    v0 = _mm_hadd_ps(v0, v0);
    return v0;
}
This is possibly slower, though it's only 3 SIMD instructions. If you can do more than one dot product at a time, you could interleave instructions in the first case. Shuffle is very fast on more recent micro-architectures.
The first listing you pasted is for plain SSE architectures only. Most SSE instructions support only the two-operand syntax: instructions are of the form a = a OP b.
In your code, a is mult. So if no copy were made and mult (xmm0 in your example) were passed directly, its value would be overwritten and then lost for the remaining _mm_shuffle_ps instructions.
By passing -march=native for the second listing, you enabled AVX instructions. AVX allows SSE instructions to use the three-operand syntax: c = a OP b. In this case, none of the source operands has to be overwritten, so you do not need the additional copies.