I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to work on Clang. Here's the function that I'm having trouble with:
float vectorGetByIndex( __m128 V, unsigned int i )
{
    assert( i <= 3 );
    return V.m128_f32[i];
}
The error I get is as follows:
Member reference has base type '__m128' is not a structure or union.
I've looked around and found that Clang (and maybe GCC) has a problem with treating __m128 as a struct or union. However I haven't managed to find a straight answer as to how I can get these values back. I've tried using the subscript operator and couldn't do that, and I've glanced around the huge list of SSE intrinsics functions and haven't yet found an appropriate one.
As a modification to hirschhornsalz's solution, if i is a compile-time constant, you could avoid the union path entirely by using a shuffle:
template<unsigned i>
float vectorGetByIndex( __m128 V)
{
    // shuffle V so that the element that you want is moved to the least-
    // significant element of the vector (V[0])
    V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
    // return the value in V[0]
    return _mm_cvtss_f32(V);
}
A scalar float is just the bottom element of an XMM register, and the upper elements are allowed to be non-zero; _mm_cvtss_f32 is free and will compile to zero instructions.  This will inline as just a shufps (or nothing for i==0).
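For illustration, here is roughly how the template gets called. This is only a sketch: sumLanes is a hypothetical helper I made up for the example, and the static_assert is an extra guard that is not part of the code above.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_shuffle_ps, _mm_cvtss_f32

template<unsigned i>
float vectorGetByIndex(__m128 V)
{
    // Added guard (assumption): reject out-of-range indices at compile time.
    static_assert(i < 4, "index must be in the range 0..3");
    // Broadcast element i to every lane, so it also lands in the low lane.
    V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
    // Read the low lane as a scalar float.
    return _mm_cvtss_f32(V);
}

// Hypothetical helper: adds up the four lanes using compile-time indices.
float sumLanes(__m128 v)
{
    return vectorGetByIndex<0>(v) + vectorGetByIndex<1>(v)
         + vectorGetByIndex<2>(v) + vectorGetByIndex<3>(v);
}
```

The index has to be written as a template argument, e.g. vectorGetByIndex<2>(v), so this does not help when the index is only known at run time.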
Compilers are smart enough to optimize away the shuffle for i==0 (except for long-obsolete ICC13) so no need for an if (i).  https://godbolt.org/z/K154Pe.  clang's shuffle optimizer will compile vectorGetByIndex<2> into movhlps xmm0, xmm0 which is 1 byte shorter than shufps and produces the same low element.  You could manually do this with switch/case for other compilers since i is a compile-time constant, but 1 byte of code size in the few places you use this while manually vectorizing is pretty trivial.
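If you did want to hand-pick the shorter instruction, the switch/case idea could look something like the sketch below. The name vectorGetByIndexSwitch is hypothetical, and the compiler's shuffle optimizer is still free to choose different instructions than the intrinsics suggest.

```cpp
#include <xmmintrin.h>  // SSE: _mm_movehl_ps, _mm_shuffle_ps, _mm_cvtss_f32

template<unsigned i>
float vectorGetByIndexSwitch(__m128 V)
{
    static_assert(i < 4, "index must be in the range 0..3");
    // i is a compile-time constant, so only one case survives optimization.
    switch (i) {
    case 0:
        // Element 0 is already the low element; no shuffle needed.
        break;
    case 2:
        // movhlps: copies the upper two floats over the lower two,
        // so element 2 ends up in the low lane.
        V = _mm_movehl_ps(V, V);
        break;
    default:
        // General case (i == 1 or i == 3): broadcast element i with shufps.
        V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
        break;
    }
    return _mm_cvtss_f32(V);
}
```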
Note that SSE4.1 _mm_extract_ps(V, i) (the intrinsic for extractps) is not a useful shuffle here: extractps r/m32, xmm, imm can only extract the FP bit-pattern to an integer register or memory (https://www.felixcloutier.com/x86/extractps).  (And the intrinsic returns it as an int, so it would actually compile to extractps + cvtsi2ss to do int->float conversion on the FP bit-pattern, unless you type-pun it in your C++ code.  But then you'd expect it to compile to extractps eax, xmm0, i / movd xmm0, eax which is terrible vs. shufps.)
The only case where extractps would be useful is if the compiler wanted to store this result straight to memory, and fold the store into the extract instruction.  (For i!=0, otherwise it would use movss).  To leave the result in an XMM register as a scalar float, shufps is good.
(SSE4.1 insertps would be usable but unnecessary: it makes it possible to zero other elements while taking an arbitrary source element.)
A union is probably the most portable way to do this:
#include <cassert>
#include <xmmintrin.h>

typedef union {
    __m128 v;    // SSE 4 x float vector
    float a[4];  // scalar array of 4 floats
} U;

float vectorGetByIndex(__m128 V, unsigned int i)
{
    U u;
    assert(i <= 3);
    u.v = V;
    return u.a[i];
}
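Unlike the shuffle template, the union version accepts an index that is only known at run time. A small sketch of that, where indexOfLargest is a hypothetical helper added purely for illustration. It relies on reading the inactive union member, which GCC, Clang and MSVC support in practice, although ISO C++ does not guarantee it.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: __m128

typedef union {
    __m128 v;    // SSE 4 x float vector
    float a[4];  // scalar array of 4 floats
} U;

float vectorGetByIndex(__m128 V, unsigned int i)
{
    assert(i <= 3);
    U u;
    u.v = V;        // write the vector member...
    return u.a[i];  // ...and read the same bytes back as floats
}

// Hypothetical helper: returns the index (0..3) of the largest lane,
// using a loop variable as the index.
unsigned int indexOfLargest(__m128 V)
{
    unsigned int best = 0;
    for (unsigned int i = 1; i < 4; ++i) {
        if (vectorGetByIndex(V, i) > vectorGetByIndex(V, best))
            best = i;
    }
    return best;
}
```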
Use
template<unsigned i>
float vectorGetByIndex( __m128 V) {
    union {
        __m128 v;    
        float a[4];  
    } converter;
    converter.v = V;
    return converter.a[i];
}
which will work regardless of the available instruction set.
Note: Even if SSE4.1 is available and i is a compile time constant, you can't use pextract etc. this way, because these instructions extract a 32-bit integer, not a float:
// broken code starts here
template<unsigned i>
float vectorGetByIndex( __m128 V) {
    return _mm_extract_epi32(V, i);
}
// broken code ends here
I'm leaving it in because it is a useful reminder of how not to do this.
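To see what goes wrong when the float's bits come back as an int, here is a small sketch. As an assumption to keep it buildable without -msse4.1, it uses the SSE2 _mm_cvtsi128_si32 path on element 0 as a stand-in for the SSE4.1 extract; both function names are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2: _mm_castps_si128, _mm_cvtsi128_si32
#include <cstring>      // std::memcpy

// Same mistake as the broken code: the float's bit pattern is read as an int
// and then converted numerically (int -> float) on return.
float lane0ConvertedAsInt(__m128 V)
{
    int bits = _mm_cvtsi128_si32(_mm_castps_si128(V));
    return static_cast<float>(bits);  // numeric conversion, not a reinterpretation
}

// Type-punned variant: copy the 32 bits back into a float unchanged.
float lane0TypePunned(__m128 V)
{
    int bits = _mm_cvtsi128_si32(_mm_castps_si128(V));
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

For a vector whose element 0 is 1.0f, the bit pattern is 0x3f800000, so the first function returns 1065353216.0f while the second returns 1.0f.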
The way I do it is
union vec { __m128 sse; float f[4]; };
float accessmember(__m128 v, int index)
{
    vec u;        // union used only to view the vector as four floats
    u.sse = v;
    return u.f[index];
}
Seems to work out pretty well for me.