(Editor's note: this question was originally:   How should one access the m128i_i8 member, or members in general, of the __m128i object?, trying to use an MSVC-specific method on GCC's definition of __m128i.  But this was an XY problem and the accepted answer is about the XY problem here.  Another answer does answer this question.)
I realize that Microsoft suggests against directly accessing the members of these objects, but I need to set them and the documentation is sorely lacking.
I continue getting the error "request for member ‘m128i_i8’ in ‘(my var name)', which is of non-class type ‘wirelabel {aka __vector(2) long long int}’" which I don't understand because I've included all the correct headers and it does recognize __m128i variables.
Note1: wirelabel is a typedef for __m128i i.e. there exists in a header
typedef __m128i wirelabel 
Note2: The reason Note1 was used is explained in the following other question: tbb::cache_aligned_allocator: Getting "request for member...which is of non-class type" with __m128i. User error or bug?
Note3: I'm using the compiler g++
Note4: This following question doesn't answer mine but does discuss related information Why should you not access the __m128i fields directly?
I also know that there is a _mm_set_epi8 function but it requires you set all 8 bit sections at once and that is not an option for me currently.
Edit: I was asked for more specifics as to why I think I need to access each of the 16 8-bit parts of the __m128i object, and here is why: I have a bool array with size 'n*128' (n is a size_t) and I need to store these within an array of 'wirelabel' with size 'n'.
Now because wirelabel is just an alias/typedef (correct me if there is a difference) for __m128i, each of the 'n' indices of 128 bools can be stored in the 'wirelabel' array.
However, in order to do this I believe need to convert every 8-bits into it's signed equivalent and store it in the correct 8bit index in each 'wirelabel' pointer in the array.
So your source data is contiguous?  You should use _mm_load_si128 instead of messing around with scalar components of vector types.
Your real problem is packing an array of bool (1 byte per element in the ABI used by g++ on x86) into a bitmap.  You should do this with SIMD, not with scalar code to set 1 bit or byte at a time.
pmovmskb (_mm_movemask_epi8) is fantastic for extracting one bit per byte of input.  You just need to arrange to get the bit you want into the high bit.
The obvious choice would be a shift, but vector shift instructions compete for the same execution port as pmovmskb on Haswell (port 0). (http://agner.org/optimize/).  Instead, adding 0x7F will produce 0x80 (high bit set) for an input of 1, but 0x7F (high bit clear) for an input of 0.  (And a bool in the x86-64 System V ABI must be stored in memory as an integer 0 or 1, not simply 0 vs. any non-zero value).
Why not pcmpeqb against _mm_set1_epi8(1)?  Skylake runs pcmpeqb on ports 0/1, but paddb on all 3 vector ALU ports (0/1/5).  It's very common to use pmovmskb on the result of pcmpeqb/w/d/q, though.
#include <immintrin.h>
#include <stdint.h>
// n is the number of uint16_t dst elements
// We access n*16 bool elements from src.
void pack_bools(uint16_t *dst, const bool *src, size_t n)
{
     // you can later access dst with __m128i loads/stores
    __m128i carry_to_highbit = _mm_set1_epi8(0x7F);
    for (size_t i = 0 ; i < n ; i+=1) {
        __m128i boolvec = _mm_loadu_si128( (__m128i*)&src[i*16] );
        __m128i highbits = _mm_add_epi8(boolvec, carry_to_highbit);
        dst[i] = _mm_movemask_epi8(highbits);
    }
}
Because we want to use scalar stores when writing this bitmap, we want dst to be in uint16_t for strict-aliasing reasons.  With AVX2, you'd want uint32_t.  (Or if you did combine = tmp1 << 16 | tmp to combine two pmovmskb results.  But probably don't do that.)
This compiles into an asm loop like this (with gcc7.3 -O3, on the Godbolt compiler explorer)
.L3:
    movdqu  xmm0, XMMWORD PTR [rsi]
    add     rsi, 16
    add     rdi, 2
    paddb   xmm0, xmm1
    pmovmskb        eax, xmm0
    mov     WORD PTR [rdi-2], ax
    cmp     rdx, rsi
    jne     .L3
So it's not wonderful (7 fuse-domain uops -> front-end bottleneck at 16 bools per ~1.75 clock cycles). Clang unrolls by 2, and should manage 16 bools per 1.5 cycles.
Using a shift (pslld xmm0, 7) would only run at one iteration per 2 cycles on Haswell, bottlenecked on port 0.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With