I am trying to use the SSE function _mm_load_si128. I am very new to SSE, so forgive me if I have made some silly mistakes somewhere.
Here is the code:
void one(__m128i *arr, char *temp)
{
    // SSE needs 16 byte alignment.
    _declspec (align(16)) __m128i *tmp = (__m128i*) temp;
    if (((uintptr_t)tmp & 15) == 0)
        printf("Aligned pointer");
    else 
        printf("%d", ((uintptr_t)tmp & 15)); // This prints as 12
    arr[0] = _mm_load_si128(tmp);
}
I get an error in Visual Studio:
0xC0000005: Access violation reading location 0xFFFFFFFF.
0xFFFFFFFF does not look right. What am I doing wrong?
The arr argument is initialized as __m128i arr[5] = { 0 }.
The alternative would be to use _mm_loadu_si128, which works fine. As I understand it, it should produce a movdqu instruction, but this is the assembly generated:
    arr[0] = _mm_loadu_si128(tmp);
00D347F1  mov         eax,dword ptr [tmp]  
00D347F4  movups      xmm0,xmmword ptr [eax]  
00D347F7  movaps      xmmword ptr [ebp-100h],xmm0  
00D347FE  mov         ecx,10h  
00D34803  imul        edx,ecx,0  
00D34806  add         edx,dword ptr [arr]  
00D34809  movups      xmm0,xmmword ptr [ebp-100h]  
00D34810  movups      xmmword ptr [edx],xmm0 
Thanks guys. From the answers I realize I have made a couple of mistakes:
Align the source; use _aligned_malloc.
Compile with optimizations.
Use C++ casts, not C casts.
I can see three problems here:
One, it's impossible to change the alignment of arr or temp. Let's focus on point number 2 for a second: there's a pointer, and there's what the pointer points to. I guess you already know the difference between these two.
Basically, when you write _declspec (align(16)) __m128i *tmp you tell the program:
When you allocate the pointer tmp on the stack, make sure that the first byte of tmp is allocated at an address (on the stack) which is divisible by 16.
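To see the distinction concretely, here is a minimal C++ sketch. The names is_aligned16, AlignmentReport, and inspect_offset_pointer are made up for illustration, and it uses the standard alignas in place of MSVC's _declspec (align(16)) so that it also builds outside Visual Studio. It deliberately points tmp 4 bytes into an aligned buffer, then reports the alignment of the pointer variable and of its target separately.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128i

// Returns true when the address p is a multiple of 16.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical report type: one flag per address being inspected.
struct AlignmentReport {
    bool pointer_object_aligned;  // address of the pointer variable itself
    bool pointee_aligned;         // address stored in the pointer
};

inline AlignmentReport inspect_offset_pointer() {
    alignas(16) char buffer[32] = {};
    // The alignment request applies to the variable tmp, not to its target.
    // The target is intentionally 4 bytes past a 16-byte boundary.
    alignas(16) const __m128i* tmp =
        reinterpret_cast<const __m128i*>(buffer + 4);
    return { is_aligned16(&tmp), is_aligned16(tmp) };
}
```

Under these assumptions the first flag is true and the second is false: the specifier moved the pointer variable, and the address it holds is unchanged.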
So great, tmp itself is aligned to 16, but that doesn't affect at all what tmp points to. You need temp to point to data that is already aligned. This can be done with:
the alignas keyword (alignas(16) char my_buffer[16*100];)
aligned_alloc, or MSVC's _aligned_malloc, which requires _aligned_free.
See "How to solve the 32-byte-alignment issue for AVX load/store operations?"
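For the heap case, a sketch along these lines might work. alloc16 and free16 are hypothetical helper names, not library functions. The sketch assumes C++17 for std::aligned_alloc on non-MSVC toolchains and switches to _aligned_malloc/_aligned_free when _MSC_VER is defined.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#if defined(_MSC_VER)
#include <malloc.h>  // _aligned_malloc, _aligned_free
#endif

// Hypothetical helper: allocate at least `bytes` bytes on a 16-byte boundary.
inline void* alloc16(std::size_t bytes) {
    // std::aligned_alloc expects a size that is a multiple of the alignment,
    // so round the request up to the next multiple of 16.
    const std::size_t rounded = (bytes + 15u) & ~static_cast<std::size_t>(15u);
#if defined(_MSC_VER)
    return _aligned_malloc(rounded, 16);
#else
    return std::aligned_alloc(16, rounded);
#endif
}

// Hypothetical helper: release memory obtained from alloc16.
inline void free16(void* p) {
#if defined(_MSC_VER)
    _aligned_free(p);  // memory from _aligned_malloc must not go to free()
#else
    std::free(p);
#endif
}
```

A buffer obtained this way can be passed as temp to a function that uses _mm_load_si128, as long as any offset applied to it is also a multiple of 16.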
You cannot align memory retroactively; it has to be allocated aligned in the first place. Make sure the data passed via temp is already aligned, or use unaligned loads/stores if you can't require callers to pass aligned data.
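As a rough illustration of the second option, the sketch below checks the address at run time and only uses the aligned load when it is safe. load16 and load16_round_trips are hypothetical names introduced here. The offset of 12 mirrors the misalignment printed in the question.

```cpp
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_loadu_si128, _mm_store_si128

// Loads 16 bytes from src, using the aligned load only when src permits it.
inline __m128i load16(const char* src) {
    const bool aligned = (reinterpret_cast<std::uintptr_t>(src) & 15u) == 0;
    const __m128i* p = reinterpret_cast<const __m128i*>(src);
    return aligned ? _mm_load_si128(p) : _mm_loadu_si128(p);
}

// Hypothetical self-check: load from an aligned and a misaligned address
// and confirm that the bytes read match the source in both cases.
inline bool load16_round_trips() {
    alignas(16) char buffer[48];
    for (int i = 0; i < 48; ++i) buffer[i] = static_cast<char>(i);

    alignas(16) char out[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), load16(buffer));
    if (std::memcmp(out, buffer, 16) != 0) return false;

    // buffer + 12 is not on a 16-byte boundary, so the unaligned path runs.
    _mm_store_si128(reinterpret_cast<__m128i*>(out), load16(buffer + 12));
    return std::memcmp(out, buffer + 12, 16) == 0;
}
```

The branch is shown only to make the two cases explicit. Calling _mm_loadu_si128 unconditionally is also correct, since the unaligned load accepts aligned addresses too.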