I am trying to use AVX2 intrinsics with C++. I am using floats (__m256), so 8 floats fit in a register. But what happens if I have fewer than 8 floats, say 5? In that case, the upper 3 lanes hold garbage values.
float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[5] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
__m256 _a = _mm256_loadu_ps(a);
__m256 _b = _mm256_loadu_ps(b);
__m256 _c = _mm256_div_ps(_a, _b);
for (int i = 0; i < 8; ++i)
    cout << _c[i] << endl;
The result I get is shown in the screenshot below:

Is there any way I can set the last 3 numbers in the result to 0? I don't want to run a loop, since that would defeat the purpose of using AVX. Also, the number of floats (5 in this case) is variable.
I am new to AVX and would really like some help.
In the context of the larger problem: I read the arrays from a data stream, so I don't know their size beforehand and can't append 0s at the end of the arrays without running a loop.
float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[5] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
__m256 _a = _mm256_loadu_ps(a);
__m256 _b = _mm256_loadu_ps(b);
This is undefined behavior because you are reading beyond the array.
You can clear all the elements in _a and _b with _mm256_setzero_ps():
__m256 _a = _mm256_setzero_ps();
__m256 _b = _mm256_setzero_ps();
Loading 5 elements into a __m256 is a little trickier. If possible, declare the arrays with 8 elements; with a partial initializer list, C++ zero-initializes the remaining elements to 0.0f.
float a[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[8] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
If you can't declare the array with 8 elements, then I would probably try something like this with GCC and Clang:
__m256 _a = _mm256_setzero_ps(), _b = _mm256_setzero_ps();
memcpy(&_a, a, 5*sizeof(float));
memcpy(&_b, b, 5*sizeof(float));
You can also copy to an intermediate array and allow the compiler to optimize:
float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[5] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
float t[8] = {0.0f};
memcpy(t, a, 5*sizeof(float));
__m256 _a = _mm256_loadu_ps(t);
memcpy(t, b, 5*sizeof(float));
__m256 _b = _mm256_loadu_ps(t);
(Editor's note: this will likely compile to about the same asm as the memcpy into the __m256 object. With current compilers it actually copies to the stack, and the wide reload causes a store-forwarding stall.)
A final possibility is loading one full __m128, setting the fifth element in a second __m128, and then combining the two __m128 halves into a __m256. I don't have a lot of experience with it, but this may do what you want. I did not test it:
float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[5] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
__m256 _a = _mm256_set_m128 (_mm_load_ps1(a+4), _mm_loadu_ps(a+0)); // set_m128 takes (high, low)
__m256 _b = _mm256_set_m128 (_mm_load_ps1(b+4), _mm_loadu_ps(b+0));
The _mm_load_ps1 will broadcast the element it loads (a[4] or b[4]) into all four lanes of the high half. Those upper elements will not be 0, but they won't be random garbage either; when you carry out your calculation, treat them as "don't cares".
If you truly need the last three elements to be 0.0f, then this should do, though I believe it costs an extra instruction or two compared to plain _mm_load_ps1:
// x set to {5.0f, 0.0f, 0.0f, 0.0f}
__m128 x = _mm_insert_ps(_mm_setzero_ps(), _mm_load_ps1(a+4), 0);
(Simpler still: _mm_load_ss(a+4) loads a[4] into the low element and zeroes the upper three in a single instruction.)
The full statement for a would look like:
__m256 _a = _mm256_set_m128 (_mm_insert_ps(_mm_setzero_ps(), _mm_load_ps1(a+4), 0),
                             _mm_loadu_ps(a+0));
And before you exit your routine that processes the __m256 datatypes, you may need to call _mm256_zeroupper. See questions like Using AVX CPU instructions: Poor performance without “/arch:AVX” and Using xmm parameter in AVX intrinsics.
Regardless of what you decide, you should benchmark the performance of your application to see which is best for your program.
Also see the Intel Intrinsics Guide.