I'm trying to enhance the performance of my code by using the 256bit vector (Intel intrinsics - AVX).
I have an I7 Gen.4 (Haswell architecture) processor supporting SSE1 to SSE4.2 and AVX/AVX2 Extensions.
This is the code snippet that I'm trying to enhance:
/* code snipet */
kfac1 = kfac  + factor;  /* 7 cycles for 7 additions */
kfac2 = kfac1 + factor;
kfac3 = kfac2 + factor;
kfac4 = kfac3 + factor;
kfac5 = kfac4 + factor;
kfac6 = kfac5 + factor;
kfac7 = kfac6 + factor;
k1fac1 = k1fac  + factor1;  /* 7 cycles for 7 additions */
k1fac2 = k1fac1 + factor1;
k1fac3 = k1fac2 + factor1;
k1fac4 = k1fac3 + factor1;
k1fac5 = k1fac4 + factor1;
k1fac6 = k1fac5 + factor1;
k1fac7 = k1fac6 + factor1;
k2fac1 = k2fac  + factor2;  /* 7 cycles for 7 additions */
k2fac2 = k2fac1 + factor2;
k2fac3 = k2fac2 + factor2;
k2fac4 = k2fac3 + factor2;
k2fac5 = k2fac4 + factor2;
k2fac6 = k2fac5 + factor2;
k2fac7 = k2fac6 + factor2;
/* code snipet */
From the Intel Manuals, I found this.
an integer addition ADD takes 1 cycle (latency).
a vector of 8 integers (32 bit) takes 1 cycle also.
So I've tried ton make it this way:
fac  = _mm256_set1_epi32 (factor )
fac1 = _mm256_set1_epi32 (factor1)
fac2 = _mm256_set1_epi32 (factor2)
v1   = _mm256_set_epi32 (0,kfac6,kfac5,kfac4,kfac3,kfac2,kfac1,kfac)
v2   = _mm256_set_epi32 (0,k1fac6,k1fac5,k1fac4,k1fac3,k1fac2,k1fac1,k1fac)
v3   = _mm256_set_epi32 (0,k2fac6,k2fac5,k2fac4,k2fac3,k2fac2,k2fac1,k2fac)
res1 = _mm256_add_epi32 (v1,fac) ////////////////////
res2 = _mm256_add_epi32 (v2,fa1) // just 3 cycles  //
res3 = _mm256_add_epi32 (v3,fa2) ////////////////////
But the problem is that these factors are going to be used as tables indexes ( table[kfac] ... ). So i have to extract the factor as seperate integers again. I wonder if there is any possible way to do it??
A smart compiler could get table+factor into a register and use indexed addressing modes to get table+factor+k1fac6 as an address.  Check the asm, and if the compiler doesn't do this for you, try changing the source to hand-hold the compiler:
const int *tf = table + factor;
const int *tf2 = table + factor2;   // could be lea rdx, [rax+rcx*4]  or something.
...
foo = tf[kfac2];
bar = tf2[k2fac6];     // could be  mov r12, [rdx + rdi*4] 
But to answer the question you asked:
Latency isn't a big deal when you have that many independent adds happening.  The throughput of 4 scalar add instructions per clock on Haswell is much more relevant.
If k1fac2 and so on are already in contiguous memory, then using SIMD is possibly worth it.  Otherwise all the shuffling and data transfer to get them in/out of vector regs makes it definitely not worth it.  (i.e. the stuff compiler emits to implement _mm256_set_epi32 (0,kfac6,kfac5,kfac4,kfac3,kfac2,kfac1,kfac).
You could avoid needing to get the indices back into integer registers by using an AVX2 gather for the table loads. But gather is slow on Haswell, so probably not worth it. Maybe worth it on Broadwell.
On Skylake, gather is fast so it could be good if you can SIMD whatever you do with the LUT results. If you need to extract all the gather results back to separate integer registers, it's probably not worth it.
If you did need to extract 8x 32-bit integers from a __m256i into integer registers, you have three main choices of strategy:
pextrd (_mm_extract_epi32).  Use _mm256_extracti128_si256 to get the high lane into a separate __m128i.Depending on the surrounding code, any of these three could be optimal on Haswell.
pextrd r32, xmm, imm8 is 2 uops on Haswell, with one of them needing the shuffle unit on port5.  That's a lot of shuffle uops, so a pure ALU strategy is only going to be good if your code is bottlenecked on L1d cache throughput.  (Not the same thing as memory bandwidth).  movd r32, xmm is only 1 uop, and compilers do know to use that when compiling _mm_extract_epi32(vec, 0), but you can also write int foo = _mm_cvtsi128_si32(vec) to make it explicit and remind yourself that the bottom element can be accessed more efficiently.
Store/reload has good throughput.  Intel SnB-family CPUs including Haswell can run two loads per clock, and IIRC store-forwarding works from an aligned 32-byte store to any 4-byte element of it.  But make sure it's an aligned store, e.g. into _Alignas(32) int tmp[8], or into a union between an __m256i and an int array.  You could still store into the int array instead of the __m256i member to avoid union type-punning while still having the array aligned, but it's easiest to just use C++11 alignas or C11 _Alignas.
 _Alignas(32) int tmp[8];
 _mm256_store_si256((__m256i*)tmp, vec);
 ...
 foo2 = tmp[2];
However, the problem with store/reload is latency. Even the first result won't be ready for 6 cycles after the store-data is ready.
A mixed strategy gives you the best of both worlds: ALU to extract the first 2 or 3 elements lets execution get started on whatever code uses them, hiding the store-forwarding latency of the store/reload.
 _Alignas(32) int tmp[8];
 _mm256_store_si256((__m256i*)tmp, vec);
 __m128i lo = _mm256_castsi256_si128(vec);  // This is free, no instructions
 int foo0 = _mm_cvtsi128_si32(lo);
 int foo1 = _mm_extract_epi32(lo, 1);
 foo2 = tmp[2];
 // rest of foo3..foo7 also loaded from tmp[]
 // Then use foo0..foo7
You might find that it's optimal to do the first 4 elements with pextrd, in which case you only need to store/reload the upper lane.  Use vextracti128 [mem], ymm, 1:
_Alignas(16) int tmp[4];
_mm_store_si128((__m128i*)tmp,  _mm256_extracti128_si256(vec, 1));
// movd / pextrd for foo0..foo3
int foo4 = tmp[0];
...
With fewer larger elements (e.g. 64-bit integers), a pure ALU strategy is more attractive. 6-cycle vector-store / integer-reload latency is longer than it would take to get all of the results with ALU ops, but store/reload could still be good if there's a lot of instruction-level parallelism and you bottleneck on ALU throughput instead of latency.
With more smaller elements (8 or 16-bit), store/reload is definitely attractive.  Extracting the first 2 to 4 elements with ALU instructions is still good.  And maybe even vmovd r32, xmm and then picking that apart with integer shift/mask instructions is good.
Your cycle-counting for the vector version is also bogus.  The three _mm256_add_epi32 operations are independent, and Haswell can run two vpaddd instructions in parallel.  (Skylake can run all three in a single cycle, each with 1 cycle latency.)
Superscalar pipelined out-of-order execution means there's a big difference between latency and throughput, and keeping track of dependency chains matters a lot. See http://agner.org/optimize/, and other links in the x86 tag wiki for more optimization guides.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With