I am doing inner product of two columns of dimension in tens of thousands. The values can only be 0, 1, or 2. They can be hence stored as characters. If to vectorize the calculation on a CPU with avx flag, I expect it to be ~32 times faster. But the problem is the multiplication automatically converts the characters to integers, which are 4 bytes. Thus a maximal of 8 times only in speed can be obtained. Can a speed of 32 times be reached?
btw, I am using Linux (Fedora 22 to date) with g++ 5.1.
Assuming you have AVX2 (not just AVX, which is only really for floating point), then you can use the vpmaddubsw instruction, for which the intrinsic is:
__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)
This performs 8 bit x 8 bit multiply (signed x unsigned, but this doesn't matter for your case) and then adds pairs of adjacent terms to give a 16 bit result. [1] This effectively gives you 32 x 8 x 8 bit bit multiplication in one instruction.
If you don't have AVX2 then you can use the 128 bit SSE version (_mm_maddubs_epi16) to get 16 x x 8 x 8 bit bit multiplication in one instruction.
Note that horizontally summing the 16 bit terms can take several instructions, but since your input range is quite small then you only need to perform this horizontal summation relatively infrequently. One possible method (for SSE):
v = _mm_madd_epi16(v, _mm_set1_epi16(1));       // unpack/sum 16 -> 32
v = _mm_add_epi32(v, _mm_srli_si128(v, 8));     // shift and add 32 bit terms
v = _mm_add_epi32(v, _mm_srli_si128(v, 4));
sum = _mm_cvtsi128_si32(v);                     // extract sum as scalar
An AVX2 implementation of the above is left as an exercise for the reader.
It looks like the AVX instruction set does not have 8-bit multiplication, only addition. The Intel intrinsics guide does not contain any 8-bit operations starting with _mm_mul*. (edit: in fact there is an 8-bit multiply, but it has a misleading name - see answer by @PaulR)
However, there is an alternate approach. Since the only allowed values are 0, 1 and 2, and you are computing an inner product, you can use bit operations instead of multiplications.
In the first vector A, use the following encoding:
0 = 0b00000000 = 0x00
1 = 0b00010011 = 0x13
2 = 0b00001111 = 0x0F
In the second vector B, use the following encoding:
0 = 0b00000000 = 0x00
1 = 0b00011100 = 0x1C
2 = 0b00001111 = 0x0F
Now compute popcount(A & B). AND-ing will cause the respective 8-bit cells to have 0, 1, 2 or 4 bits set, and the popcount will add them together. You can pack one value per 5 bits of the integer, so if you can use 256-bit ints, you can get 51x higher throughput.
I guess that it is worth trying to do it via bit operations.
Suppose that all the numbers are either 0 or 1. Then you can pack both vectors into bit arrays. The inner product is then computed by:
for (int i = 0; i < N; i += 256)
  res += popcount(A[i..i+255] & B[i..i+255]);
The and operation is naturally present in AVX/AVX2. The hardest question is how to compute popcount for YMM registers quickly.
Now suppose that we are given 0, 1, and 2. For each vector A of integers compose two bit-vectors A1 and A2:
A1[i] = (A[i] >= 1);    
A2[i] = (A[i] >= 2);
Now we can notice that:
A[i] * B[i] = A1[i] * B1[i] + A1[i] * B2[i] + A2[i] * B1[i] + A2[i] * B2[i];
So we can compute inner product with the following pseudocode:
for (int i = 0; i < N; i += 256) {
  res += popcount(A1[i..i+255] & B1[i..i+255]);
  res += popcount(A2[i..i+255] & B1[i..i+255]);
  res += popcount(A1[i..i+255] & B2[i..i+255]);
  res += popcount(A2[i..i+255] & B2[i..i+255]);
}
This allows to process 256 elements per iteration, but each iteration becomes 4 times slower. Efficiently, it is 64 elements per operation. Since popcount is likely to be the slowest part of computation, we can say that it takes N/64 popcount_256 operations to compute the inner product.
EDIT: I decided to add an small example for the idea:
A = {01212012210};  //input array A
B = {21221100120};  //input array B
A1 = {01111011110};  //A should be stored in two halves like this
A2 = {00101001100};
B1 = {11111100110};  //B is stored in similar two halves
B2 = {10110000010};
A1 & B1 = {01111000110}, popcount = 6;  //computing pairwise and-s + popcounts
A1 & B2 = {00110000010}, popcount = 3;
A2 & B1 = {00101000100}, popcount = 3;
A2 & B2 = {00100000000}, popcount = 1;
res = 6 + 3 + 3 + 1 = 13   //summing all the popcounts
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With