
Pack high bit of every byte in ARM, for 64 bytes like AVX512 vpmovb2m?

Tags: c, simd, arm, arm64, neon

__builtin_ia32_cvtb2mask512() is the GNU C builtin for vpmovb2m k, zmm.
The Intel intrinsic for it is _mm512_movepi8_mask.

It extracts the most-significant bit from each byte, producing an integer mask.

The SSE2 and AVX2 instructions pmovmskb and vpmovmskb (intrinsics _mm_movemask_epi8 and _mm256_movemask_epi8) do the same thing for 16-byte or 32-byte vectors, producing the mask in a GPR instead of an AVX-512 mask register.

  1. I would like an implementation for ARM that is faster than the scalar version below
  2. I would like an implementation for ARM NEON
  3. I would like an implementation for ARM SVE

I have attached a basic scalar implementation in C. For anyone implementing this on ARM: we care about the high bit, and each byte's high bit (in a 128-bit vector) can easily be shifted to the low bit with the ARM NEON intrinsic vshrq_n_u8(). I would prefer not to store the bitmap to memory; it should simply be the return value of the function, as in the following function.

#define _(n) __attribute((vector_size(1<<n),aligned(1)))
typedef char V  _(6); // 64 bytes, 512 bits
typedef unsigned long U;
#undef _
U generic_cvtb2mask512(V v) {
   U mask = 0;
   for (int i = 0; i < 64; i++) {
     // shift mask left by 1 and OR in the MSB of byte v[i]
     // (v[0] ends up in bit 63, v[63] in bit 0)
     mask = (mask << 1) | ((v[i] & 0x80) >> 7);
   }
   return mask;
}
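
As a baseline that needs no SIMD at all, here is a minimal sketch of a SWAR (multiply-and-shift) version that gathers 8 high bits per step. The names msb8_swar and msb64_swar are made up for illustration, and the code assumes a little-endian target. Note that it follows the vpmovb2m convention (byte i goes to bit i), whereas the scalar loop above produces the reverse order (v[0] in bit 63).

```c
#include <stdint.h>
#include <string.h>

/* Gather the MSB of each of 8 bytes into the low 8 bits (byte k -> bit k).
   Assumes little-endian byte order. Hypothetical helper name. */
static inline uint64_t msb8_swar(uint64_t x) {
    /* Keep only bit 7 of every byte. The multiplier has bits set at
       0, 7, 14, ..., 49, so the MSB of byte k is copied to bit 56 + k
       and no two partial products land on the same bit (no carries). */
    return ((x & 0x8080808080808080ULL) * 0x0002040810204081ULL) >> 56;
}

/* 64 input bytes -> 64-bit mask, bit i = MSB of p[i]. */
uint64_t msb64_swar(const unsigned char *p) {
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t chunk;
        memcpy(&chunk, p + 8 * i, sizeof chunk); /* unaligned-safe load */
        mask |= msb8_swar(chunk) << (8 * i);
    }
    return mask;
}
```

This may already be cheaper than the bit-at-a-time loop, and it is useful as a reference to check a NEON or SVE version against.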

This is one possible algorithm for 16 bytes (a 128-bit vector); it would just need to be put into a loop for 64 bytes (a 512-bit vector):

#define _(n) __attribute((vector_size(1<<n),aligned(1)))
typedef char g4 _(4); // 16 bytes, 128 bits
typedef char g3 _(3); // 8 bytes,   64 bits
typedef unsigned long U;
#undef _

unsigned short get_16msb(g4 v) {
  unsigned short ret = 0;

  // per byte, make every bit same as msb
  g4 msb = vdupq_n_u8(0x80);
  g4 filled = vceqq_u8(v, msb);

  // create a mask of each bit value
  g4 b = {0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01,
          0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01};

  // and vectors together
  g4 z = vandq_u8 (filled,b);

  // extract lower 8 bytes, hi 8 bytes
  g3 lo = vget_low_u8(z);
  g3 hi = vget_high_u8(z);

  // 'or' the 8 bytes of lo together ...
  // put in byte 1 of ret
  // 'or' the 8 bytes of hi together ...  
  // put in byte 2 of ret

  return ret;
}
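
For comparison, here is one way the outline above could be completed for all 64 bytes at once on AArch64. It is a sketch under assumptions, not a tuned implementation: the function name msb64_neon is hypothetical, the input is taken as a pointer rather than a vector type, the sign test uses vcltzq_s8 instead of the equality compare, and the bit weights are ordered so that byte i maps to bit i (as vpmovb2m does, the reverse of the table above). The final OR steps are replaced by three rounds of pairwise adds. A scalar fallback is included only so the snippet builds on other targets.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* 64 bytes -> 64-bit mask, bit i = MSB of p[i] (AArch64 Advanced SIMD). */
uint64_t msb64_neon(const uint8_t *p) {
    static const uint8_t weights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                        1, 2, 4, 8, 16, 32, 64, 128};
    const uint8x16_t w = vld1q_u8(weights);
    uint8x16_t t[4];
    for (int i = 0; i < 4; i++) {
        int8x16_t v = vreinterpretq_s8_u8(vld1q_u8(p + 16 * i));
        /* 0xFF where the byte is negative (MSB set), then keep one
           distinct bit per lane within each group of 8 lanes. */
        t[i] = vandq_u8(vcltzq_s8(v), w);
    }
    /* Pairwise adds: 64 lanes -> 32 -> 16 -> 8 bytes. The bits being
       added are distinct, so the adds behave like ORs and never carry. */
    uint8x16_t s0 = vpaddq_u8(t[0], t[1]);
    uint8x16_t s1 = vpaddq_u8(t[2], t[3]);
    s0 = vpaddq_u8(s0, s1);
    s0 = vpaddq_u8(s0, s0);
    return vgetq_lane_u64(vreinterpretq_u64_u8(s0), 0);
}
#else
/* Portable fallback so the sketch still builds on non-AArch64 hosts. */
uint64_t msb64_neon(const uint8_t *p) {
    uint64_t m = 0;
    for (int i = 0; i < 64; i++)
        m |= (uint64_t)(p[i] >> 7) << i;
    return m;
}
#endif
```

The mask is returned in a general-purpose register and never stored to memory, which matches the requirement stated in the question.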
AG1 asked Jan 18 '26 08:01


1 Answer

It is hard to optimise the generic case, because most of the best optimisations depend on the specifics, especially on what you want to do with the result.

E.g. the code for "checking whether any high bit is set" can be much cheaper than the code for "checking which high bit is set".

  // per byte, make every bit same as msb
  g4 msb = vdupq_n_u8(0x80);
  g4 filled = vceqq_u8(v, msb);

This won't make a difference in performance, but the intent is to check whether the sign bit is set, so just use vcltzq_s8(v). I.e. instead of v == 0x80 (which only matches bytes that are exactly 0x80), check whether the value is negative in a signed comparison.

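If you only care about whether any byte has the sign bit set, for Adv. SIMD you can apply a pairwise maximum to the result of the comparison (res = vpmaxq_u8(cmp, cmp); unsigned, because the comparison result is an all-ones/all-zeros mask and a signed maximum of -1 and 0 would give 0) and then just do:

if (vgetq_lane_u64 (vreinterpretq_u64_u8 (res), 0))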
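
Put together, the "is any high bit set" check for one 16-byte block might look like the sketch below. The function name any_msb_set16 is hypothetical. It uses the unsigned pairwise maximum because the comparison result is an unsigned 0xFF/0x00 mask per byte. The scalar branch is there only so the snippet builds on non-AArch64 hosts.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* True if any of the 16 bytes at p has its MSB set. */
bool any_msb_set16(const uint8_t *p) {
    int8x16_t v = vreinterpretq_s8_u8(vld1q_u8(p));
    uint8x16_t neg = vcltzq_s8(v);          /* 0xFF where byte < 0 */
    /* One pairwise max folds 16 lanes into the low 8 bytes. */
    uint8x16_t res = vpmaxq_u8(neg, neg);
    return vgetq_lane_u64(vreinterpretq_u64_u8(res), 0) != 0;
}
#else
/* Portable fallback, same result. */
bool any_msb_set16(const uint8_t *p) {
    uint8_t acc = 0;
    for (int i = 0; i < 16; i++)
        acc |= p[i];
    return (acc & 0x80) != 0;
}
#endif
```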

For SVE you don't need this, as the compare itself sets the flags. You can do a ptest on the predicate result of the compare and branch on that; the compiler should be able to remove the ptest during optimization.
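
A rough sketch of the SVE variant using ACLE intrinsics follows. The function name any_msb_set_sve and the length-based loop are assumptions for illustration; the SVE path is only compiled when the target enables SVE (for example with -march=armv8-a+sve), and a scalar fallback covers everything else.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

/* True if any of the n bytes at p is negative (MSB set). */
bool any_msb_set_sve(const int8_t *p, size_t n) {
    for (uint64_t i = 0; i < (uint64_t)n; i += svcntb()) {
        svbool_t pg = svwhilelt_b8_u64(i, (uint64_t)n); /* active lanes */
        svint8_t v = svld1_s8(pg, p + i);
        svbool_t neg = svcmplt_n_s8(pg, v, 0);          /* v < 0 per lane */
        /* Written as an explicit test; the compare already sets flags,
           so the compiler can usually branch on it directly. */
        if (svptest_any(pg, neg))
            return true;
    }
    return false;
}
#else
/* Portable fallback for targets without SVE. */
bool any_msb_set_sve(const int8_t *p, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (p[i] < 0)
            return true;
    return false;
}
#endif
```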

If you need to know which element has the bit set, there are various methods. As Peter Cordes says in the comments, for Adv. SIMD you can use an AND with a special mask followed by clz.
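
As an illustration of the mask-and-count idea, the sketch below finds the index of the first byte with its MSB set in a 16-byte block. The name first_msb_index16 is hypothetical, and __builtin_ctz assumes GCC or Clang. Because the bit weights here put byte 0 in bit 0, the code counts trailing zeros; on AArch64 that is typically lowered to rbit followed by clz. With the weights reversed you would count leading zeros instead.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Index (0..15) of the first byte with MSB set, or -1 if there is none. */
int first_msb_index16(const uint8_t *p) {
    static const uint8_t weights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                        1, 2, 4, 8, 16, 32, 64, 128};
    int8x16_t v = vreinterpretq_s8_u8(vld1q_u8(p));
    /* AND the 0xFF/0x00 compare result with one distinct bit per lane. */
    uint8x16_t bits = vandq_u8(vcltzq_s8(v), vld1q_u8(weights));
    /* Horizontal add of each half: distinct bits, so no overflow. */
    unsigned m = (unsigned)vaddv_u8(vget_low_u8(bits)) |
                 ((unsigned)vaddv_u8(vget_high_u8(bits)) << 8);
    return m ? __builtin_ctz(m) : -1;
}
#else
/* Portable fallback, same result. */
int first_msb_index16(const uint8_t *p) {
    for (int i = 0; i < 16; i++)
        if (p[i] & 0x80)
            return i;
    return -1;
}
#endif
```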

These patterns are common and are essentially strchr from the standard library. So for the best sequences I'd recommend checking what we have in Arm Optimized Routines, which we constantly update as we find better ways.

For Neon: https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/strchr.S does the above.

For SVE: https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/strchr-sve.S has some additional code, as strchr also needs to check for the null terminator, but the general idea is the same.

BenClark answered Jan 20 '26 22:01


