I have an 8-bit 640x480 image that I would like to shrink to a 320x240 image:
void reducebytwo(uint8_t *dst, uint8_t *src);
// src is 640x480, dst is 320x240
What would be the best way to do that using ARM SIMD NEON? Any sample code somewhere?
As a starting point, I simply would like to do the equivalent of:
for (int h = 0; h < 240; h++)
    for (int w = 0; w < 320; w++)
        dst[h * 320 + w] = (src[640 * h * 2 + w * 2] + src[640 * h * 2 + w * 2 + 1] + src[640 * h * 2 + 640 + w * 2] + src[640 * h * 2 + 640 + w * 2 + 1]) / 4; 
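For reference, the same 2x2 box average can be written as a small standalone function. This is only a sketch: the name reduce_by_two_ref and the src_w/src_h parameters are made up for illustration and are not part of the original question, and it assumes both dimensions are even.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the question): averages each 2x2 block of
   an 8-bit image. src is src_w x src_h, dst is (src_w/2) x (src_h/2). */
void reduce_by_two_ref(uint8_t *dst, const uint8_t *src,
                       size_t src_w, size_t src_h)
{
    assert(src_w % 2 == 0 && src_h % 2 == 0);  /* even dimensions only */
    size_t dst_w = src_w / 2;
    size_t dst_h = src_h / 2;
    for (size_t y = 0; y < dst_h; y++) {
        const uint8_t *top = src + (2 * y) * src_w;  /* upper source row */
        const uint8_t *bot = top + src_w;            /* lower source row */
        for (size_t x = 0; x < dst_w; x++) {
            /* The sum of four bytes is at most 1020, so it cannot overflow. */
            unsigned sum = top[2 * x] + top[2 * x + 1]
                         + bot[2 * x] + bot[2 * x + 1];
            dst[y * dst_w + x] = (uint8_t)(sum / 4);  /* truncating divide */
        }
    }
}
```

Calling reduce_by_two_ref(dst, src, 640, 480) should give the same result as the loop above, and the function is handy as a baseline when checking a SIMD version.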
This is a one-to-one translation of your code to ARM NEON intrinsics:
#include <arm_neon.h>
#include <stdint.h>
static void resize_line (uint8_t * __restrict src1, uint8_t * __restrict src2, uint8_t * __restrict dest)
{
  int i;
  for (i=0; i<640; i+=16)
  {
    // load upper line and add neighbor pixels:
    uint16x8_t a = vpaddlq_u8 (vld1q_u8 (src1));
    // load lower line and add neighbor pixels:
    uint16x8_t b = vpaddlq_u8 (vld1q_u8 (src2));
    // sum of upper and lower line: 
    uint16x8_t c = vaddq_u16 (a,b);
    // divide by 4, convert to char and store:
    vst1_u8 (dest, vshrn_n_u16 (c, 2));
    // move pointers to next chunk of data
    src1+=16;
    src2+=16;
    dest+=8;
  }
}
void resize_image (uint8_t * src, uint8_t * dest)
{
  int h;
  // process all 240 output rows
  for (h = 0; h < 240; h++)
  {
    resize_line (src+640*(h*2+0), 
                 src+640*(h*2+1), 
                 dest+320*h);
  }
}
It processes 32 source-pixels and generates 8 output pixels per iteration.
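To make the per-iteration arithmetic easier to follow, here is a plain C model of what one pass through the loop body computes. It is a sketch only: resize_chunk_model is a made-up name, and the comments describe my understanding of what each intrinsic does lane by lane.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one loop iteration of resize_line:
   reads 16 bytes from each source row and writes 8 output bytes. */
void resize_chunk_model(const uint8_t *src1, const uint8_t *src2,
                        uint8_t *dest)
{
    assert(src1 && src2 && dest);
    for (int k = 0; k < 8; k++) {
        /* vpaddlq_u8: add adjacent byte pairs, widening to 16 bits */
        uint16_t a = (uint16_t)(src1[2 * k] + src1[2 * k + 1]);
        uint16_t b = (uint16_t)(src2[2 * k] + src2[2 * k + 1]);
        /* vaddq_u16: lane-wise 16-bit add (at most 4 * 255 = 1020) */
        uint16_t c = (uint16_t)(a + b);
        /* vshrn_n_u16(c, 2): shift right by 2, then narrow to 8 bits */
        dest[k] = (uint8_t)(c >> 2);
    }
}
```

The shift by 2 truncates, which matches the integer division by 4 in the scalar loop from the question.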
I took a quick look at the generated assembly and it looks reasonable. You can probably get better performance by writing the resize_line function in assembly, unrolling the loop, and eliminating pipeline stalls. My rough estimate is that this could make it up to about three times faster.
Even without hand-written assembly, though, it should be a lot faster than your scalar implementation.
Note: I haven't tested the code...
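Since the code is untested, one way to check it is to compare its output against the scalar formula from the question on a pseudo-random image. The sketch below is portable C so the checker itself can be tried anywhere. All names in it (count_mismatches, resize_image_scalar, resize_image_short) are hypothetical, and the LCG test pattern is an arbitrary choice.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { SRC_W = 640, SRC_H = 480, DST_W = 320, DST_H = 240 };

typedef void (*resize_fn)(uint8_t *src, uint8_t *dest);

/* Expected output pixel (x, y), per the scalar formula in the question. */
static uint8_t expected_pixel(const uint8_t *src, int x, int y)
{
    const uint8_t *p = src + SRC_W * (2 * y) + 2 * x;
    return (uint8_t)((p[0] + p[1] + p[SRC_W] + p[SRC_W + 1]) / 4);
}

/* Scalar reference with the same signature as resize_image. */
void resize_image_scalar(uint8_t *src, uint8_t *dest)
{
    for (int y = 0; y < DST_H; y++)
        for (int x = 0; x < DST_W; x++)
            dest[y * DST_W + x] = expected_pixel(src, x, y);
}

/* Deliberately broken variant that never writes the last output row. */
void resize_image_short(uint8_t *src, uint8_t *dest)
{
    for (int y = 0; y < DST_H - 1; y++)
        for (int x = 0; x < DST_W; x++)
            dest[y * DST_W + x] = expected_pixel(src, x, y);
}

/* Returns the number of wrong output pixels, or -1 if allocation fails. */
long count_mismatches(resize_fn fn)
{
    assert(fn != NULL);
    uint8_t *src = malloc((size_t)SRC_W * SRC_H);
    uint8_t *out = malloc((size_t)DST_W * DST_H);
    long bad = -1;
    if (src != NULL && out != NULL) {
        uint32_t state = 12345u;  /* arbitrary seed for a simple LCG */
        for (int i = 0; i < SRC_W * SRC_H; i++) {
            state = state * 1664525u + 1013904223u;
            src[i] = (uint8_t)(state >> 24);
        }
        /* Pre-fill every output pixel with the complement of its expected
           value, so a pixel that is never written always counts as wrong. */
        for (int y = 0; y < DST_H; y++)
            for (int x = 0; x < DST_W; x++)
                out[y * DST_W + x] = (uint8_t)~expected_pixel(src, x, y);
        fn(src, out);
        bad = 0;
        for (int y = 0; y < DST_H; y++)
            for (int x = 0; x < DST_W; x++)
                if (out[y * DST_W + x] != expected_pixel(src, x, y))
                    bad++;
    }
    free(src);
    free(out);
    return bad;
}
```

On an ARM target, count_mismatches(resize_image) should return 0 if the NEON version agrees with the scalar formula. The broken variant shows the kind of boundary error the checker catches: it leaves the 320 pixels of the last output row unwritten.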