I want to scale down images as fast as I can in c++. This article describes how to efficiently average 32bit rgb images down 50%. It is fast and looks good.
I have tried modifying that approach using sse intrinsics. The code below works, with or without SSE enabled. Surprisingly, though, the speedup is negligible.
Can anybody see a way of improving the SSE code. The two lines creating vars shuffle1 and shuffle2 seems two be candidates(using some clever shifting or similar).
/*
 * Calculates the average of two rgb32 pixels.
 */
inline static uint32_t avg(uint32_t a, uint32_t b)
{
    return (((a^b) & 0xfefefefeUL) >> 1) + (a&b);
}
/*
 * Calculates the average of four rgb32 pixels.
 */
inline static uint32_t avg(const uint32_t a[2], const uint32_t b[2])
{
    return avg(avg(a[0], a[1]), avg(b[0], b[1]));
}
/*
 * Calculates the average of two rows of rgb32 pixels.
 */
void average2Rows(const uint32_t* src_row1, const uint32_t* src_row2, uint32_t* dst_row, int w)
{
#if !defined(__SSE)
        for (int x = w; x; --x, dst_row++, src_row1 += 2, src_row2 += 2)
            * dst_row = avg(src_row1, src_row2);
#else
        for (int x = w; x; x-=4, dst_row+=4, src_row1 += 8, src_row2 += 8)
        {
            __m128i left  = _mm_avg_epu8(_mm_load_si128((__m128i const*)src_row1), _mm_load_si128((__m128i const*)src_row2));
            __m128i right = _mm_avg_epu8(_mm_load_si128((__m128i const*)(src_row1+4)), _mm_load_si128((__m128i const*)(src_row2+4)));
            __m128i shuffle1 = _mm_set_epi32( right.m128i_u32[2], right.m128i_u32[0], left.m128i_u32[2], left.m128i_u32[0]);
            __m128i shuffle2 = _mm_set_epi32( right.m128i_u32[3], right.m128i_u32[1], left.m128i_u32[3], left.m128i_u32[1]);
            _mm_store_si128((__m128i *)dst_row, _mm_avg_epu8(shuffle1, shuffle2));
        }
#endif
}
Transferring data between general purpose registers and SSE registers is really slow, so you should refrain from things like :
__m128i shuffle1 = _mm_set_epi32( right.m128i_u32[2], right.m128i_u32[0], left.m128i_u32[2], left.m128i_u32[0]);
__m128i shuffle2 = _mm_set_epi32( right.m128i_u32[3], right.m128i_u32[1], left.m128i_u32[3], left.m128i_u32[1]);
Shuffle the values in the SSE registers with the help of the according shuffle operations.
This should be what you are looking for :
__m128i t0 = _mm_unpacklo_epi32( left, right ); // right.m128i_u32[1] left.m128i_u32[1] right.m128i_u32[0] left.m128i_u32[0]
__m128i t1 = _mm_unpackhi_epi32( left, right ); // right.m128i_u32[3] left.m128i_u32[3] right.m128i_u32[2] left.m128i_u32[2]
__m128i shuffle1 = _mm_unpacklo_epi32( t0, t1 );    // right.m128i_u32[2] right.m128i_u32[0] left.m128i_u32[2] left.m128i_u32[0]
__m128i shuffle2 = _mm_unpackhi_epi32( t0, t1 );    // right.m128i_u32[3] right.m128i_u32[1] left.m128i_u32[3] left.m128i_u32[1]
The main problem is the use of _mm_set_epi32 to do your shuffles - unlike most SSE intrinsics, this does not map directly to a single SSE instruction - in cases such as this it generates a lot of scalar code under the hood, and causes data to be moved between memory, general purpose registers and SSE registers. Look at using appropriate SSE shuffle intrinsics for this instead.
A secondary problem is that you're doing very little computation relative to the number of loads and stores. This will tend to result in code that is bandwidth-limited rather than compute-bound and you may not see a significant performance improvement, even with ideal SSE code. Look at combining more operations in your loop so that you do more with your data while it's in cache.
If the SSE intrinsics make little/no difference then the code is probably limited by memory bandwidth.
There's a lot of loads and stores there in your code, (_mm_set_epi32 is a load as well as the obvious ones) for not much actual work.
If the loads/stores dominate the run time then no amount of fancy instructions can save you. On modern processors which are highly pipelined and re-order instructions it's probably doing a pretty good job of keeping the whole of the processor busy in the non SSE version of your code.
You can verify this is the case in a number of ways. The easiest is probably to measure what the actual throughput of your algorithm is compared to just the load/store speeds of your memory. You might also notice some difference by varying not just the implementation, but the size of the input, with sharp increases as the input exceeds the size of each level of processor cache.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With