I have a long byte array and I want to remove the lower nibble (the lower 4 bits) of every byte and move the rest together such that the result occupies half the space as the input.
For example, if my input is 057ABC23, my output should be 07B2.
My current approach looks like this:
// in is unsigned char*
size_t outIdx = 0;
for(size_t i = 0; i < input_length; i += 8)
{
in[outIdx++] = (in[i ] & 0xF0) | (in[i + 1] >> 4);
in[outIdx++] = (in[i + 2] & 0xF0) | (in[i + 3] >> 4);
in[outIdx++] = (in[i + 4] & 0xF0) | (in[i + 5] >> 4);
in[outIdx++] = (in[i + 6] & 0xF0) | (in[i + 7] >> 4);
}
... where I basically process 8 bytes of input in every loop, to illustrate that I can assume input_length to be divisible by 8 (even though it's probably not faster than processing only 2 bytes per loop). The operation is done in-place, overwriting the input array.
Is there a faster way to do this? For example, since I can read in 8 bytes at a time anyway, the operation could be done on 4-byte or 8-byte integers instead of individual bytes, but I cannot think of a way to do that. The compiler doesn't come up with something itself either, as I can see the output code still operates on bytes (-O3 seems to do some loop unrolling, but that's it).
I don't have control over the input, so I cannot store it differently to begin with.
There is a general technique for bit-fiddling to swap bits around. Suppose you have a 64-bit number, containing the following nibbles:
HxGxFxExDxCxBxAx
Here by x I denote a nibble whose value is unimportant (you want to delete it). The result of your bit-operation should be a 32-bit number HGFEDCBA.
First, delete all the x nibbles:
HxGxFxExDxCxBxAx & *_*_*_*_*_*_*_*_ = H_G_F_E_D_C_B_A_
Here I denote 0 by _, and binary 1111 by * for clarity.
Now, replicate your data:
H_G_F_E_D_C_B_A_ << 4 = _G_F_E_D_C_B_A__
H_G_F_E_D_C_B_A_ | _G_F_E_D_C_B_A__ = HGGFFEEDDCCBBAA_
Notice how some of your target nibbles are together. You need to retain these places, and delete duplicate data.
HGGFFEEDDCCBBAA_ & **__**__**__**__ = HG__FE__DC__BA__
From here, you can extract the result bytes directly, or do another iteration or two of the technique.
Next iteration:
HG__FE__DC__BA__ << 8 = __FE__DC__BA____
HG__FE__DC__BA__ | __FE__DC__BA____ = HGFEFEDCDCBABA__
HGFEFEDCDCBABA__ & ****____****____ = HGFE____DCBA____
Last iteration:
HGFE____DCBA____ << 16 = ____DCBA________
HGFE____DCBA____ | ____DCBA________ = HGFEDCBADCBA____
HGFEDCBADCBA____ >> 32 = ________HGFEDCBA
All x64-86 (and most x86) cpus have SSE2.
For each 16-bit lane do
t = (x & 0x00F0) | (x >> 12).
Then use the pack instruction to truncate each 16-bit lane to 8-bits.
For example, 0xABCD1234 would become 0x00CA0031 then the pack would make it 0xCA31.
#include <emmintrin.h>
void squish_32bytesTo16 (unsigned char* src, unsigned char* dst) {
const __m128i mask = _mm_set1_epi16(0x00F0);
__m128i src0 = _mm_loadu_si128((__m128i*)(void*)src);
__m128i src1 = _mm_loadu_si128((__m128i*)(void*)(src + sizeof(__m128i)));
__m128i t0 = _mm_or_si128(_mm_and_si128(src0, mask), _mm_srli_epi16(src0, 12));
__m128i t1 = _mm_or_si128(_mm_and_si128(src1, mask), _mm_srli_epi16(src1, 12));
_mm_storeu_si128((__m128i*)(void*)dst, _mm_packus_epi16(t0, t1));
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With