How to clear specific byte values inside a 64-bit value without looping

Suppose I have this value, with underscores for clarity:

0x12_23_45_09_45_11_23_10

I want to zero out all instances of 0x10, 0x23 and 0x45 inside it, making this number:

0x12_00_00_09_00_11_00_00

Is there a bit twiddling hack that can do this without looping over the bytes of the value?

I am aware of how to detect a byte equal to N, but I couldn't work out how to glue that detection together with byte-clearing for multiple target values.
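For reference, the detection I mean is the classic 64-bit SWAR test for "does this word contain a byte equal to n", something along these lines (my own helper name):

#include <stdint.h>

// Non-zero iff some byte of x equals n (64-bit version of the classic
// "determine if a word has a byte equal to n" bit hack).
static inline uint64_t has_byte_value(uint64_t x, uint8_t n)
{
    uint64_t v = x ^ (0x0101010101010101ULL * n);   // bytes equal to n become 0x00
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}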

Asked Nov 30 '25 by Jordi Reinsma

2 Answers

You can do it, but I doubt it will be faster than the looping method.

Here's how you could zero out one particular value (note this is not hand-optimized; there could be better ways):

#include <stdint.h>

uint64_t zero_out(uint64_t src, uint8_t val) {
    // 1. Replicate val in every byte
    uint64_t v = val;
    v |= v << 8;
    v |= v << 16;
    v |= v << 32;
    // 2. XOR out the value
    uint64_t dst = src ^ v;
    // 3. Turn every non-zero byte into 0x01 (bytes that matched val stay 0x00).
    dst |= dst >> 4; dst &= 0x0F0F0F0F0F0F0F0F;
    dst |= dst >> 2; dst &= 0x0303030303030303;
    dst |= dst >> 1; dst &= 0x0101010101010101;
    // 4. Multiply by 255, turning every 0x01 into 0xFF (no carries, since each byte is 0 or 1).
    dst *= 255;
    // 5. Mask: keep the bytes that didn't equal val, zero the ones that did.
    return src & dst;
}
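For the three target bytes in the question you'd simply call it once per value; a quick check with the asker's number (my addition):

uint64_t x = 0x1223450945112310ULL;
x = zero_out(x, 0x10);
x = zero_out(x, 0x23);
x = zero_out(x, 0x45);
// x is now 0x1200000900110000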

Answered Dec 02 '25 by n. 1.8e9-where's-my-share m.


SIMD byte compares can efficiently produce the masks you want. A SIMD compare (on all ISAs except SVE and AVX-512) produces a vector where the elements are all-0 or all-1 bits (so 0x00 or 0xff for a byte compare) according to the predicate being false or true, respectively. For example a compare-for-equal will make a vector that has 0xff bytes at the matches, all-zero elsewhere.

We just need to get the scalar integer into (the bottom of) a vector register and back out. On x86 and AArch64, that should be worth it, especially if you can keep the constants around across uses.

In GNU C with its ISA-independent vector extensions, we can write code that compiles nicely for x86 and AArch64, and hopefully for most other SIMD ISAs. At least with GCC; Clang manages to waste instructions, especially for x86.

#include <stdint.h>
#include <stddef.h>

#ifdef __GNUC__

// clang for x86-64 wastes multiple instructions!
// clang for AArch64 wastes 1
// GCC compiles nicely.
uint64_t zero_out_stuff (uint64_t x)
{
    typedef uint8_t v8u8 __attribute__((vector_size(8)));  // total vector size in bytes, of uint8_t elements

    const unsigned char stuff[] = { 0x10, 0x23, 0x45};

    v8u8 v8x = (v8u8)x;  // like memcpy
    v8u8 clearmask = {0};
    for(size_t i=0; i<sizeof(stuff); i++) {
        v8u8 cmp = (v8x == stuff[i]);  // implicit broadcast of scalar
        clearmask |= cmp;
        // don't update v8x until after all the compares, for better ILP
    }
    v8x &= ~clearmask;
    return (uint64_t)v8x;  // type-pun back to scalar
}
#endif

AArch64 natively supports 64-bit vectors. If you were doing this for more than one uint64_t at once, you'd want to define and use v2u64 vx = {x1, x2}; or something and cast it to v16u8.
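A rough sketch of that pair-at-a-time version (my expansion of the suggestion above; zero_out_stuff_pair is just an illustrative name, using the same GNU C vector extensions):

typedef uint64_t v2u64 __attribute__((vector_size(16)));
typedef uint8_t  v16u8 __attribute__((vector_size(16)));

// Hypothetical pair-at-a-time version: two uint64_t in one 128-bit vector.
void zero_out_stuff_pair (uint64_t *x1, uint64_t *x2)
{
    const unsigned char stuff[] = { 0x10, 0x23, 0x45 };

    v2u64 vx = { *x1, *x2 };
    v16u8 v16x = (v16u8)vx;
    v16u8 clearmask = {0};
    for(size_t i=0; i<sizeof(stuff); i++) {
        v16u8 cmp = (v16x == stuff[i]);  // implicit broadcast, as in the 64-bit version
        clearmask |= cmp;
    }
    v16x &= ~clearmask;
    vx = (v2u64)v16x;
    *x1 = vx[0];
    *x2 = vx[1];
}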

Clang will auto-vectorize the 64-bit function if you loop over an array, but not very efficiently: four 64-bit loads that it shuffles together, yet only one vector store. (With even more unnecessary shuffling than you'd expect from that description, because it's using vector constants with every other 64-bit chunk zeroed.) Here's how GCC compiles the single-value version (Godbolt):

# AArch64 GCC 15.2  -O3 
zero_out_stuff:
        movi    v0.8b, 0x10          // broadcast the constants
        movi    v29.8b, 0x23
        fmov    d31, x0              // copy incoming scalar arg to 8-byte vector
        movi    v30.8b, 0x45
        cmeq    v0.8b, v31.8b, v0.8b
        cmeq    v29.8b, v31.8b, v29.8b       // the compares: all-0 or all-1 bits within each byte element
        cmeq    v30.8b, v31.8b, v30.8b
        orr     v29.8b, v0.8b, v29.8b        // reduce to one mask
        orr     v30.8b, v29.8b, v30.8b
        bic     v31.8b, v31.8b, v30.8b       // apply the mask: x & ~mask
        umov    x0, v31.d[0]               // move back to scalar
        ret

Unfortunately, Clang for x86-64 really over-complicates this. (Except with -march=x86-64-v4 for AVX-512: then it does pretty well, although it could maybe be even better with a masked compare-into-mask using compare-not-equal, saving the two kand instructions.)
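For reference, that AVX-512 idea could look roughly like this with intrinsics (my sketch, assuming AVX-512BW+VL; the function name is made up and this is not output from any compiler here):

#include <immintrin.h>

uint64_t zero_out_stuff_avx512 (uint64_t x)
{
    __m128i v = _mm_cvtsi64_si128(x);
    // Each masked compare ANDs its result into the accumulated keep-mask,
    // so no separate kand instructions are needed (at the cost of a serial
    // dependency chain through the mask register).
    __mmask16 keep = _mm_cmpneq_epi8_mask(v, _mm_set1_epi8(0x10));
    keep = _mm_mask_cmpneq_epi8_mask(keep, v, _mm_set1_epi8(0x23));
    keep = _mm_mask_cmpneq_epi8_mask(keep, v, _mm_set1_epi8(0x45));
    return _mm_cvtsi128_si64(_mm_maskz_mov_epi8(keep, v));  // zero the bytes whose keep bit is clear
}

Anyway, here's the over-complicated x86-64-v3 output: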

# Clang 21.1 -O3 -march=x86-64-v3
zero_out_stuff:
        vmovq   xmm0, rdi
        vpcmpeqb        xmm1, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpcmpeqb        xmm2, xmm0, xmmword ptr [rip + .LCPI0_1]   # 16-byte constants from memory
        vpcmpeqb        xmm3, xmm0, xmmword ptr [rip + .LCPI0_2]
        vpor    xmm2, xmm2, xmm3
        vpor    xmm1, xmm1, xmm2       # reduce to one mask
        vpcmpeqd        xmm2, xmm2, xmm2   # all-ones
        vpxor   xmm1, xmm1, xmm2           # mask = ~mask
        vpsllw  xmm1, xmm1, 7              # 16-bit element, left-shift by 7
        vpand   xmm1, xmm1, xmmword ptr [rip + .LCPI0_3]   # set1(0x80)
        vpxor   xmm2, xmm2, xmm2           # all-zero
        vpcmpgtb        xmm1, xmm2, xmm1   # get back to bytes being all-0 or all-1 with signed compare against 0
        vpand   xmm0, xmm1, xmm0           # apply the mask
        vmovq   rax, xmm0
        ret

Much less goofy with Clang for AArch64, but still one wasted instruction:

# ARMv8 Clang 21  -O3
zero_out_stuff:
        movi    v0.8b, #35
        fmov    d1, x0
        movi    v2.8b, #16
        movi    v3.8b, #69
        cmeq    v0.8b, v1.8b, v0.8b
        cmeq    v2.8b, v1.8b, v2.8b
        cmeq    v3.8b, v1.8b, v3.8b
        mvn     v0.8b, v0.8b           // mask0 = ~mask1
        bic     v0.8b, v0.8b, v2.8b    // mask0 &= ~mask2
        bic     v0.8b, v0.8b, v3.8b    // mask0 &= ~mask3
        and     v0.8b, v0.8b, v1.8b    // apply the mask
        fmov    x0, d0                 // IDK if fmov is better than umov x0, v0.d[0]  like GCC does.  Hopefully not worse.
        ret

A worse version of the function applies each mask right away, creating one serial dep chain of compare then bitwise. So it has less instruction-level parallelism if compiled as written (which GCC and Clang do for x86-64 and AArch64).

The more verbose scalar -> vector -> scalar casting below compiles equivalently; I didn't realize the simpler version above was allowed until I tried it.

// Slower: one serial dependency chain instead of independent compares.
// Except that x86-64 Clang doesn't go nuts with this version.
uint64_t zero_out_stuff_dep_chain (uint64_t x)
{
    typedef uint64_t v1u64 __attribute__((vector_size(8)));
    typedef uint8_t v8u8 __attribute__((vector_size(8)));

    const unsigned char stuff[] = { 0x10, 0x23, 0x45};

    v1u64 vx = {x};    // vector of 1 scalar element
    v8u8 v8x = (v8u8)vx;  // or just memcpy scalar to v8u8
    for(size_t i=0; i<sizeof(stuff); i++) {
        v8u8 cmp = (v8x == stuff[i]);  // implicit broadcast of scalar
        v8x &= ~cmp;
    }
    vx = (v1u64)v8x;  // cast back to a vector of 1x u64
    return vx[0];     // take the low scalar element
}
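A quick sanity check with the value from the question (hypothetical test harness; assumes a GNU-compatible compiler so the functions above are available):

#include <assert.h>
#include <stdio.h>

int main(void)
{
    uint64_t x = 0x1223450945112310ULL;
    assert(zero_out_stuff(x)           == 0x1200000900110000ULL);
    assert(zero_out_stuff_dep_chain(x) == 0x1200000900110000ULL);
    printf("%016llx\n", (unsigned long long)zero_out_stuff(x));
    return 0;
}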
Answered Dec 02 '25 by Peter Cordes


