 

Why do bit manipulation intrinsics like _bextr_u64 often perform worse than simple shift and mask operations?

I'm working on performance-critical code and experimenting with different ways to extract bit fields from 64-bit integers. I expected that using intrinsics like _bextr_u64 (Bit Field Extract) would be faster or at least as fast as using manual bit manipulation with shifts and masks, like:

uint64_t value = (x >> start) & ((1ULL << len) - 1);

However, benchmarking shows that the manual version is often faster than using _bextr_u64. This seems counterintuitive since the intrinsic maps directly to a single x86 instruction (BEXTR), while the manual version requires multiple instructions.

For example, with this piece of code:

#define N 1000000

std::vector<uint_fast64_t> keys(N);
size_t countOfOnes=0;

...

for (size_t i = 0; i < N; ++i)
{
    countOfOnes += (keys[i] >> 16) & 1;
}

vs.

for (size_t i = 0; i < N; ++i)
{
    countOfOnes += _bextr_u64(keys[i], 16, 1);
}

My specs: Ryzen 7 4800H (Zen 2), g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

Why does this happen?

asked Dec 21 '25 by caquis

1 Answer

GCC and Clang won't auto-vectorize _bextr_u64, so you get what you asked for: scalar code which is about 4x slower than what you could be doing with AVX2, and about 3x slower than what GCC -O3 -march=znver2 actually does with AVX2. GCC doesn't unroll loops by default, so this bottlenecks at 1 iteration (1 element) per clock cycle, on the loop-branch throughput.

# main loop from your bextr loop with gcc -O3 -march=znver2
# this is 4 front-end uops, bottlenecked on loop-branch throughput of 1 iter / clock
.L17:
        bextr   rcx, QWORD PTR [rax], rdi
        add     rax, 8
        add     rdx, rcx
        cmp     rsi, rax
        jne     .L17

Compilers do auto-vectorize your plain-C++ loop because it's a very simple and common pattern with no dependencies between iterations except the += reduction. (Godbolt) (Thanks to @Homer512 for wrapping these loops in simple functions.)
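For reference, the plain-C++ loop wrapped as a simple function looks something like this (the wrapper name and the plain uint64_t element type are mine; the question used uint_fast64_t, which is the same type on x86-64 Linux):

#include <cstddef>
#include <cstdint>
#include <vector>

// Shift-and-mask version: a simple reduction with no cross-iteration
// dependency except the += sum, so compilers auto-vectorize it.
std::size_t count_bit16(const std::vector<std::uint64_t>& keys)
{
    std::size_t countOfOnes = 0;
    for (std::uint64_t key : keys)
        countOfOnes += (key >> 16) & 1;
    return countOfOnes;
}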

The vectorized plain-C++ loop is 6 uops for Zen 2. Zen 2's pipeline is 5 instructions / 6 uops wide, but none of these are 2-uop instructions so the bottleneck should be 5 uops per clock cycle best case. (https://agner.org/optimize/microarch.pdf). And it's not rare for short loops which aren't a multiple of the pipeline width to be a bit slower. 1.2 cycles per 32 bytes (4 elements) is still far faster than 1 cycle per element. (It's not optimal, though; see below for going another 2x as fast if data is hot in cache.)

# main loop from plain-C++ with gcc -O3 -march=znver2
.L4:
        vmovdqu ymm3, YMMWORD PTR [rax]
        add     rax, 32
        vpsrlq  ymm0, ymm3, 16
        vpand   ymm0, ymm0, ymm2
        vpaddq  ymm1, ymm1, ymm0
        cmp     rax, rcx
        jne     .L4

With -mno-avx (which also implies no AVX2), you say plain C++ is still 5% faster. That's odd; it should still be close to 2x faster on your Zen 2 with GCC 13 -O3 -march=znver2, unless there's a memory bottleneck, since your plain-C++ loop still auto-vectorizes with SSE2 psrlq / pand / paddq. GCC 13's loop comes to 6 uops (pretty much identical to the AVX2 loop but with 16-byte XMM vectors), so it should run at around 6/5 = 1.2 cycles per iteration (2 elements) on average, or maybe a bit worse. (All the instructions are single-uop, and have at worst 1/clock throughput in the back-end. Zen 2 has only one execution port that handles psrlq, so you can't go any faster than 1 vector per clock until we optimize the shift out of the loop.)

You can benchmark scalar code if you want, although for your actual match-counting problem you definitely want to be using AVX2. With -fno-tree-vectorize -O3 -march=znver2, GCC and Clang unfortunately fail to use bextr for their scalar loops even when tuning for Zen 2, but Clang has a very interesting trick up its sleeve: it avoids the shift by loading only the 16-bit word of the key that contains the bit: movzx esi, word ptr [rcx + 2] (zero-extending 16-bit load) / and esi, 1 instead of mov rsi, [rcx] / shr rsi, 16 / and esi, 1.
(GCC uses BMI2 shrx to load-and-shift as a single-uop instruction shrx r9, QWORD PTR [rax], rdi. It could have used a 32-bit load since the high bits are getting nuked anyway, but that doesn't matter.)
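In source terms, Clang's offset-load trick is roughly equivalent to the following sketch (hand-written for illustration, not something you'd normally write; memcpy keeps the narrow load well-defined, and the byte offset assumes a little-endian target):

#include <cstdint>
#include <cstring>

// Bit 16 of a little-endian uint64_t lives in the 16-bit word that starts at
// byte offset 2, so load just that word and mask, avoiding the shift.
inline unsigned bit16_via_narrow_load(const std::uint64_t* p)
{
    std::uint16_t w;
    std::memcpy(&w, reinterpret_cast<const char*>(p) + 2, sizeof(w));
    return w & 1u;
}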

When vectorizing, Clang doesn't use offset loads to avoid the shift; that would make them misaligned relative to the array. If your original array (actually std::vector) uses a 32-byte aligned allocator, no loads would cross cache-line boundaries, which is good. If not, with a normal std::vector, glibc for large allocations typically uses the first 16 bytes of a page for bookkeeping, and the rest for the array, so the memory is misaligned for 32-byte and wider on GNU/Linux. I think misalignment is a pretty minor slowdown on modern-ish CPUs like Zen 2 when data is coming from L3 cache or DRAM, and maybe ok even from L2. (https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#load-split-cache-lines)
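If you want to rule misalignment out, one option is a vector with an over-aligning allocator. A minimal C++17 sketch (the names are mine, not a standard facility):

#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Minimal allocator that returns 32-byte-aligned memory so 32-byte YMM loads
// never split a cache line. Illustrative sketch only.
template <class T>
struct aligned32_allocator
{
    using value_type = T;
    aligned32_allocator() = default;
    template <class U> aligned32_allocator(const aligned32_allocator<U>&) noexcept {}

    T* allocate(std::size_t n)
    {
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(32)));
    }
    void deallocate(T* p, std::size_t) noexcept
    {
        ::operator delete(p, std::align_val_t(32));
    }
};
template <class T, class U>
bool operator==(const aligned32_allocator<T>&, const aligned32_allocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const aligned32_allocator<T>&, const aligned32_allocator<U>&) { return false; }

using aligned_keys = std::vector<std::uint64_t, aligned32_allocator<std::uint64_t>>;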

Regardless, we can move the shift out of the loop so it becomes basically irrelevant whether we avoid it or not.

Making your loop faster:

If the max count is < 2^48, we can sink the shift out of the loop, just accumulating key & bit_we_want. So we just use the top 48 bits of a uint64_t as the counter. (If that's not the case, wrap an outer loop around it which accumulates 48-bit results into 64-bit results. 2^48 uint64_t elements would take 2048 TiB of RAM so you're unlikely to need this.)

std::size_t shift_sunk(const std::vector<uint64_t>& keys)
{
    std::size_t countOfOnes = 0;
    for(auto key: keys)
        countOfOnes += key & (1uL<<16);
    return countOfOnes >> 16;
}
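And if the total could exceed 2^48 after all, the outer loop mentioned above could look roughly like this (a sketch; any chunk size below 2^48 elements keeps the per-chunk sum in the top 48 bits from overflowing):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

std::size_t shift_sunk_big(const std::vector<uint64_t>& keys)
{
    // Each chunk's partial sum is (count << 16), so a chunk of 2^20 elements
    // stays far below 2^64.
    constexpr std::size_t chunk = std::size_t(1) << 20;
    std::size_t total = 0;
    for (std::size_t start = 0; start < keys.size(); start += chunk)
    {
        const std::size_t end = std::min(keys.size(), start + chunk);
        std::size_t partial = 0;
        for (std::size_t i = start; i < end; ++i)
            partial += keys[i] & (1uL << 16);   // accumulate in bits 16..63
        total += partial >> 16;
    }
    return total;
}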

Compiling shift_sunk with Clang, or with GCC with profile-guided optimization or with manual -funroll-loops, we get an unrolled loop that only costs 2 front-end uops per 32-byte vector. (The load can fold into a memory source operand for vpand, avoiding a separate vmovdqu instruction; see footnote 1.) The indexed addressing modes would make this less good for Intel CPUs like -march=icelake-client, but AMD doesn't have a problem.

# clang 20  -O3 -march=znver2
.LBB0_12:
        vpand   ymm6, ymm1, ymmword ptr [rdx + 8*rax + 32]
        vpand   ymm5, ymm1, ymmword ptr [rdx + 8*rax]
        vpand   ymm7, ymm1, ymmword ptr [rdx + 8*rax + 64]
        vpaddq  ymm2, ymm6, ymm2
        vpand   ymm6, ymm1, ymmword ptr [rdx + 8*rax + 96]
        vpaddq  ymm0, ymm5, ymm0
        vpaddq  ymm3, ymm7, ymm3
        add     rax, 16
        vpaddq  ymm4, ymm6, ymm4
        cmp     r8, rax
        jne     .LBB0_12       # }while(i < max)

        vpaddq  ymm0, ymm2, ymm0
        vpaddq  ymm0, ymm3, ymm0
        vpaddq  ymm0, ymm4, ymm0   # sum the four vector accumulators down to 1
        ... # then shuffle and add down to 1 scalar element

This bottlenecks on 2/clock vector load throughput (if data is hot in L1d cache, otherwise on waiting for data). Zen 2 can run vpand and vpaddq on any of its four vector ALU ports so they're not a bottleneck, unlike vpsrlq y,y, imm8 vector-shift which only has 1/clock throughput. https://uops.info/

GCC does something similar when unrolling, but without multiple accumulators so bottlenecks on 1/cycle vpaddq latency; silly compiler failed to use multiple vector accumulators. This is only a problem in this case if data is hot in L1d cache. (It's a much worse problem for floating-point instructions where latencies are higher; GCC's unrolling is often totally pointless with FP code.)

If data is coming from L2 cache or farther away, load bandwidth will average less than 1 vector per clock so GCC's code is fine, and the unroll will still let out-of-order exec see the loads sooner, and let the loop control run far ahead and recover from the branch mispredict when the loop branch is finally not-taken on the last iteration, while the back end is still chewing on vector operations. Also, with both logical cores of the physical core active, consuming fewer execution resources leaves more for the other hardware thread.
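If you'd rather not depend on the compiler's unrolling decisions, a hand-written AVX2-intrinsics version of shift_sunk with four accumulators looks something like this (a sketch: the function name is mine, it assumes n is a multiple of 16, and a scalar cleanup loop for the tail is left out):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Four independent vector accumulators hide vpaddq latency; each load folds
// into a memory-source vpand. Compile with -mavx2 (or -march=znver2).
std::size_t shift_sunk_avx2(const std::uint64_t* keys, std::size_t n)
{
    const __m256i mask = _mm256_set1_epi64x(1LL << 16);
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = acc0, acc2 = acc0, acc3 = acc0;
    for (std::size_t i = 0; i < n; i += 16)   // n assumed to be a multiple of 16
    {
        acc0 = _mm256_add_epi64(acc0, _mm256_and_si256(mask, _mm256_loadu_si256((const __m256i*)(keys + i))));
        acc1 = _mm256_add_epi64(acc1, _mm256_and_si256(mask, _mm256_loadu_si256((const __m256i*)(keys + i + 4))));
        acc2 = _mm256_add_epi64(acc2, _mm256_and_si256(mask, _mm256_loadu_si256((const __m256i*)(keys + i + 8))));
        acc3 = _mm256_add_epi64(acc3, _mm256_and_si256(mask, _mm256_loadu_si256((const __m256i*)(keys + i + 12))));
    }
    // Reduce the four accumulators, then horizontally sum 4 x uint64_t.
    __m256i acc = _mm256_add_epi64(_mm256_add_epi64(acc0, acc1), _mm256_add_epi64(acc2, acc3));
    __m128i sum128 = _mm_add_epi64(_mm256_castsi256_si128(acc), _mm256_extracti128_si256(acc, 1));
    std::uint64_t sum = (std::uint64_t)_mm_cvtsi128_si64(sum128)
                      + (std::uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(sum128, sum128));
    return sum >> 16;   // shift the 48-bit-scaled count back down
}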

Without unrolling, GCC -O3 -march=znver2 does a nice job:

# shift_sunk() with GCC -O3 -march=znver2
.L4:
        vpand   ymm1, ymm2, YMMWORD PTR [rax]
        add     rax, 32
        vpaddq  ymm0, ymm0, ymm1
        cmp     rax, rcx
        jne     .L4
   # then horizontal sum YMM0 to a single uint64_t, then right shift it ...

This loop is only 4 uops (cmp/jne will macro-fuse into a single uop), so it will run great even on older Intel CPUs like Skylake. (The non-indexed addressing-mode load will stay micro-fused with the vpand ALU uop, unlike Clang's loop, which would un-laminate into 2 uops at issue/rename on all Intel P-cores, as far as I know. I mention Skylake because its pipeline is 4 uops wide, not because there has been any change in how un-lamination of indexed addressing modes works with AVX instructions. You can't really do anything to get compilers to choose pointer-increment vs. indexed addressing modes in loops, except trial and error; it's a missed-optimization bug in GCC and Clang that they don't model how Intel CPUs work here.)

This loop is more than fine if your data isn't hot in L1d cache, and doesn't take nearly as much extra code to handle shorter or odd counts for cases where the compiler can't see the iteration count as a compile-time constant. (GCC especially emits a lot of code before/after the main unrolled loop, when unrolling.) There might be something to gain from unrolling with data hot in L2, but probably not much.

Footnote 1:

With vpsrlq as the first operation, only the AVX-512 version allows a memory source. The SSE version is destructive (one register operand plus the immediate) so always required a register, and the VEX-encoded AVX version with a separate destination chose not to extend it to allow a memory source for some reason. Maybe they didn't want to special-case decoding where the /r field is part of the opcode (as x86 does for many instructions with an immediate operand) while allowing the r/m operand to be reg-or-mem for VEX but reg-only for SSE. The EVEX version does use the same encoding, but does allow a memory (or broadcast) source.

A smarter compiler would have done vpand y, y, mem loads and then shifted, but that would require an optimization pass specific to that x86 quirk, and most transformations/optimizations happen in GIMPLE which is target-agnostic. Normally it is better to shift then mask in your source code, since 1 fits in an 8-bit immediate on x86-64, and fits in an immediate at all on ISAs like RISC-V, unlike 65536.
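Concretely, for a single bit the two source-level orderings are (both return the same value; the difference is only in how the constant encodes):

#include <cstdint>

// Shift then mask: the AND constant is 1, which fits in a sign-extended imm8
// on x86-64 and in a 12-bit immediate on ISAs like RISC-V.
std::uint64_t shift_then_mask(std::uint64_t x) { return (x >> 16) & 1; }

// Mask then shift: the AND constant is 65536 (0x10000), which needs a 32-bit
// immediate on x86-64 and an extra instruction to materialize on RISC-V.
std::uint64_t mask_then_shift(std::uint64_t x) { return (x & (1ULL << 16)) >> 16; }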


The title question, without vectorizable loops

bextr is a 2-uop instruction on Intel P-cores (https://uops.info/), so it runs internally the same as a shift and AND. It's good on AMD, though, decoding to a single uop. It's also single-uop on Intel E-cores like Gracemont (Alder Lake), but has surprisingly high latency there (4 cycles from data to result, 5 cycles from control to result).

With runtime-variable inputs it requires packing the control inputs into a single register, which makes it unavoidably bad if those aren't loop-invariant. (Older AMD used to support an immediate-control version; Intel never has.) It's pretty good on AMD if the compiler can set up the control register once outside the loop, for stuff that can't vectorize so you need the value extracted into a scalar register.
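To make the runtime-variable case concrete, here's a hedged sketch (function names are mine; compile with -mbmi or -march=znver2): with shift-and-mask you can hoist the loop-invariant mask yourself, while the bextr version needs the control word start | (len << 8) packed into a register, which the compiler can set up once outside the loop here because start and len are loop-invariant.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Shift+mask with the loop-invariant mask hoisted by hand (len assumed 1..63).
std::uint64_t sum_fields_shift(const std::uint64_t* keys, std::size_t n,
                               unsigned start, unsigned len)
{
    const std::uint64_t mask = (1ULL << len) - 1;
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += (keys[i] >> start) & mask;
    return sum;
}

// bextr version: the control word (start | len<<8) is loop-invariant, so the
// compiler can pack it into one register outside the loop; BEXTR itself is a
// single uop on Zen 2 but 2 uops on Intel P-cores.
std::uint64_t sum_fields_bextr(const std::uint64_t* keys, std::size_t n,
                               unsigned start, unsigned len)
{
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += _bextr_u64(keys[i], start, len);
    return sum;
}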

answered Dec 24 '25 by Peter Cordes