
Is there a faster algorithm for max(ctz(x), ctz(y))?

For min(ctz(x), ctz(y)), we can use ctz(x | y) to gain better performance. But what about max(ctz(x), ctz(y))?

ctz represents "count trailing zeros".

C++ version (Compiler Explorer)

#include <algorithm>
#include <bit>
#include <cstdint>

int32_t test2(uint64_t x, uint64_t y) {
    return std::max(std::countr_zero(x), std::countr_zero(y));
}

Rust version (Compiler Explorer)

pub fn test2(x: u64, y: u64) -> u32 {
    x.trailing_zeros().max(y.trailing_zeros())
}
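
For reference, the faster min version mentioned above looks like this in Rust (a sketch for comparison only; the function name min_ctz is just illustrative):

pub fn min_ctz(x: u64, y: u64) -> u32 {
    // min(ctz(x), ctz(y)) == ctz(x | y): the lowest set bit of x | y
    // sits at the lower of the two lowest set bits.
    (x | y).trailing_zeros()
}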

asked Sep 02 '25 by QuarticCat


1 Answer

I don't think there's anything better than the naive approach for the maximum. One attempt is using the identity

x + y = min(x, y) + max(x, y)

and thus

max(ctz(x), ctz(y)) = ctz(x) + ctz(y) - min(ctz(x), ctz(y))

This way, we can reduce the max function to the min function we already optimized, albeit with a few additional operations.
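
As a quick numeric check of the identity (values chosen purely for illustration): with x = 12 (ctz = 2) and y = 40 (ctz = 3), we get ctz(x) + ctz(y) - ctz(x | y) = 2 + 3 - 2 = 3, which is indeed max(ctz(x), ctz(y)). In Rust:

// 12 = 0b1100 (ctz = 2), 40 = 0b101000 (ctz = 3), 12 | 40 = 0b101100 (ctz = 2)
assert_eq!(
    12u64.trailing_zeros() + 40u64.trailing_zeros() - (12u64 | 40u64).trailing_zeros(),
    12u64.trailing_zeros().max(40u64.trailing_zeros())
);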

Here are some Rust implementations of the different approaches:

pub fn naive(x: u64, y: u64) -> u32 {
    x.trailing_zeros().max(y.trailing_zeros())
}

pub fn sum_minus_min(x: u64, y: u64) -> u32 {
    x.trailing_zeros() + y.trailing_zeros() - (x | y).trailing_zeros()
}

pub fn nielsen(x: u64, y: u64) -> u32 {
    // Isolate the lowest set bit of each input, keep the higher of the two
    // isolated bits, and count its trailing zeros.
    let x_lsb = x & x.wrapping_neg();
    let y_lsb = y & y.wrapping_neg();
    let xy_lsb = x_lsb | y_lsb;
    let lsb = xy_lsb & xy_lsb.wrapping_neg();
    let xy_max_lsb = if xy_lsb == lsb { lsb } else { xy_lsb ^ lsb };
    xy_max_lsb.trailing_zeros()
}

pub fn timmermans(x: u64, y: u64) -> u32 {
    // Mask out the trailing-zero bits of each input; the union of the two
    // masks has exactly max(ctz(x), ctz(y)) bits set.
    let loxs = !x & x.wrapping_sub(1);
    let loys = !y & y.wrapping_sub(1);
    (loxs | loys).count_ones()
}

pub fn kealey(x: u64, y: u64) -> u32 {
    // x | -x sets every bit at and above the lowest set bit of x, so the
    // intersection of the two masks starts at max(ctz(x), ctz(y)).
    ((x | x.wrapping_neg()) & (y | y.wrapping_neg())).trailing_zeros()
}
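
To check that the bit tricks match the straightforward version, a quick agreement test along these lines can be added (a sketch, not part of the benchmark; zero inputs are excluded because not all of the approaches agree with trailing_zeros() when an argument is 0):

#[test]
fn implementations_agree() {
    let samples: [u64; 8] = [1, 2, 3, 8, 12, 40, 1 << 63, u64::MAX];
    for &x in &samples {
        for &y in &samples {
            // The naive version serves as the reference result.
            let expected = naive(x, y);
            assert_eq!(sum_minus_min(x, y), expected);
            assert_eq!(nielsen(x, y), expected);
            assert_eq!(timmermans(x, y), expected);
            assert_eq!(kealey(x, y), expected);
        }
    }
}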

Results on my machine:

ctz_max/naive           time:   [279.09 ns 279.55 ns 280.10 ns]
ctz_max/sum_minus_min   time:   [738.91 ns 742.87 ns 748.61 ns]
ctz_max/nielsen         time:   [935.35 ns 937.63 ns 940.40 ns]
ctz_max/timmermans      time:   [803.39 ns 806.98 ns 810.76 ns]
ctz_max/kealey          time:   [295.03 ns 295.93 ns 297.03 ns]

The naive implementation beats all the others. The only one that comes close is the approach suggested by Martin Kealey. Note that the actual performance ratios between the implementations may be even larger than the timings indicate, due to the overhead of the test harness.

It's clear that there are only a couple of CPU instructions to spare for optimizing the naive implementation, so I don't think there is anything you can do. For reference, here is the assembly emitted by the Rust compiler when these implementations are compiled as standalone functions on a modern x86_64 processor:

example::naive:
        tzcnt   rcx, rdi
        tzcnt   rax, rsi
        cmp     ecx, eax
        cmova   eax, ecx
        ret

example::sum_minus_min:
        tzcnt   rcx, rdi
        tzcnt   rax, rsi
        add     eax, ecx
        or      rsi, rdi
        tzcnt   rcx, rsi
        sub     eax, ecx
        ret

example::nielsen:
        blsi    rax, rdi
        blsi    rcx, rsi
        or      rcx, rax
        blsi    rax, rcx
        xor     edx, edx
        cmp     rcx, rax
        cmovne  rdx, rcx
        xor     rdx, rax
        tzcnt   rax, rdx
        ret

example::timmermans:
        lea     rax, [rdi - 1]
        andn    rax, rdi, rax
        lea     rcx, [rsi - 1]
        andn    rcx, rsi, rcx
        or      rcx, rax
        xor     eax, eax
        popcnt  rax, rcx
        ret

example::kealey:
        mov     rax, rdi
        neg     rax
        or      rax, rdi
        mov     rcx, rsi
        neg     rcx
        or      rcx, rsi
        and     rcx, rax
        tzcnt   rax, rcx
        ret

In the benchmarks I ran, the functions get inlined, the loops are partially unrolled, and some subexpressions are hoisted out of the inner loops, so the assembly looks a lot less clean than the above.

For testing, I used Criterion. Here is the additional code:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

const NUMBERS: [u64; 32] = [
    ...
];

fn bench<F>(func: F)
where
    F: Fn(u64, u64) -> u32,
{
    for x in NUMBERS {
        for y in NUMBERS {
            black_box(func(x, y));
        }
    }
}

fn compare(c: &mut Criterion) {
    let mut group = c.benchmark_group("ctz_max");
    group.bench_function("naive", |b| b.iter(|| bench(naive)));
    group.bench_function("sum_minus_min", |b| b.iter(|| bench(sum_minus_min)));
    group.bench_function("nielsen", |b| b.iter(|| bench(nielsen)));
    group.bench_function("timmermans", |b| b.iter(|| bench(timmermans)));
    group.bench_function("kealey", |b| b.iter(|| bench(kealey)));
}

criterion_group!(benches, compare);
criterion_main!(benches);

NUMBERS was generated with this Python code, with the intention of making branch prediction as hard as possible:

import random

[
    random.randrange(2 ** 32) * 2 ** random.randrange(32)
    for dummy in range(32)
]
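
An equivalent generator in Rust, assuming the rand crate (0.8-style API), would look roughly like this:

use rand::Rng;

// A random 32-bit value shifted left by a random amount below 32, matching
// the Python snippet above.
fn gen_numbers() -> [u64; 32] {
    let mut rng = rand::thread_rng();
    std::array::from_fn(|_| (rng.gen::<u32>() as u64) << rng.gen_range(0..32))
}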

I'm running the benchmark using

RUSTFLAGS='-C target-cpu=native -C opt-level=3' cargo bench

on an 8th generation i7 processor (Whiskey Lake).

answered Sep 05 '25 by Sven Marnach