Why does Criterion produce inconsistent output with subsequent runs?

Let's say I have the following benchmark to test how log::info! impacts the performance of the code.

Side note: the log crate in Rust is only a facade; to produce any output it requires a compatible logger implementation. Since no implementation is present in the code below, I wanted to see whether the compiler could optimize the logging out.
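For background: with no logger installed, log::max_level() stays at LevelFilter::Off, so the macro body is skipped at runtime, but the level check itself reads a global atomic, which is what the optimizer would have to prove away. A simplified sketch of that check (not the exact macro expansion; the real macro also consults the compile-time STATIC_MAX_LEVEL filter first):

// Simplified sketch of the runtime check behind log::info!,
// assuming only the log crate as a dependency.
fn log_check_sketch(i: i32, len: usize) {
    // log::max_level() reads a global atomic; it is LevelFilter::Off
    // until a logger implementation calls log::set_max_level.
    if log::Level::Info <= log::max_level() {
        // The real macro builds a Record and dispatches it to
        // log::logger().log(...).
        println!("{}: {}", i, len);
    }
}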

Here is the code:

use std::collections::HashSet;

use criterion::{criterion_group, criterion_main, Criterion};

// Mixes HashSet lookups and inserts with some arithmetic; when LOG is
// true, each iteration additionally emits a log::info! record.
fn f<const LOG: bool>() -> usize {
    let mut x = 0;
    let mut hs = HashSet::<i32>::new();
    for i in 0..10000 {
        x += i;
        if hs.contains(&x) {
            x *= 2;
        } else {
            x *= 3;
        }
        x %= 1000;
        hs.insert(x);
        if LOG {
            log::info!("{}: {}", i, hs.len());
        }
    }

    hs.len()
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("without", |b| b.iter(|| f::<false>()));
    c.bench_function("with", |b| b.iter(|| f::<true>()));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Running

cargo clean
cargo bench

Produces

without                 time:   [265.71 µs 272.44 µs 280.77 µs]
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe

with                    time:   [276.52 µs 277.55 µs 278.80 µs]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

Then, running the benchmark again

cargo bench

Produces

without                 time:   [261.80 µs 263.03 µs 264.49 µs]
                        change: [-7.5920% -3.7254% +0.1192%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

with                    time:   [264.31 µs 265.26 µs 266.32 µs]
                        change: [-9.9053% -6.0486% -2.2028%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

Running the benchmark a third time

cargo bench

Produces

without                 time:   [251.02 µs 251.39 µs 251.83 µs]
                        change: [-7.6265% -4.5399% -1.4715%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

with                    time:   [251.56 µs 251.94 µs 252.38 µs]
                        change: [-8.3006% -5.3281% -2.1746%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe

Why does each subsequent run show a performance improvement? The code and the environment are identical.

asked Oct 19 '25 by Roy Varon

1 Answer

The Criterion.rs user guide answers this question fairly well.

Typically this happens because the benchmark environments aren't quite the same. There are a lot of factors that can influence benchmarks. Other processes might be using the CPU or memory. Battery-powered devices often have power-saving modes that clock down the CPU (and these sometimes appear in desktops as well). If your benchmarks are run inside a VM, there might be other VMs on the same physical machine competing for resources.

However, sometimes this happens even with no change. It's important to remember that Criterion.rs detects regressions and improvements statistically. There is always a chance that you randomly get unusually fast or slow samples, enough that Criterion.rs detects it as a change even though no change has occurred. In very large benchmark suites you might expect to see several of these spurious detections each time you run the benchmarks.

Unfortunately, this is a fundamental trade-off in statistics. In order to decrease the rate of false detections, you must also decrease the sensitivity to small changes. Conversely, to increase the sensitivity to small changes, you must also increase the chance of false detections. Criterion.rs has default settings that strike a generally-good balance between the two, but you can adjust the settings to suit your needs.

You can tune these settings per benchmark group, as described in the user guide's Advanced Configuration chapter. For example, the guide raises the significance level to detect smaller differences and increases the sample size to counteract the extra noise:

fn bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("sample-size-example");
    // Configure Criterion.rs to detect smaller differences and increase sample size to improve
    // precision and counteract the resulting noise.
    group.significance_level(0.1).sample_size(500);
    group.bench_function("my-function", |b| b.iter(|| my_function()));
    group.finish();
}
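If the spurious "performance has improved" verdicts between identical runs are the main concern, another knob worth knowing about is the noise threshold: measured changes smaller than the threshold are not flagged as improvements or regressions. A minimal sketch along the same lines as the example above (my_function is again a placeholder):

fn bench_with_noise_threshold(c: &mut Criterion) {
    let mut group = c.benchmark_group("noise-threshold-example");
    // Treat measured changes below 5% as noise rather than reporting
    // them as an improvement or regression (the default is 1%).
    group.noise_threshold(0.05);
    group.bench_function("my-function", |b| b.iter(|| my_function()));
    group.finish();
}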
answered Oct 20 '25 by E_net4 stands with Ukraine


