Why does Criterion produce inconsistent output with subsequent runs?

Let's say I have the following benchmark to test how log::info! impacts the performance of the code.

Side note: the log crate in Rust is only a facade; to produce any output it requires a compatible logger implementation. Since no implementation is present in the code below, I wanted to see whether the compiler could optimize the logging out.
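For background: with no logger installed, log::max_level() stays at LevelFilter::Off, so the macro body is skipped at runtime, but the level check itself reads a global atomic, which is what the optimizer would have to prove away. A simplified sketch of that check (not the exact macro expansion; the real macro also consults the compile-time STATIC_MAX_LEVEL filter first):

// Simplified sketch of the runtime check behind log::info!,
// assuming only the log crate as a dependency.
fn log_check_sketch(i: i32, len: usize) {
    // log::max_level() reads a global atomic; it is LevelFilter::Off
    // until a logger implementation calls log::set_max_level.
    if log::Level::Info <= log::max_level() {
        // The real macro builds a Record and dispatches it to
        // log::logger().log(...).
        println!("{}: {}", i, len);
    }
}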

Here is the code:

use std::collections::HashSet;

use criterion::{criterion_group, criterion_main, Criterion};

// Mixes HashSet lookups and inserts with some arithmetic; when LOG is
// true, each iteration additionally emits a log::info! record.
fn f<const LOG: bool>() -> usize {
    let mut x = 0;
    let mut hs = HashSet::<i32>::new();
    for i in 0..10000 {
        x += i;
        if hs.contains(&x) {
            x *= 2;
        } else {
            x *= 3;
        }
        x %= 1000;
        hs.insert(x);
        if LOG {
            log::info!("{}: {}", i, hs.len());
        }
    }

    hs.len()
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("without", |b| b.iter(|| f::<false>()));
    c.bench_function("with", |b| b.iter(|| f::<true>()));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Running

cargo clean
cargo bench

Produces

without                 time:   [265.71 µs 272.44 µs 280.77 µs]
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe

with                    time:   [276.52 µs 277.55 µs 278.80 µs]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

Then, running the benchmark again

cargo bench

Produces

without                 time:   [261.80 µs 263.03 µs 264.49 µs]
                        change: [-7.5920% -3.7254% +0.1192%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

with                    time:   [264.31 µs 265.26 µs 266.32 µs]
                        change: [-9.9053% -6.0486% -2.2028%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

Running the benchmark a third time

cargo bench

Produces

without                 time:   [251.02 µs 251.39 µs 251.83 µs]
                        change: [-7.6265% -4.5399% -1.4715%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

with                    time:   [251.56 µs 251.94 µs 252.38 µs]
                        change: [-8.3006% -5.3281% -2.1746%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe

Why does each subsequent run show a performance improvement? The code and the environment are identical.

asked Oct 19 '25 by Roy Varon

1 Answer

The Criterion.rs user guide answers this question fairly well.

Typically this happens because the benchmark environments aren't quite the same. There are a lot of factors that can influence benchmarks. Other processes might be using the CPU or memory. Battery-powered devices often have power-saving modes that clock down the CPU (and these sometimes appear in desktops as well). If your benchmarks are run inside a VM, there might be other VMs on the same physical machine competing for resources.

However, sometimes this happens even with no change. It's important to remember that Criterion.rs detects regressions and improvements statistically. There is always a chance that you randomly get unusually fast or slow samples, enough that Criterion.rs detects it as a change even though no change has occurred. In very large benchmark suites you might expect to see several of these spurious detections each time you run the benchmarks.

Unfortunately, this is a fundamental trade-off in statistics. In order to decrease the rate of false detections, you must also decrease the sensitivity to small changes. Conversely, to increase the sensitivity to small changes, you must also increase the chance of false detections. Criterion.rs has default settings that strike a generally-good balance between the two, but you can adjust the settings to suit your needs.

You can tune these settings per benchmark group, as described in the user guide's Advanced Configuration chapter. For example, the guide raises the significance level to detect smaller differences and increases the sample size to counteract the extra noise:

fn bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("sample-size-example");
    // Configure Criterion.rs to detect smaller differences and increase sample size to improve
    // precision and counteract the resulting noise.
    group.significance_level(0.1).sample_size(500);
    group.bench_function("my-function", |b| b.iter(|| my_function()));
    group.finish();
}
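If the spurious "performance has improved" verdicts between identical runs are the main concern, another knob worth knowing about is the noise threshold: measured changes smaller than the threshold are not flagged as improvements or regressions. A minimal sketch along the same lines as the example above (my_function is again a placeholder):

fn bench_with_noise_threshold(c: &mut Criterion) {
    let mut group = c.benchmark_group("noise-threshold-example");
    // Treat measured changes below 5% as noise rather than reporting
    // them as an improvement or regression (the default is 1%).
    group.noise_threshold(0.05);
    group.bench_function("my-function", |b| b.iter(|| my_function()));
    group.finish();
}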
answered Oct 20 '25 by E_net4 stands with Ukraine


