Seems that the optimizer has issues with closures

Or: I've found a way to degrade code performance without doing anything stupid.

So consider this snippet, which solves the maximum subarray problem in linear time:

pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32)
{
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > 0 {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    
    (left_idx, right_idx + 1, sum)
}
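As a quick sanity check (the sample array here is my own, not from the original post), the function behaves as expected on a classic example. The snippet below duplicates the function so it is self-contained:

```rust
// Self-contained check of max_subarray_bad on a small example.
// Returns (left, right, sum) where `right` is exclusive.
pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32) {
    let prefixes = arr.iter().enumerate().scan((0, 0), |s, (i, v)| {
        if s.1 > 0 {
            s.1 = s.1 + *v; // extend the current subarray
        } else {
            *s = (i, *v); // restart a new subarray at index i
        }
        Some(*s)
    });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    (left_idx, right_idx + 1, sum)
}

fn main() {
    let arr = [-2, 1, -3, 4, -1, 2, 1, -5, 4];
    // The best subarray is [4, -1, 2, 1] with sum 6, spanning indices 3..7.
    assert_eq!(max_subarray_bad(&arr), (3, 7, 6));
    println!("ok");
}
```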

Since its execution time doesn't depend on the actual data in the array, we can easily benchmark it by providing a vector of size N filled with zeros, like this:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_linear(c: &mut Criterion) {
    const N: usize = 1000000;
    c.bench_function(&format!("max_subarray([..]) N = {}", N), |b| {
        b.iter(|| max_subarray::max_subarray_bad(black_box(&vec![0; N])))
    });
}

criterion_group!(benches, benchmark_linear);
criterion_main!(benches);
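For completeness, this benchmark assumes criterion is wired up in Cargo.toml roughly as follows (the version number and bench name here are illustrative, not from the original post):

```toml
[dev-dependencies]
criterion = "0.3"

[[bench]]
name = "scratch"   # matches benches/scratch.rs
harness = false    # let criterion provide main()
```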

We use the criterion crate for benchmarking; the output of cargo bench is:

Benchmarking max_subarray([..]) N = 1000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.6s or reduce sample count to 30.
max_subarray([..]) N = 1000000
                        time:   [3.2324 ms 3.2700 ms 3.3141 ms]
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

     Running target\release\deps\scratch-b68a42551ab01289.exe

Ok, now let's change max_subarray_bad by binding 0 to the name zro and using it in the closure instead of the literal 0:

pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32)
{
    let zro = 0;
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > zro {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    
    (left_idx, right_idx + 1, sum)
}

Surprisingly, our benchmark greatly improves:

Benchmarking max_subarray([..]) N = 1000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.9s or reduce sample count to 40.
max_subarray([..]) N = 1000000
                        time:   [2.5705 ms 2.5806 ms 2.5913 ms]
                        change: [-20.260% -19.668% -19.124%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

     Running target\release\deps\scratch-b68a42551ab01289.exe

Almost a 20% performance increase! And all we did was replace the literal 0 in s.1 > 0 with a name bound to 0: s.1 > zro. You can switch the code back and forth between 0 and zro and you always get a ~20% performance difference. I checked that this is reproducible on both stable and nightly. Is it a bug in the optimizer? Or am I missing something about how closures actually treat literal values?

By the way, changing let zro = 0 into const zro: i32 = 0 results in a performance drop too.

Update: I also created an issue on the main rust-lang repo on GitHub.


Sounds like bug report fuel to me :wink:


Well, I wasn't sure whether I had misunderstood something in Rust, since I've only been learning it for about a week. Nevertheless, if more people than just me think it is indeed a bug, then I'm going to report it on the GitHub issue tracker. Stay tuned :slight_smile:


Yep, they labeled my report as "C-bug" and "I-slow", so it is indeed a compiler bug. Btw, I was shocked by the number of open issues about missed optimizations; I had thought that LLVM as a backend solved most optimization problems.

So the short answer is: it works as designed. You can find all the details via the GitHub issue link at the bottom of the topic.

The trick here is the codegen-units setting in the build profile. The default is 16, which splits the crate into 16 codegen units compiled in parallel and may not yield maximally optimized code. So for best performance one should set codegen-units = 1; with a single codegen unit, both versions compile to the same code with the same performance.
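Concretely, the setting goes into Cargo.toml; this profile fragment follows Cargo's documented syntax:

```toml
[profile.release]
codegen-units = 1
```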


Ah, yes, I've been bitten by this one before. I set codegen-units = 1 in my performance-sensitive projects, because parallel codegen reduces optimization quality and makes the optimizer's behavior less predictable.

OTOH, it does fulfill its intended purpose of making release builds compile faster at a modest runtime performance cost, so given how often people complain about rustc's compile times, I can understand why it's enabled by default.