Performance regression 1.77.0 -> 1.78.0?

After switching from rustc 1.77.0 to 1.78.0, I noticed that some of my code runs significantly slower than before in a fully optimised build.

The code is in git@github.com:tallinn1960/rain_collected.git. It has a criterion benchmark, which shows that the function compute_rain_collected runs significantly slower when compiled by 1.78.0, a 176% regression.

compute_rain_collected_trap/compute_rain_collected
                        time:   [23.291 ms 23.338 ms 23.386 ms]
                        change: [+175.17% +176.07% +176.89%] (p = 0.00 < 0.05)
                        Performance has regressed.

Is this a criterion flaw or is it a compiler regression? All other tested functions in that benchmark perform the same with both compiler versions though.

This is on a Mac mini M1.

Try cargo-show-asm to see if the generated code has changed. Maybe that will have a clue why.

What target triple are you using?

x86 Compiler Explorer
aarch64 Compiler Explorer

For aarch64, the 1.78 version has fewer branches which should be beneficial but might be worse if the branch predictor does a good job for the use case.

One thing you might do to make LLVM's life easier:

Rather than iterating over &i64,

instead do

let mut height = height.iter().copied();

(And correspondingly change |acc, &x| to |acc, x|.)

LLVM is much happier thinking about simple no-provenance no-races values like i64 rather than pointers. So I'd be curious if that changes things.
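As a minimal sketch of that change: the two functions below are hypothetical stand-ins, not the code from the repository. They assume a fold that tracks a running maximum and sums how far each element sits below it. The only difference between them is the item type the closure receives.

```rust
/// Hypothetical example: folds over references, so the closure
/// destructures each `&i64` item with the `&x` pattern.
pub fn deficit_by_ref(height: &[i64]) -> u64 {
    height
        .iter()
        .fold((i64::MIN, 0u64), |acc, &x| {
            // acc.0 is the running maximum, acc.1 the accumulated deficit.
            let level = x.max(acc.0);
            (level, acc.1 + (level - x) as u64)
        })
        .1
}

/// Same fold, but `.copied()` changes the item type from `&i64` to `i64`,
/// so the closure takes plain values and the pattern becomes `x`.
pub fn deficit_by_value(height: &[i64]) -> u64 {
    height
        .iter()
        .copied()
        .fold((i64::MIN, 0u64), |acc, x| {
            let level = x.max(acc.0);
            (level, acc.1 + (level - x) as u64)
        })
        .1
}
```

Both versions return the same result for any input; whether the second one is actually faster depends on the compiler version and target, so it is worth benchmarking rather than assuming.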

I tried this on x86_64-pc-windows-msvc and the same regression occurs. Nightly (2024-05-27) is just as bad, and 1.79 beta is in the middle.

1.77    443.545µs
1.78    872.203µs
beta    657.485µs
nightly 876.462µs

The order changes a lot depending on how many iterations occur.

Switching to .copied() made 1.77, 1.78, and nightly about as fast as beta was before, while beta now takes about as long as nightly/1.78 did before. In my experience copied sucks, but I'm always hopeful it'll be fixed someday.

I'm gonna turn this into a procedural loop next and see how that goes.

Edit: while loop is worse, but when the iterator is copied it's fast in everything except nightly!

pub fn compute_rain_collected(height: &[i64]) -> u64 {
    let mut height = height.iter().copied();

    let mut state = (height.next(), height.next_back());
    let mut acc = (i64::MIN, 0u64);
    while let (Some(left), Some(right)) = state {
        let x = if left <= right {
            state = (height.next(), Some(right));
            left
        } else {
            state = (Some(left), height.next_back());
            right
        };

        let stepsize = x.max(acc.0);
        acc = (stepsize, acc.1 + (stepsize - x) as u64);
    }
    acc.1
}
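When rewriting a hot function like this for speed, it can help to cross-check each variant against a slow but obvious version. The following is a sketch of such a reference, assuming the function computes the classic "trapped rain water" quantity; the name rain_collected_reference is made up for illustration and is not part of the repository.

```rust
/// Hypothetical reference implementation: for every bar, the water level
/// is the smaller of the highest bar to its left and the highest bar to
/// its right (both inclusive), and the water held is level minus height.
pub fn rain_collected_reference(height: &[i64]) -> u64 {
    let n = height.len();

    // suffix_max[i] = highest bar in height[i..]
    let mut suffix_max = vec![i64::MIN; n];
    let mut highest = i64::MIN;
    for i in (0..n).rev() {
        highest = highest.max(height[i]);
        suffix_max[i] = highest;
    }

    // Walk left to right, keeping the highest bar seen so far.
    let mut prefix_max = i64::MIN;
    let mut total = 0u64;
    for (i, &h) in height.iter().enumerate() {
        prefix_max = prefix_max.max(h);
        // Both maxima include h itself, so level >= h and the cast is safe.
        let level = prefix_max.min(suffix_max[i]);
        total += (level - h) as u64;
    }
    total
}
```

It uses O(n) extra memory and two passes, so it is not meant to be fast; its only job is to give a trusted answer that the optimised variants can be compared against in a test.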

Well, the code has changed, but I do not know enough arm64 assembly to understand why it got worse.

The default target triple for Rust on M1 is aarch64-apple-darwin.

I mark this as a solution, as it restores the original speed of the solution on my machine.

Seems like 1.77.0 did that optimisation on its own, and 1.78.0 doesn't any longer? I wish I understood more of arm64 code.

Thanks for those responses.

There are some interesting pointer provenance questions about slice iterators that are coming up now that LLVM is starting to try to actually fix a bunch of long-standing bugs around it.

You might be interested in this zulip thread about a change in LLVM 19 (https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/pointer.20equality.20propagation/near/441362175) or this speculative PR about giving them only a single provenance (rust-lang/rust pull request #122971, "Make slice iterators carry only a single provenance", by scottmcm).
