Compilers are black magic, mostly good black magic :)

So, I'm currently going through some of the problems from project euler, specifically number 33.
The problem itself isn't really complicated but I noticed some strange benchmark results I would love to hear some opinions about.

Disclaimer: The following code solves the problem and if you don't want any spoilers you shouldn't keep reading :slight_smile:
playground (copy and pasted some dependencies together until all required code was there).

Now the strange part:
in the if condition starting on line 61 the first two cases (out of the four) can never happen. So I assumed commenting them out should improve performance right.
So I changed the if condition as follows:

if n_1 == d_2 && n_2 == num && d_1 == den
|| n_2 == d_1 && n_1 == num && d_2 == den
{[...]}

And the performance decreased:
Original:
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.00000226712
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000020891310000000003
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000020770779
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000020719742
[src/bin/problem_0033.rs:33] solve() = 100

"Improved":
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000025229199999999997
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000024225290000000003
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000024351696999999998
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000024414159199999998
[src/bin/problem_0033.rs:33] solve() = 100

Thats a whooping 20%-25% slower than before. And that because of two conditions that could never happen (and should not help the compiler avoid any bound checks I assume).

I am relatively confident that the compiler does not optimize the function away completely while measuring as the measured time per loop stays roughly the same for different numbers of iterations (100,1000,10000,100000).

1 Like

You could look at the generated assembly: https://godbolt.org/z/eKd3fc It's to different for me to understand anything, but maybe it helps you :slight_smile:

2 Likes

Dude, I ran into similarly weird things working on Bevy:

I wonder if it's related.

I commented out some of the benchmarks in the benchmark program, and it made the previous benchmarks 100% slower. Bevy is a much more complicated beast than a simple example so it could be unrelated, but still I'm curious about this now.

You ran this more than once, with similar-looking results, right?

Number of iterations in the loop is important because it reduces measurement error. But total number of measurements is important too -- you also need to measure a bunch of times, and alternate between the versions, in case the one time you measured the "improved" version just happened to coincide with the OS randomly starting a CPU heavy background task or something like that.

I also plugged the whole thing into Compiler Explorer. Most of the differences are just in register allocation and the arbitrary names of labels. There is a chunk of code that is present in the "Original" but not in the "Improved", which appears to correspond to the comparisons that have been edited out, but nothing stands out to me as a likely culprit.

Free idea: what is the performance like if you put && {return 0} at the end of each impossible condition? It changes the assembly a bit more.

            if n_1 == d_1 && n_2 == num && d_2 == den && {return 0}
                || n_2 == d_2 && n_1 == num && d_1 == den && {return 0}
                || n_1 == d_2 && n_2 == num && d_1 == den
                || n_2 == d_1 && n_1 == num && d_2 == den

If we assume it's not something about the compiled code that is slow, that leaves "some CPU magic" as the next most likely explanation. The most obviously related form of CPU magic is branch prediction. Maybe adding the extra code changes the parameters enough that the branch predictor can do a better job -- it sounds strange, but isn't impossible. The main problem with that theory is that it's quite hard to test.

4 Likes

Meta: benchmarking is hard. I suggest using something like https://bheisler.github.io/criterion.rs/book/index.html that will give more information so you can see things like code that doesn't actually depend on the number of iterations more easily.

The timings in your post are in the microseconds per iteration, which is fast enough to be affected by all kinds of not-your-code stuff -- even things like environment variable length (yes, that matters, even though it feels like it shouldn't) or where ASLR decides to put it to run it.

No, that's often not the case.

Optimizing compilers are always a tradeoff between making code better and the compile time it costs to do so, so a big part of getting good results from them is making things obvious to them. This is generally most obvious with indexing, where adding "more checks" can actually make code faster my making it easier to LLVM to remove checks later (more @ https://internals.rust-lang.org/t/more-musings-about-slices-and-arrays/13127/9?u=scottmcm).

LLVM is usually really good at getting rid of obvious "you said that already", so removing "can never happen" code will often not affect runtime.

4 Likes

Ofcourse :slight_smile:

Yes, I know. Its just one of the many problems and I won't write a benchmark for every one of them.

Assuming that the compiler didn't decide to put some of the env handling into the measured function this would really surprise me. Especially since I call the function many times and the cache should be hot after the first few runs.

That's something I also ran into multiple times already, that's why I specified that there is no indexing happening :slight_smile:
Maybe the existence of the additional conditions causes some heuristic in llvm to apply different optimizations?

Might check that later :slight_smile:

Either way, I don't really expect a definitive answer (I didn't formulate any question), just wanted to share this interesting behavior as an example of weird compiler/CPU/cache/etc behavior after some seemingly obvious "improvement".

2 Likes

This would be my first suspicion as well. There are some nightly-only compiler intrinsics that may hint the branch predictor: likely and unlikely. Anyone interested in looking deeper into this theory could make use of these.

1 Like

Actually I would think that instruction cache alignment is more likely cause than branch prediction weirdness. I remember seeing a couple of other articles/posts where microbenchmarks got weird results that they determined was due to differing cache alignment of the code, which was essentially random based on other parts of the code

5 Likes

Performance get worse too, but not as bad as with the lines completly removed:
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.00000262138
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000022892709999999997
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000023011015
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.00000229491972

I'm not exactly sure how to use them, would this one be correct?

if unlikely(n_1 == d_1 && n_2 == num && d_2 == den)
|| unlikely(n_2 == d_2 && n_1 == num && d_1 == den)
|| likely(n_1 == d_2 && n_2 == num && d_1 == den)
|| likely(n_2 == d_1 && n_1 == num && d_2 == den)
{
}

(I would assume it's wrong, and it has to be used on the whole if condition).
I testet it in all combinations I could think of (mixed likely and unlikely around different part of the condition, un/likely on the whole condition), but it didn't make a difference (at least none I could measure).

That sounds reasonable. Do you know if there is a way to change the alignment of code? I know that in C it's possible to put bytestrings in some places in order to change the alignment of later code, but I assume that is not possible in rust..

I just tried running it with "-C target-cpu=native" and the results change, making the "improved" version actually faster:
Original:
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.000001949339
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000019471862
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.00000192140842
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000019072206700000002

"Improved":
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.000001777561
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.0000017956498
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.00000173347752
[src/bin/../helper.rs:69] now.elapsed().as_secs_f64() / loops as f64 = 0.000001723958493

I would take this as a hint that @IndianBoy42 assumption is correct and the performance difference is some alignment difference is the cause

Yes.

Somewhere on the youtubes among all the Rust and C++ conference presentations I have been watching this last year I found one discussing exactly that. Basically that reorganizing the layout of functions in the executable had a huge effect on benchmark results. Tens of percent, often a bigger effect than going form -O2 to -03 optimization or whatever.

Even the fact that different users got very different benchmark results. On the same machine! Why? Because the position of code when loaded changed from user to user with the size of the environment variable strings they had.

The bottom line was that they had made a benchmarking tool that repeatedly randomized the location of functions in memory and then produced average and standard deviation score from all the runs.

Sadly I cannot find that video presentation or remember the name of the benchmarking tool they had created. Searching around for "benchmark random code position" or whatever only turns up tons of unrelated stuff.

This kind of thought came to me when I saw the above presentation. Can't we have a tool the moves code around to get the best result?

But if it is true that the position code is loaded to impacts it's speed and that changing things like ones environment variables changes the code position, then it's hopeless. How ever you build your code to get it lined up "just so" that will all be undone when run depending where it is loaded.

Perhaps an interesting experiment for you would be to add and remove environment variables between runs of your code and see how/if thing change.

5 Likes

That's what I was thinking. Maybe the only thing we need to do to get a "smarter" benchmarking tool is to change random environment variable every criterion iteration.

It's more than different values, it needs a different process, IIRC, to get the code loaded in a different position. Which would make benchmarking substantially slower...

2 Likes

Plenary: Performance Matters - Emery Berger - CppCon 2020

Slightly older but essentially the same presentation

They made Stabilizer and Coz (which has Rust support).

8 Likes

Yes. Well done. That is exactly what I was thinking of.

It really does highlight some non-intuitive, non-obvious, behaviors of our computers today.