A colleague wanted to compare the performance of rustc with clang. He created a small example in C:
#include <inttypes.h>
#include <stdio.h>

static inline int64_t f(const int64_t x) { return x * x * x * x * x; }

int main(void) {
    const int64_t s = 300;
    for (int64_t x1 = 1; x1 <= s; x1++)
        for (int64_t x2 = x1; x2 <= s; x2++)
            for (int64_t x3 = x2; x3 <= s; x3++)
                for (int64_t x4 = x3; x4 <= s; x4++)
                    for (int64_t x5 = x4; x5 <= s; x5++)
                        if (f(x1) + f(x2) + f(x3) + f(x4) == f(x5))
                            printf("%" PRId64 " %" PRId64 " %" PRId64 " %" PRId64 " %" PRId64 "\n",
                                   x1, x2, x3, x4, x5);
    return 0;
}
and the (hopefully) equivalent in Rust (I know println! is different, but it's only called twice):
fn f(x: i64) -> i64 {
    x * x * x * x * x
}

pub fn main() {
    const S: i64 = 300;
    for x1 in 1..=S {
        for x2 in x1..=S {
            for x3 in x2..=S {
                for x4 in x3..=S {
                    for x5 in x4..=S {
                        if f(x1) + f(x2) + f(x3) + f(x4) == f(x5) {
                            println!("{} {} {} {} {}", x1, x2, x3, x4, x5);
                        }
                    }
                }
            }
        }
    }
}
With -O3 for clang and --release for cargo, the Rust example takes almost 5 times as long as the C example (49.13 s vs 10.56 s, timed with hyperfine). When looking at the disassembly, it is obvious why.
While clang notices that f(x1) (and so on) only needs to be calculated once per outer loop, rustc does not and recalculates f(x1) (and so on) in the innermost loop. What am I missing here? It seems like quite an obvious optimization.
Replacing x..=S with x..S + 1 would make this more like the C code: a half-open range like x..S + 1 would simply overflow if S were i64::MAX (just as the C loop would), and can therefore be implemented more efficiently. Or, put differently, the iterator code for inclusive ranges has some overhead from handling that special case correctly, and that overhead might not be optimized away entirely, even when the upper bound is a constant as it is here. Maybe that's all that's needed to bring the Rust code up to speed?
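For illustration, a sketch of what the half-open-range rewrite might look like (my own rewrite, not taken from the original post); the remaining inner loops would be changed the same way:

    // Sketch: half-open ranges with an exclusive upper bound. S + 1 is safe here
    // because S = 300 is nowhere near i64::MAX, and the iterator no longer needs
    // the extra bookkeeping that ..= performs to handle an upper bound at the
    // type's maximum value.
    for x1 in 1..S + 1 {
        for x2 in x1..S + 1 {
            // ... x3, x4, x5 analogously
        }
    }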
On second read, I'm noticing the question specifically mentions re-calculation of f(x1), which isn't an obvious consequence of the above; anyway, I was just pointing out a significant structural difference between the Rust version and the C version, i.e. I'm just responding to the fact that the code was called "equivalent", which it isn't as much as it could be.
This doesn't answer the question of loop invariants though: I just checked on the playground, and all the imuls are still in the innermost loop. Or do you expect this difference to be negligible?
Are you sure a couple of multiplications are gonna matter that much? In my opinion any difference between println and printf is likely orders of magnitude bigger.
While it's not obvious why println! would matter, could it be that it's preventing the compiler from applying those optimizations?
Replacing println! with printf makes the C version and the rust-printf version take around the same time, whereas the println! version takes a LOT longer (on my laptop with an AMD Ryzen 7 PRO 4750U).
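For reference, the rust-printf variant presumably replaces the println! call with something along these lines (a sketch assuming the libc crate as a dependency; the exact code used in the benchmark isn't shown):

    // Sketch: calling C's printf from Rust via the libc crate (assumed
    // dependency), bypassing Rust's formatting machinery entirely.
    unsafe {
        libc::printf(
            b"%lld %lld %lld %lld %lld\n\0".as_ptr() as *const libc::c_char,
            x1, x2, x3, x4, x5,
        );
    }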
EDIT: adding a println-scoped version based on tczajka's answer below, which brings a println! solution in line with the C and rust-printf versions:
$ hyperfine ./rust-println-scoped ./rust-printf ./rust-println ./a.out
Benchmark 1: ./rust-println-scoped
  Time (mean ± σ):     16.385 s ±  0.071 s    [User: 16.347 s, System: 0.003 s]
  Range (min … max):   16.239 s … 16.478 s    10 runs

Benchmark 2: ./rust-printf
  Time (mean ± σ):     16.535 s ±  0.074 s    [User: 16.499 s, System: 0.001 s]
  Range (min … max):   16.465 s … 16.720 s    10 runs

Benchmark 3: ./rust-println
  Time (mean ± σ):     77.834 s ±  0.262 s    [User: 77.626 s, System: 0.025 s]
  Range (min … max):   77.582 s … 78.269 s    10 runs

Benchmark 4: ./a.out
  Time (mean ± σ):     16.623 s ±  0.349 s    [User: 16.572 s, System: 0.002 s]
  Range (min … max):   16.332 s … 17.536 s    10 runs

Summary
  ./rust-println-scoped ran
    1.01 ± 0.01 times faster than ./rust-printf
    1.01 ± 0.02 times faster than ./a.out
    4.75 ± 0.03 times faster than ./rust-println
(./a.out being the C version)
All Rust versions were compiled with cargo build --release (rustc 1.72.1), and the C version with gcc main.c -O3 (gcc 13.2.1).
I'm not experienced enough to say whether the Rust versions end up with the same optimizations as the C version.
The compiler is worried that the variables may get changed through the immutable references passed to the Display implementations, so it cannot treat the f calls as invariant across loop iterations. That is unfortunate, given that immutability is such a core language feature.
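One way to sidestep that (my own illustration, not from the original post) is to hoist the loop-invariant calls by hand:

    // Manual loop-invariant code motion: each f(xi) is computed once at the
    // loop level where xi is bound, so the innermost loop only evaluates f(x5).
    for x1 in 1..=S {
        let f1 = f(x1);
        for x2 in x1..=S {
            let f2 = f(x2);
            for x3 in x2..=S {
                let f3 = f(x3);
                for x4 in x3..=S {
                    let lhs = f1 + f2 + f3 + f(x4);
                    for x5 in x4..=S {
                        if lhs == f(x5) {
                            println!("{} {} {} {} {}", x1, x2, x3, x4, x5);
                        }
                    }
                }
            }
        }
    }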
Blocks return values. They can never act as places, so when the formatting machinery takes a reference, &{x1}, it's referring to a temporary that holds a copy of the value of x1, not x1 itself.
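Presumably that is what the rust-println-scoped variant from the benchmark above does, something like:

    // Each argument is wrapped in a block, so the formatter borrows a temporary
    // copy of the value rather than the loop variable itself, and the f calls
    // can be treated as loop-invariant again.
    if f(x1) + f(x2) + f(x3) + f(x4) == f(x5) {
        println!("{} {} {} {} {}", {x1}, {x2}, {x3}, {x4}, {x5});
    }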
Yes, absolutely. The compiler is well aware that i64 is not interior-mutable — it's just evidently not using that information in codegen. That's why @zirconium-n linked a bug.