Rustc does not detect loop invariant

Hey guys,

a colleague wanted to compare the performance of rustc with clang. He created a small example in C

#include <inttypes.h>
#include <stdio.h>
static inline int64_t f(const int64_t x) { return x * x * x * x * x; }
int main(void) {
  const int64_t s = 300;
  for (int64_t x1 = 1; x1 <= s; x1++)
    for (int64_t x2 = x1; x2 <= s; x2++)
      for (int64_t x3 = x2; x3 <= s; x3++)
        for (int64_t x4 = x3; x4 <= s; x4++)
          for (int64_t x5 = x4; x5 <= s; x5++)
            if (f(x1) + f(x2) + f(x3) + f(x4) == f(x5))
              printf("%" PRId64 " %" PRId64 " %" PRId64 " %" PRId64 " %" PRId64 "\n",
                     x1, x2, x3, x4, x5);
  return 0;
}

and the (hopefully) equivalent in rust (I know println! is different, but its only called twice)

fn f(x: i64) -> i64 {
    x * x * x * x * x
}

pub fn main() {
    const S: i64 = 300;
    for x1 in 1..=S {
        for x2 in x1..=S {
            for x3 in x2..=S {
                for x4 in x3..=S {
                    for x5 in x4..=S {
                        if f(x1) + f(x2) + f(x3) + f(x4) == f(x5) {
                            println!("{} {} {} {} {}", x1, x2, x3, x4, x5);
                        }
                    }
                }
            }
        }
    }
}

With -O3 for clang and --release for cargo, the rust example takes almost 5 times as long as the C example (49.13 s vs 10.56 s, timed with hyperfine). When looking at the disassembly it is obvious why.

While clang notices that f(x1) (and so on) only needs to be calculated once per outer loop, rustc does not and recalculates f(x1) (and so on) in the innermost loop. What am I missing here? It seems like a quite obvious optimization.

Thanks,
Florian

4 Likes

Replacing x..=S with x..S+1 should be more like the C code example, which would overflow when S is i64::MAX, and thus can be implemented more efficiently – or put differently, the iterator code for inclusive ranges has some overhead from handling this special case correctly, and that overhead might not be able to be optimized away entirely, even when the upper bound is a constant like it is here. Maybe that’s all that’s needed to bring the rust code up to speed?


On second read, I’m noticing the question specifically mentions re-calculation of f(x1) which isn’t an obvious consequence of the above; anyways, I was just pointing out significant structural difference between the Rust version and the C version, i.e. I’m just answering to the fact that the code was called “equivalent” which it isn’t as much as it could be.

2 Likes

This doesn't answer the question of loop invariants though - just checked on the playground, all imuls are still in the innermost loop. Or do you expect this difference to be negligible?

Thanks for the replies. I already tried x..S+1, which lead to the same results unfortunately.

Are you sure a couple of multiplications are gonna matter that much? In my opinion any difference between println and printf is likely orders of magnitude bigger.

1 Like

I think so, println is only called 2 times, which can't explain a 40 s difference, right?

1 Like

Oops totally missed that part sorry

1 Like

While the println! is not obvious, it could be that it's removing the ability for the compiler to trigger the optimizations?

Replacing println! with printf causes the c version and the rust-printf version to take around the same time, where as the println! version takes a LOT longer (on my laptop with a AMD Ryzen 7 PRO 4750U).

rust-printf

EDIT: adding a println-scoped version based on tczajka's answer below, which puts a println solution in line with the c and rust-printf version

$ hyperfine ./rust-println-scoped ./rust-printf ./rust-println ./a.out
Benchmark 1: ./rust-println-scoped
  Time (mean ± σ):     16.385 s ±  0.071 s    [User: 16.347 s, System: 0.003 s]
  Range (min … max):   16.239 s … 16.478 s    10 runs

Benchmark 2: ./rust-printf
  Time (mean ± σ):     16.535 s ±  0.074 s    [User: 16.499 s, System: 0.001 s]
  Range (min … max):   16.465 s … 16.720 s    10 runs

Benchmark 3: ./rust-println
  Time (mean ± σ):     77.834 s ±  0.262 s    [User: 77.626 s, System: 0.025 s]
  Range (min … max):   77.582 s … 78.269 s    10 runs

Benchmark 4: ./a.out
  Time (mean ± σ):     16.623 s ±  0.349 s    [User: 16.572 s, System: 0.002 s]
  Range (min … max):   16.332 s … 17.536 s    10 runs

Summary
  ./rust-println-scoped ran
    1.01 ± 0.01 times faster than ./rust-printf
    1.01 ± 0.02 times faster than ./a.out
    4.75 ± 0.03 times faster than ./rust-println

(./a.out being the c version)
all rust versions compiled with cargo build --release (rustc 1.72.1), and c version with gcc main.c -O3 (gcc 13.2.1).
I'm not experienced enough to comment on whether it's done the same optimizations as the c version.

5 Likes

I remember that println was interpreted as a pointer escape at some point. I thought that was fixed but apparently it isnt.

1 Like

Here is a way to make the code 5x faster -- replace this line:

println!("{} {} {} {} {}", x1, x2, x3, x4, x5);

with:

println!("{} {} {} {} {}", {x1}, {x2}, {x3}, {x4}, {x5});

The compiler is worried that the variables may get changed through immutable references passed to Display implementations, making the f calls not invariant across loop iterations. That is unfortunate, given that immutability is such a core language feature.

52 Likes

> Boss, I made this code 400% faster!
> Great, you get a raise!
> …
> Just curious, how'd you do that?
> {}{}{}{}{}

21 Likes

Is this a bug in println! that should be reported? Or is this intended behavior?

1 Like

Definitely unintended. I think this is a case of #91010

18 Likes

How do the single expression blocks in the replacement line change the compiler’s assumptions?

2 Likes

Blocks return values. They can never act as places, so when the formatting machinery takes a reference, &{x1}, it's referring to a temporary that holds a copy of the value of x1, not x1 itself.

10 Likes

Would it be theoretically possible for the compiler to infer if interior mutability of an immutably referenced value is even possible?

(Either by inspecting the type of the referenced value or by statically analyzing code paths the reference takes?)

Yes, absolutely. The compiler is well aware that i64 is not interior-mutable — it's just evidently not using that information in codegen. That's why @zirconium-n linked a bug.

8 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.