Rust not optimizing out spawning ThreadRng?

Here is some code (add rand = "0.8.4" to Cargo.toml):

fn work() -> i32 {
    let mut a = 0;
    let mut b = 23;
    for _ in 0..100_000_000 {
        a = ((b >> 3) + 17 * b) % 2_000_000_000;
        b = ((a >> 5) + 171) % 2_000_000_000;
    }
    b
}

fn main() {
    let now = std::time::Instant::now();
    let result = work(); // the important line
    println!("{:?}", now.elapsed());
}

This code contains a line that calls a function doing a lot of work, but the result is never used. So, as expected, Rust optimizes it away when using "cargo run --release", and the printed time is: 400ns
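As an aside, the usual way to keep such unused work from being optimized away in a benchmark is std::hint::black_box (stable since Rust 1.66), which acts as an optimization barrier. A minimal sketch:

```rust
use std::hint::black_box;

fn work() -> i32 {
    let mut a = 0;
    let mut b = 23;
    for _ in 0..100_000_000 {
        a = ((b >> 3) + 17 * b) % 2_000_000_000;
        b = ((a >> 5) + 171) % 2_000_000_000;
    }
    b
}

fn main() {
    let now = std::time::Instant::now();
    // black_box forces the compiler to assume the value is used,
    // so the loop is actually executed even in --release.
    let result = black_box(work());
    println!("{:?}", now.elapsed());
    println!("{result}");
}
```

With black_box, the release build reports a duration in the hundreds of milliseconds instead of nanoseconds.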

So intuitively, I thought it would also optimize this next line of code the same way:

fn main() {
    let now = std::time::Instant::now();
    let mut rng = rand::thread_rng(); // the important line
    println!("{:?}", now.elapsed());
}

But it does not. It prints: 1.8369ms

Here is a case where it spends 4 seconds doing nothing:

use rand::Rng;
fn main() {
    let now = std::time::Instant::now();
    let mut sum = 0;
    for _ in 0..1_000_000_000 {
        let mut rng = rand::thread_rng();
        let n: u8 = rng.gen();
        sum += n as i32;
    }
    println!("{:?}", now.elapsed());
}

Result: 4.5228403s
Note: I tried printing the result to check whether the Rust compiler was optimizing something out, but no, the timing is the same.
(Side note: this is absurd speed for summing one billion random numbers!)

Is this intended, or is Rust just unable to optimize it away? Am I missing anything?

Since the thread_rng() function is lazily initializing a thread-local (i.e., effectively global) variable, and the act of generating random numbers changes that (effectively global) state, the compiler can't possibly be allowed to optimize it away.

For example, if this were a predictable counter instead of an RNG, then not calling the function out of a desire to "optimize" would affect the behavior of the function elsewhere: it would return different values. It would likewise affect the behavior of the RNG too, but for random numbers, it doesn't usually matter in practice. (It would still matter if one cared about reproducibility.) However, the compiler doesn't know that you are generating random numbers; for all it knows, the gen() function is invoking a perfectly deterministic algorithm involving a large amount of exact bit shuffling.
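To make the counter analogy concrete, here is a std-only sketch (the COUNTER and next names are made up for illustration):

```rust
use std::cell::Cell;

// Hypothetical stand-in for thread_rng(): a predictable thread-local counter.
thread_local! {
    static COUNTER: Cell<u64> = Cell::new(0);
}

// Each call mutates thread-local (effectively global) state.
fn next() -> u64 {
    COUNTER.with(|c| {
        let v = c.get();
        c.set(v + 1);
        v
    })
}

fn main() {
    let _unused = next();  // result ignored, but the side effect is observable:
    let observed = next(); // returns 1; it would return 0 if the call above were elided
    println!("{observed}");
}
```

Eliding the first call would change what the second call returns, so the compiler must keep it; the same reasoning applies to an RNG that mutates shared state.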

Do you mean that it is too slow?


You sure?

I ran your Rust snippet on my computer...

$ cat src/main.rs
use rand::Rng;
fn main() {
    let now = std::time::Instant::now();
    let mut sum = 0;
    for _ in 0..1_000_000_000 {
        let mut rng = rand::thread_rng();
        let n: u8 = rng.gen();
        sum += n as i32;
    }
    println!("{:?}", now.elapsed());
}

$ cargo run --release
2.721087246s

... and compared it to some C++ I hacked together...

$ cat main.cpp
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <iostream>

using namespace std::chrono;

int main() {
  srand((unsigned)time(0));
  auto start = high_resolution_clock::now();
  uint64_t sum = 0;

  for (int i = 0; i < 1 * 1000 * 1000 * 1000; i++) {
    sum += (rand() % 6) + 1;
  }

  auto finish = high_resolution_clock::now();
  auto delta = finish - start;
  auto seconds = duration_cast<duration<float>>(delta);
  std::cout << "Elapsed time: " << seconds.count() << " seconds\n";
}

$ clang++ -O3 main.cpp -o main

$ ./main
Elapsed time: 3.72771 seconds

That experiment shows Rust is about 27% faster.

I think the real problem isn't that adding a billion numbers is slow; it's that we need to generate each one of those billion numbers.

fwiw, I moved the rand::thread_rng() call to before we start timing and the duration dropped from 2.7 seconds to about 1.3.
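The same hoisting idea as a std-only sketch, with a toy LCG standing in for ThreadRng (the constants are Knuth's MMIX multipliers, not what the rand crate uses, and the iteration count is reduced so it also finishes quickly without optimizations):

```rust
use std::time::Instant;

// Toy linear congruential generator as a stand-in for ThreadRng
// (hypothetical; not the algorithm the rand crate actually uses).
struct Lcg(u64);

impl Lcg {
    fn next_u8(&mut self) -> u8 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 56) as u8 // take the high bits, which mix best in an LCG
    }
}

fn main() {
    // Hoisted: construct the generator once, before the timed loop.
    let mut rng = Lcg(42);
    let now = Instant::now();
    let mut sum: u64 = 0;
    for _ in 0..1_000_000 {
        sum += rng.next_u8() as u64;
    }
    println!("{:?}", now.elapsed());
    println!("{sum}");
}
```

Constructing the generator once, rather than once per iteration, removes the per-iteration setup cost from the timed region.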


Note that in addition to ThreadRng using global state under the hood, re-seeding the underlying PRNG involves interacting with the OS (e.g. via the getrandom syscall on Linux). This interaction is considered "observable" by the compiler, so it cannot be removed, regardless of whether its results are used.

In contrast, the following code gets optimized as you expect:

use rand::{Rng, SeedableRng};

fn main() {
    let now = std::time::Instant::now();
    let mut sum = 0;
    let mut rng = rand::rngs::SmallRng::seed_from_u64(42);
    for _ in 0..1_000_000_000 {
        let n: u8 = rng.gen();
        sum += n as i32;
    }
    println!("{:?}", now.elapsed());
}

Note that replacing SmallRng with StdRng (which is ChaCha12 under the hood) results in an execution time of more than one second. That's because the ChaCha implementation uses runtime CPU feature detection and updates a global atomic variable on the first call. Compiling with RUSTFLAGS="-C target-cpu=native" allows the compiler to properly eliminate the pure code.
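For reference, that flag is passed as an environment variable when building:

```shell
# Target the host CPU's full feature set so runtime feature detection
# (and its global atomic) can be resolved at compile time.
RUSTFLAGS="-C target-cpu=native" cargo run --release
```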


Thank you for the helpful explanations!

No, I meant that it's surprisingly fast. I didn't think it would be possible to generate and sum one billion numbers that quickly. But as Michael-F-Bryan showed, this is the expected speed for low-level languages (I'm talking about C, C++, and Rust).

Well, let's do a back-of-the-envelope calculation. A modern mid-range PC likely has a superscalar, pipelined, prefetching CPU with a raw clock frequency in the range of 1…3 GHz. If we assume a vector width of 4, then a 2 GHz CPU could sum around 8 billion numbers per second, assuming 1 cycle per instruction and prefetching smart enough to cancel out the (enormous) slowdown that memory access would otherwise cause. When you consider that one invocation of the PRNG is basically just bit shuffling for the most part, an overall budget of ~32 instructions per generated number seems entirely reasonable.
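The arithmetic above can be written out explicitly (all inputs are the assumed round numbers from the paragraph, not measurements):

```rust
fn main() {
    // Assumed figures for a mid-range CPU (guesses, not measurements):
    let clock_hz = 2.0e9_f64;  // 2 GHz clock
    let lanes = 4.0_f64;       // SIMD vector width of 4
    let instrs_per_num = 32.0; // rough budget: PRNG bit shuffling + add

    // Pure summing at 1 instruction per cycle across 4 lanes:
    let sums_per_sec = clock_hz * lanes; // 8e9 additions/s
    // Generating *and* summing, at ~32 instructions per number:
    let nums_per_sec = clock_hz * lanes / instrs_per_num; // 2.5e8 numbers/s

    // One billion numbers at that rate:
    let seconds = 1.0e9 / nums_per_sec; // 4 s, in the ballpark of the timings above
    println!("{sums_per_sec:e} {nums_per_sec:e} {seconds}");
}
```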

