Rust program has only 42% the speed of a similar C++ program

I recently started picking up Rust (like, yesterday). I wrote a very simple Metropolis algorithm, but when I run it, I get much worse performance compared to a C++ program that does the exact same thing. I am not able to understand why it is so much slower. I have compiled in Rust's release mode, but to no avail. I have also tried setting RUSTFLAGS="-C target-cpu=native" before compiling, but it remains just as slow.
I am running these benchmarks on an M1 chip running Asahi Linux. Might this be the reason, or have I done something wrong in my code?

Here is my code: Metropolis benchmark code in rust and c++

These programs are not doing exactly the same thing. I am pretty sure the difference comes from using Vec, which performs a dynamic heap allocation in the metropolis_step function. In your C++ version you're just using a dynamically sized array allocated on the stack. Unfortunately, Rust does not support that. Instead, you should preallocate the Vec and reuse it for each iteration. Another option is to use something like the arrayvec crate.
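The reuse pattern could look roughly like this. A minimal, std-only sketch; `dimensions`, the loop bounds, and the values written (a stand-in for `rng.sample(...)` calls) are illustrative assumptions, not the original code:

```rust
// Minimal sketch: allocate the movement buffer once, then refill it in place
// on every step instead of collecting a fresh Vec inside metropolis_step.
fn main() {
    let dimensions = 3;
    let mut movement = vec![0.0f64; dimensions]; // one heap allocation, up front

    for step in 0..4 {
        // Refill in place: no new heap allocation per iteration.
        for (i, m) in movement.iter_mut().enumerate() {
            *m = (step * dimensions + i) as f64 * 0.1; // stand-in for rng.sample(...)
        }
        let r_squared: f64 = movement.iter().map(|&p| p * p).sum();
        println!("step {step}: r_squared = {r_squared}");
    }
}
```

The same idea applies to any per-iteration buffer: hoist the allocation out of the hot loop and overwrite the contents each time.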

2 Likes

Thank you so very much, I suspected it would be either that or the random number generation. I tried changing the code to use arrayvec. Is it possible to give an arrayvec a size defined at runtime? And is it possible to make the random number generation work in a more static way?

There seems to be some significant overhead from bounds checking, too. I've played around with the code and made it more idiomatic (using iterators, in particular zip and sum, with no more explicit indexing of individual items), which brings it not all the way to the C++ speed, but a bit closer. I suppose the rest might be down to the different RNG implementation.

diff --git a/main.rs b/main.rs
index 4520871..756eb0b 100644
--- a/main.rs
+++ b/main.rs
@@ -1,35 +1,25 @@
 use rand::{distributions::Uniform, Rng};
-use rand::{SeedableRng, rngs::StdRng};
+use rand::{rngs::StdRng, SeedableRng};
 
 use std::time::Instant;
 
 struct WaveFunction {
-    num_particles: usize,
-    dimensions: usize,
     alpha: f64,
 }
 
 impl WaveFunction {
-    pub fn new(num_particles: usize, dimensions: usize, alpha: f64) -> WaveFunction {
+    pub fn new(alpha: f64) -> WaveFunction {
         Self {
-            num_particles,
-            dimensions,
-            alpha
+            alpha,
         }
     }
 
-    fn compute_r_squared(&self, particles: &Vec<f64>) -> f64 {
-        let mut r_squared = 0.;
-
-        for i in 0..self.num_particles*self.dimensions {
-            r_squared += particles[i] * particles[i];
-        }
-
-        r_squared
+    fn compute_r_squared(&self, particles: &[f64]) -> f64 {
+        particles.iter().map(|&p| p*p).sum()
     }
 
-    pub fn evaluate(&self, particles: &Vec<f64>) -> f64 {
-        (- self.alpha * self.compute_r_squared(particles)).exp()
+    pub fn evaluate(&self, particles: &[f64]) -> f64 {
+        (-self.alpha * self.compute_r_squared(particles)).exp()
     }
 }
 
@@ -43,11 +33,16 @@ struct System {
 }
 
 impl System {
-    pub fn new(num_particles: usize, dimensions: usize, alpha: f64, delta_t: f64, seed: u64) -> System {
-        let rng: StdRng = SeedableRng::seed_from_u64(seed);
-
-        let particles = rng
-            .clone()
+    pub fn new(
+        num_particles: usize,
+        dimensions: usize,
+        alpha: f64,
+        delta_t: f64,
+        seed: u64,
+    ) -> System {
+        let mut rng: StdRng = SeedableRng::seed_from_u64(seed);
+
+        let particles = (&mut rng)
             .sample_iter(&Uniform::<f64>::new(-1., 1.))
             .take(num_particles * dimensions)
             .collect();
@@ -57,7 +52,7 @@ impl System {
             particles,
             num_particles,
             dimensions,
-            wave_function: WaveFunction::new(num_particles, dimensions, alpha),
+            wave_function: WaveFunction::new(alpha),
             delta_t,
         }
     }
@@ -72,15 +67,18 @@ impl System {
     fn metropolis_step(&mut self) -> bool {
         let particle = self.rng.gen_range(0..self.num_particles);
 
-        let movement: Vec<f64> = self.rng.clone()
+        let movement: Vec<f64> = (&mut self.rng)
             .sample_iter(&Uniform::<f64>::new(-self.delta_t, self.delta_t))
             .take(self.dimensions)
             .collect();
 
         let wave_before = self.wave_function.evaluate(&self.particles);
 
-        for dim in 0..self.dimensions {
-            self.particles[particle * self.dimensions + dim] += movement[dim];
+        for (p, m) in self.particles[particle * self.dimensions..]
+            .iter_mut()
+            .zip(&movement)
+        {
+            *p += *m;
         }
 
         let wave_after = self.wave_function.evaluate(&self.particles);
@@ -88,11 +86,14 @@ impl System {
         let ratio = (wave_after * wave_after) / (wave_before * wave_before);
 
         if self.rng.gen::<f64>() < ratio {
-            return true
+            return true;
         }
 
-        for dim in 0..self.dimensions {
-            self.particles[particle * self.dimensions + dim] -= movement[dim];
+        for (p, m) in self.particles[particle * self.dimensions..]
+            .iter_mut()
+            .zip(&movement)
+        {
+            *p -= *m;
         }
 
         false
$ cargo run --release
   Compiling benchmark v0.1.0 (/home/frank/forks/just-a-clone/tmp/2283842)
    Finished release [optimized] target(s) in 0.49s
     Running `target/release/benchmark`
Accepted: 809759
115
$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/benchmark`
Accepted: 809759
110
$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/benchmark`
Accepted: 809759
110
$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/benchmark`
Accepted: 809759
110
$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/benchmark`
Accepted: 809759
109
$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/benchmark`
Accepted: 809759
110
$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/benchmark`
Accepted: 809759
110
$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/benchmark`
Accepted: 809759
110
$ ./a.out
809554
Time: 95
$ ./a.out
809554
Time: 102
$ ./a.out
809554
Time: 100
$ ./a.out
809554
Time: 97
$ ./a.out
809554
Time: 97
$ ./a.out
809554
Time: 102
$ ./a.out
809554
Time: 101
$ ./a.out
809554
Time: 100

(so roughly 90% the speed now, though they also started out a bit closer together on my machine)

4 Likes

No, internally it uses a fixed-size array, so it's the same problem all over again. I think you have two possibilities here:

  • use a crate that offers alloca in Rust to emulate the C++ code, though from a quick search those don't seem very popular
  • use a crate that offers a small-vector optimization, like tinyvec, which you should probably prefer anyway, since large allocations shouldn't go on the stack in the first place
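For illustration, here is a hand-rolled, std-only sketch of what a small-buffer type buys you (crates like arrayvec and tinyvec automate this, tinyvec with an optional spill to the heap). `MAX_DIM`, `SmallBuf`, and all field names are illustrative assumptions:

```rust
// Hand-rolled small-buffer sketch: storage lives on the stack with a fixed
// capacity, while the *used* length is chosen at runtime.
const MAX_DIM: usize = 8;

struct SmallBuf {
    data: [f64; MAX_DIM], // stack storage, no heap allocation
    len: usize,           // runtime-chosen length, <= MAX_DIM
}

impl SmallBuf {
    fn new(len: usize) -> Self {
        assert!(len <= MAX_DIM, "requested length exceeds stack capacity");
        SmallBuf { data: [0.0; MAX_DIM], len }
    }

    fn as_slice(&self) -> &[f64] {
        &self.data[..self.len]
    }

    fn as_mut_slice(&mut self) -> &mut [f64] {
        let len = self.len;
        &mut self.data[..len]
    }
}

fn main() {
    // "Dynamic" size (3 of 8 slots used), but still entirely on the stack.
    let mut movement = SmallBuf::new(3);
    for (i, m) in movement.as_mut_slice().iter_mut().enumerate() {
        *m = i as f64;
    }
    println!("{:?}", movement.as_slice());
}
```

The downside is exactly the one mentioned above: the unused slots are wasted space, and exceeding the fixed capacity is a hard error (or, in tinyvec's TinyVec, a heap spill).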
3 Likes

Thank you so very much for taking the time to do all this! Your code is beautiful! Guess I still have a lot to learn when it comes to Rust :)

Using your optimizations, I was able to get the Rust code down to ~105 ms, which is great; however, the C++ compiled with g++ -O3 src/main.cpp still runs in ~60 ms. Using flamegraph, I see that 20% (so about 20 ms) is spent in the RNG, meaning it does not explain the entire difference.

Is it possible that the problem is mostly that I compile on ARM? And maybe, as @darksv said, that the array is defined in stack memory for the C++ code?

Edit: Turns out there were two RNG-related blocks, each taking about 20 ms, which adds up to 40 ms, which is the difference (if we assume that the RNG in C++ runs instantly)

Edit 2: When running the C++ code, I am sometimes able to get as low as 40 ms. I am using the O3 optimization level, so g++ might be doing some optimizations that Rust does not? Also, the random engine uses only 7% of those 40 ms, meaning it runs nearly instantly (in about 2.8 ms)

You shouldn't consider the "sometimes" case. You should use a robust measure of central tendency in order to obtain the typical running time, such as the median or the interquartile mean. The extrema (either low or high) shouldn't be considered indicative at all. Also consider using a statistically-sound microbenchmarking framework, e.g. Criterion.
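A median is only a few lines of std-only Rust; the run times below are illustrative sample values, not measurements from this thread:

```rust
// Sketch: report the median of repeated timing samples instead of the best run,
// so a single outlier run barely affects the reported figure.
fn median_ms(samples: &mut [u64]) -> u64 {
    samples.sort_unstable();
    let n = samples.len();
    if n % 2 == 1 {
        samples[n / 2]
    } else {
        // Even count: average the two middle samples.
        (samples[n / 2 - 1] + samples[n / 2]) / 2
    }
}

fn main() {
    let mut runs = [110, 115, 110, 109, 110, 40, 110, 110]; // one 40 ms outlier
    println!("median: {} ms", median_ms(&mut runs)); // prints "median: 110 ms"
}
```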

It's hard to tell without looking at the assembly, but in general, Rust is easier to optimize than C++, because its type system guarantees more constraints are upheld, and so it's possible for LLVM to perform some optimizations that it can't do in C++ (google for example "rust llvm noalias"). So no, it's actually the opposite.

You are using a Mersenne Twister PRNG in the C++ code, which is a PRNG designed for speed (it doesn't necessarily have very good statistical properties, but it's generally considered good enough for simulations). You could try using the exact same algorithm in Rust, e.g. there's a mersenne_twister crate.

8 Likes

Worth noting you can get O3 on Rust code with

[profile.release]
opt-level = 3

in Cargo.toml (see Profiles - The Cargo Book)

I doubt this will change much, but it's worth checking. It might also be worth testing the C++ version compiled with clang++, so that you use LLVM and get almost exactly the same optimization passes, just to test whether you're running into a difference between GCC's and LLVM's optimizations. Also unlikely, but it has happened.

I haven't looked at the code, so just offering these as other possibilities to look into if you're interested.

Isn't release already opt level 3? I know rustc -O is actually level 2, but cargo's release profile should be 3, IIRC.

Oh, that's right. My bad.

I have a couple of suggestions. The first is that when passing references to a Vec, it is idiomatic to use &[f64] rather than &Vec<f64>. The former is both more efficient (by a negligible amount) and more flexible.
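A quick illustration of why &[f64] is more flexible: the &Vec<f64> coerces automatically, and the same function also accepts arrays and sub-slices (the function and values here are made up for the example):

```rust
// Taking a slice instead of &Vec<f64>: same call sites, more callers.
fn compute_r_squared(particles: &[f64]) -> f64 {
    particles.iter().map(|&p| p * p).sum()
}

fn main() {
    let v: Vec<f64> = vec![1.0, 2.0];
    let a: [f64; 2] = [3.0, 4.0];
    println!("{}", compute_r_squared(&v));      // &Vec<f64> coerces to &[f64]; prints 5
    println!("{}", compute_r_squared(&a));      // arrays work too; prints 25
    println!("{}", compute_r_squared(&v[..1])); // and sub-slices; prints 1
}
```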

The second suggestion is less certain, but I'd be tempted to make the dimension a const generic. This could make your code both cleaner and more efficient, since your particles could be stored as a Vec<[f64; DIMENSION]> and you could avoid your manual indexing with a stride. It would also allow you to easily express your movement as a [f64; DIMENSION] on the stack. This whole idea assumes that you are likely to only simulate in 2 or 3 dimensions. It's a bit faster because the function would be monomorphized, generating separate versions for each dimensionality. If you didn't want to mess with generics, you could also just make DIMENSION a const, and could potentially use a feature flag to choose 2D versus 3D.

If you really do need to pick arbitrarily large dimensions at runtime then I'd not change the implementation of dimension or positions, but would reuse a Vec for your movement.
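A sketch of that layout, with illustrative names (not the original code): positions become a Vec of fixed-size coordinate arrays, the movement is a plain stack array, and no stride arithmetic is needed:

```rust
// Const-generic layout sketch: DIM is a compile-time constant, so each
// particle's coordinates are a [f64; DIM] and movement never touches the heap.
struct Particles<const DIM: usize> {
    positions: Vec<[f64; DIM]>,
}

impl<const DIM: usize> Particles<DIM> {
    fn r_squared(&self) -> f64 {
        self.positions
            .iter()
            .flat_map(|p| p.iter())
            .map(|&x| x * x)
            .sum()
    }

    // The movement is a stack array: no per-step heap allocation,
    // and indexing a particle needs no stride arithmetic.
    fn translate(&mut self, index: usize, movement: [f64; DIM]) {
        for (x, m) in self.positions[index].iter_mut().zip(movement) {
            *x += m;
        }
    }
}

fn main() {
    let mut p = Particles::<2> { positions: vec![[0.0; 2]; 3] };
    p.translate(1, [1.0, 2.0]);
    println!("{}", p.r_squared()); // 1 + 4 = 5
}
```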

1 Like

In the author's case, dynamic dimensions are likely necessary; otherwise, an array or ArrayVec would have sufficed, as you mentioned. Keep in mind that all features supported by Rust's const generics are also supported by C++'s templates.

3 Likes

I'll point out that although the Mersenne Twister is fast compared to cryptographically strong random number generators, it is slow when compared to generators designed for simulations, which also produce better random numbers.

I would suggest SmallRng, which currently uses Xoshiro256PlusPlus, and then using the same algorithm in C++, which will improve both the speed and the statistical quality of both simulations.
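For reference, xoshiro256++ is small enough to write out by hand, which also makes it easy to port verbatim to the C++ side for an apples-to-apples comparison. This is a std-only sketch; the splitmix64 seeding below is an illustrative choice, not necessarily SmallRng's exact seeding:

```rust
// Std-only sketch of the xoshiro256++ algorithm (the generator behind
// SmallRng on 64-bit targets at the time of writing).
struct Xoshiro256pp {
    s: [u64; 4],
}

impl Xoshiro256pp {
    // Expand one u64 into four state words with splitmix64 (illustrative seeding).
    fn from_seed(seed: u64) -> Self {
        let mut x = seed;
        let mut next = || {
            x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = x;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            z ^ (z >> 31)
        };
        Xoshiro256pp { s: [next(), next(), next(), next()] }
    }

    fn next_u64(&mut self) -> u64 {
        // Output function: rotl(s0 + s3, 23) + s0, then the state transition.
        let result = self.s[0]
            .wrapping_add(self.s[3])
            .rotate_left(23)
            .wrapping_add(self.s[0]);
        let t = self.s[1] << 17;
        self.s[2] ^= self.s[0];
        self.s[3] ^= self.s[1];
        self.s[1] ^= self.s[2];
        self.s[0] ^= self.s[3];
        self.s[2] ^= t;
        self.s[3] = self.s[3].rotate_left(45);
        result
    }
}

fn main() {
    let mut rng = Xoshiro256pp::from_seed(42);
    for _ in 0..3 {
        println!("{}", rng.next_u64());
    }
}
```

In practice you'd use SmallRng (behind the rand crate's small_rng feature) or the rand_xoshiro crate rather than hand-rolling this; the point is that the same dozen lines drop straight into the C++ benchmark.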

3 Likes

There is a technique of passing a stack-based primitive array into the constructor. Usually the downside of this technique is wasted, unused space. But if the simulation uses all of the array elements offered, then it would, for practical purposes, be a dynamically sized stack array.

I first learned of this technique from the "stackfmt" crate.

pub struct WriteTo<'a> {
    buffer: &'a mut [u8],
    used: usize,    // Position inside buffer where the written string ends
    overflow: bool, // If formatted data was truncated
}

// Construction and string access
impl<'a> WriteTo<'a> {
    /// Creates new stream.
    pub fn new(buffer: &'a mut [u8]) -> Self {
        WriteTo {
            buffer,
            used: 0,
            overflow: false,
        }
    }
... (omitted)
1 Like

The slice has, historically, also been easier for LLVM to understand, so can sometimes be non-negligibly faster if it allows LLVM to remove a bunch of bounds checks: We all know `iter` is faster than `loop`, but why? - #3 by scottmcm

4 Likes

IMO,
if you know C++ at an expert level, nothing in Rust can be faster than C++.
It's not really correct to compare C/C++ with Rust.
The main concept of Rust is safety, based on strict rules and checks, plus low-level access to system resources similar to what C/C++ offer.

That's not really true. Or at least, putting it this way seems dishonest to me. It suggests that Rust is at most as fast as, or slower than, C++.

However, since the low-level memory models of the two languages are very similar, there's nothing really fundamental that would ascribe a general performance difference between the two. Both Rust and C++ give the programmer control over allocations, layout, pointers, and feature optional reference counting, to mention a few.

It is true that due to some constraints/guarantees in the type system, some Rust code can be optimized better or more easily than the equivalent C++ code. On the other hand, it is also true that there are perhaps more mature or more specialized C++ optimizer implementations (it used to be the case for several years that C compiled with GCC consistently outperformed C compiled with Clang+LLVM) and libraries that have been around for longer than their Rust equivalents.

However, these differences can usually be observed in microbenchmarks, and there is more to performance than microbenchmarks. Furthermore, I wouldn't say that either Rust or C++ is intrinsically predestined to be slower than the other – the differences are mostly technical, some of them are temporary, and low-level optimizers/code generators are perpetually playing a cat-and-mouse game in terms of benchmark results.

19 Likes

Moreover, the stricter aliasing rules mean Rust may be faster than C++, even much faster in some pieces of code.

However, generally, they're equivalent.

1 Like
struct System2<const PAR:usize, const DIM:usize, const PXD:usize> {
    my_particles: [f64; PXD],
    my_neg_alpha: f64,
    my_metro_distrib: Uniform<usize>,
    my_delta_distrib: Uniform<f64>,
    my_ratio_distrib: Standard,
}

impl<const PAR:usize, const DIM:usize, const PXD:usize> System2<PAR,DIM,PXD> {

    pub fn new(rng: & mut StdRng, alpha: f64, delta_t: f64) ->
    System2<PAR,DIM,PXD> {
        let mut tmp_box = [0.0f64; PXD];
        let particle_range = Uniform::<f64>::new(-1., 1.);
        for indx in 0 .. PXD {
            tmp_box[indx] = rng.sample(& particle_range);
        }
        System2 {
            my_particles: tmp_box,
            my_neg_alpha: - alpha,
            my_metro_distrib: Uniform::<usize>::new(0usize, PAR),
            my_delta_distrib: Uniform::<f64>::new(- delta_t, delta_t),
            my_ratio_distrib: Standard,
        }
    }

    #[inline]
    fn r_squared(&self) -> f64 {
        let mut r_squared:f64 = 0.;
        for indx in 0 .. PXD {
            let v = self.my_particles[indx];
            r_squared += v * v;
        }
        r_squared
    }

    #[inline]
    pub fn evaluate_with_r_squared(&self, r_sq: f64) -> f64 {
        (self.my_neg_alpha * r_sq).exp()
    }

    fn metropolis_step(&mut self, rng: & mut StdRng) -> usize {
        let particle:usize = rng.sample(& self.my_metro_distrib);
        let mut movement_box = [0.0f64; DIM];
        for indx in 0 .. DIM {
            movement_box[indx] = rng.sample(& self.my_delta_distrib);
        }
        let r_sq_before:f64 = self.r_squared();
        let wave_before = self.evaluate_with_r_squared(r_sq_before);
        let wave_before2 = wave_before * wave_before;
        for dim in 0 .. DIM {
            self.my_particles[particle * DIM + dim] += movement_box[dim];
        }
        let r_sq_after:f64 = self.r_squared();
        let wave_after = self.evaluate_with_r_squared(r_sq_after);
        let ratio = (wave_after * wave_after) / wave_before2;
        let roll:f64 = rng.sample(& self.my_ratio_distrib);
        if roll < ratio {
            1
        }
        else {
            for dim in 0 .. DIM {
                self.my_particles[particle * DIM + dim] -= movement_box[dim];
            }
            0
        }
    }

    pub fn run(&mut self, metropolis_steps: usize, rng: & mut StdRng) {
        let mut accepted_steps:usize = 0;
        for _ in 0 .. metropolis_steps {
            accepted_steps += self.metropolis_step(rng);
        }
        println!("Accepted: {}", accepted_steps);
    }
}
// let mut system2 = System2::<5,2,10>::new(& mut rng2, alpha, delta_t);

I have 118 ms (old) vs 51ms (new).

Using const generic to enable loop unrolling. Eliminated some RNG cloning.

6 Likes

There is also this crate: rand_mt (GitHub - artichoke/rand_mt), a Mersenne Twister implementation backed by rand_core.