Rust program has only 42% the speed of similar C++ program

Worth noting you can get O3 on Rust code with

opt-level = 3

in Cargo.toml (see Profiles - The Cargo Book)
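For reference, a minimal sketch of where that setting lives (following the layout described in the Cargo Book):

```toml
# Cargo.toml — opt-level goes under a profile table
[profile.release]
opt-level = 3
```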

I doubt this will change much, but it's worth checking. Might also be worth testing the C version compiled with clang so as to use LLVM and get almost exactly the same optimization passes just to test whether you're running into a difference between gcc's and LLVM's optimizations. Also unlikely, but it has happened.

I haven't looked at the code, so just offering these as other possibilities to look into if you're interested.

Isn't release already opt level 3? I know rustc -O is actually level 2, but cargo release should be 3 iirc

Oh, that's right. My bad.

I have a couple of suggestions. The first is that when passing references to a Vec it is idiomatic to use &[f64] rather than &Vec<f64>. The former is both more efficient (by a negligible amount) and more flexible.
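As a minimal sketch of the flexibility point (function and variable names are mine, not from the OP's code), a slice parameter accepts a borrowed Vec via deref coercion, as well as arrays and sub-slices:

```rust
// Taking &[f64] instead of &Vec<f64>: callers can pass a Vec,
// an array, or a sub-slice without any conversion.
fn sum_of_squares(values: &[f64]) -> f64 {
    values.iter().map(|v| v * v).sum()
}

fn main() {
    let v: Vec<f64> = vec![1.0, 2.0, 3.0];
    let a: [f64; 2] = [3.0, 4.0];
    println!("{}", sum_of_squares(&v));      // &Vec<f64> coerces to &[f64]
    println!("{}", sum_of_squares(&a));      // arrays work too
    println!("{}", sum_of_squares(&v[..2])); // and so do sub-slices
}
```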

The second suggestion is less certain, but I'd be tempted to make the dimension a const generic. This could make your code both cleaner and more efficient, since your particles could be stored as a Vec<[f64; DIMENSION]> and you could avoid your manual indexing with a stride. It would also allow you to easily express your movement as a [f64; DIMENSION] on the stack. This whole idea assumes that you are only likely to simulate in 2 or 3 dimensions. It's a bit faster because the function would be monomorphized, generating a separate version for each dimensionality. If you didn't want to mess with generics, you could also just make DIMENSION a const, and could potentially use a feature flag to determine 2D versus 3D.

If you really do need to pick arbitrarily large dimensions at runtime then I'd not change the implementation of dimension or positions, but would reuse a Vec for your movement.
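A small sketch of the const-generic idea (the names here are illustrative, not taken from the OP's program):

```rust
// DIM as a const generic: each dimensionality gets its own monomorphized
// copy of the function, which the compiler can fully unroll.
fn move_particle<const DIM: usize>(particle: &mut [f64; DIM], movement: &[f64; DIM]) {
    for d in 0..DIM {
        particle[d] += movement[d];
    }
}

fn main() {
    // Storage as Vec<[f64; DIM]> instead of a flat Vec<f64> with manual stride.
    let mut particles: Vec<[f64; 3]> = vec![[0.0; 3]; 4];
    let movement = [1.0, 2.0, 3.0]; // movement lives on the stack
    move_particle(&mut particles[0], &movement);
    println!("{:?}", particles[0]);
}
```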


In the author's case, dynamic dimensions are likely necessary; otherwise, an array or ArrayVec would have sufficed, as you mentioned. Keep in mind that all features supported by Rust's const generics are also supported by C++'s templates.


I'll point out that although the Mersenne Twister is fast compared to cryptographically strong random number generators, it is slow compared to generators designed specifically for simulations, which also produce better random numbers.

I would suggest SmallRng, which currently uses Xoshiro256PlusPlus, and then using the same algorithm in C++, which will improve both the speed and the quality of the random numbers in both simulations.


There is a technique of passing a stack-based primitive array into the constructor. Usually the downside of this technique is wasting unused space, but if the simulation uses all of the array elements offered, then for practical purposes it behaves like a dynamically sized stack array.

I first learned of this technique from the "stackfmt" crate.

pub struct WriteTo<'a> {
    buffer: &'a mut [u8],
    used: usize,    // Position inside buffer where the written string ends
    overflow: bool, // If formatted data was truncated
}

// Construction and string access
impl<'a> WriteTo<'a> {
    /// Creates new stream.
    pub fn new(buffer: &'a mut [u8]) -> Self {
        WriteTo {
            buffer,
            used: 0,
            overflow: false,
        }
    }
    // ... (rest omitted)
}

The slice has, historically, also been easier for LLVM to understand, so can sometimes be non-negligibly faster if it allows LLVM to remove a bunch of bounds checks: We all know `iter` is faster than `loop`, but why? - #3 by scottmcm
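A tiny sketch of the bounds-check point (my own example): with indexed access into a slice, LLVM has to prove each check away, whereas iterators avoid indices entirely.

```rust
// Indexed access: each xs[i] is, in principle, a bounds check
// that the optimizer must eliminate.
fn dot_indexed(xs: &[f64], ys: &[f64]) -> f64 {
    let mut acc = 0.0;
    for i in 0..xs.len().min(ys.len()) {
        acc += xs[i] * ys[i];
    }
    acc
}

// Iterator version: zip stops at the shorter slice, no indices, no checks.
fn dot_iter(xs: &[f64], ys: &[f64]) -> f64 {
    xs.iter().zip(ys).map(|(x, y)| x * y).sum()
}

fn main() {
    let xs = [1.0, 2.0, 3.0];
    let ys = [4.0, 5.0, 6.0];
    println!("{} {}", dot_indexed(&xs, &ys), dot_iter(&xs, &ys));
}
```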


If you know C++ at an expert level, then no, Rust can't be faster than C++.
It's not really meaningful to compare C/C++ with Rust.
The main concept of Rust is safety based on strict rules and checks, combined with low-level access to system resources similar to that of C/C++.

That's not really true. Or at least, putting it this way seems dishonest to me. It suggests that Rust is at most as fast as, or slower than, C++.

However, since the low-level memory models of the two languages are very similar, there's nothing really fundamental that would ascribe a general performance difference between the two. Both Rust and C++ give the programmer control over allocations, layout, pointers, and feature optional reference counting, to mention a few.

It is true that due to some constraints/guarantees in the type system, some Rust code can be optimized better or easier than the equivalent C++ code. On the other hand, it is also true that there are perhaps more mature/more specific C++ optimizer implementations (it used to be the case for several years that C compiled with GCC consistently outperformed C compiled with Clang+LLVM) and libraries that have been around for a longer time than their Rust equivalents.

However, these differences can usually be observed in microbenchmarks, and there is more to performance than microbenchmarks. Furthermore, I wouldn't say that either Rust or C++ is intrinsically predestined to be slower than the other – the differences are mostly technical, some of them are temporary, and low-level optimizers/code generators are perpetually playing a cat-and-mouse game in terms of benchmark results.


Moreover, Rust's strict aliasing rules mean it may be faster than C++, sometimes considerably so in certain pieces of code.

However, generally, they're equivalent.

use rand::distributions::{Standard, Uniform};
use rand::rngs::StdRng;
use rand::Rng;

struct System2<const PAR: usize, const DIM: usize, const PXD: usize> {
    my_particles: [f64; PXD],
    my_neg_alpha: f64,
    my_metro_distrib: Uniform<usize>,
    my_delta_distrib: Uniform<f64>,
    my_ratio_distrib: Standard,
}

impl<const PAR: usize, const DIM: usize, const PXD: usize> System2<PAR, DIM, PXD> {

    pub fn new(rng: &mut StdRng, alpha: f64, delta_t: f64) -> System2<PAR, DIM, PXD> {
        let mut tmp_box = [0.0f64; PXD];
        let particle_range = Uniform::<f64>::new(-1., 1.);
        for indx in 0..PXD {
            tmp_box[indx] = rng.sample(&particle_range);
        }
        System2 {
            my_particles: tmp_box,
            my_neg_alpha: -alpha,
            my_metro_distrib: Uniform::<usize>::new(0usize, PAR),
            my_delta_distrib: Uniform::<f64>::new(-delta_t, delta_t),
            my_ratio_distrib: Standard,
        }
    }

    fn r_squared(&self) -> f64 {
        let mut r_squared: f64 = 0.;
        for indx in 0..PXD {
            let v = self.my_particles[indx];
            r_squared += v * v;
        }
        r_squared
    }

    pub fn evaluate_with_r_squared(&self, r_sq: f64) -> f64 {
        (self.my_neg_alpha * r_sq).exp()
    }

    fn metropolis_step(&mut self, rng: &mut StdRng) -> usize {
        let particle: usize = rng.sample(&self.my_metro_distrib);
        let mut movement_box = [0.0f64; DIM];
        for indx in 0..DIM {
            movement_box[indx] = rng.sample(&self.my_delta_distrib);
        }
        let r_sq_before: f64 = self.r_squared();
        let wave_before = self.evaluate_with_r_squared(r_sq_before);
        let wave_before2 = wave_before * wave_before;
        for dim in 0..DIM {
            self.my_particles[particle * DIM + dim] += movement_box[dim];
        }
        let r_sq_after: f64 = self.r_squared();
        let wave_after = self.evaluate_with_r_squared(r_sq_after);
        let ratio = (wave_after * wave_after) / wave_before2;
        let roll: f64 = rng.sample(&self.my_ratio_distrib);
        if roll < ratio {
            1 // move accepted
        } else {
            // move rejected: undo it
            for dim in 0..DIM {
                self.my_particles[particle * DIM + dim] -= movement_box[dim];
            }
            0
        }
    }

    pub fn run(&mut self, metropolis_steps: usize, rng: &mut StdRng) {
        let mut accepted_steps: usize = 0;
        for _ in 0..metropolis_steps {
            accepted_steps += self.metropolis_step(rng);
        }
        println!("Accepted: {}", accepted_steps);
    }
}

// let mut system2 = System2::<5, 2, 10>::new(&mut rng2, alpha, delta_t);

I get 118 ms (old) vs 51 ms (new).

Used const generics to enable loop unrolling, and eliminated some RNG cloning.


There is also this crate: GitHub - artichoke/rand_mt: 🌪 Mersenne Twister implementation backed by rand_core

Another thing you can do to go for maximum performance on the resulting binary is placing the following release profile in your Cargo.toml:

[profile.release]
opt-level = 3 # Maximum optimization for performance with no regard for binary size
codegen-units = 1 # Use a single codegen unit instead of several, which can lead to better performance at the cost of compilation time
debug = false # No debug symbols
debug-assertions = false
lto = "fat" # Link-time optimization across all dependencies at the cost of longer compilation
rpath = false
panic = "abort"

Combining this with the following command to compile always provides me with the highest performance I can get from my applications:

RUSTFLAGS="-Ctarget-cpu=native" cargo build --release --target=<your_target_triple, e.g. x86_64-unknown-linux-gnu>

Though I can only advise you to play around with these settings and their values to find the optimum for your case.

Thank you!

Although this library only seems to generate unsigned integers.

You can enable the rand-traits feature for the types to implement RngCore, then use Rng::gen() to extract arbitrary integers.


The first is that when passing references to a Vec it is idiomatic to use &[f64] rather than &Vec<f64>. The former is both more efficient (by a negligible amount) and more flexible.

In my first proper Rust program (which was a rewrite of C# and Python versions) I passed &Vec, as my goal was initially just to get a correctly working program. But I remembered seeing the idiomatic &[].

When I substituted that on my second pass I had a significant performance speedup. The timing went from 0.36s to 0.28s.


Thank you, but I must confess I do not really understand how I would do this. Could you please provide a small example snippet?

The traits are designed to be very simple to use.

// In Cargo.toml:
// rand = "0.8.5"
// rand_mt = { version = "4.1.1", features = ["rand-traits"] }

use rand::{Rng, SeedableRng};
use rand_mt::Mt64;

fn main() {
    let mut rng = Mt64::from_entropy();
    for _ in 0..10 {
        println!("{}", rng.gen_range(-1.0..=1.0));
    }
}
(Alternatively, any of the other Rng methods can be used to generate values.)


This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.