Not optimized properly?

I was just messing around to see which is optimized more and saw the following results.

Overall (ns)
C++ with -O3: 6037338
C++ with -Os: 11049853
Rust with alias: 11361628
Rust with noalias: 11535222

Average (ns)
C++ with -O3: 6037.338
C++ with -Os: 11049.853
Rust with alias: 11361.628
Rust with noalias: 11535.222

Standard deviation (ns)
C++ with -O3: 7285.095556803356
C++ with -Os: 6220.168290439657
Rust with alias: 14224.509900436498
Rust with noalias: 7729.57509380147

Rust code
use std::{io, time::Instant};
fn main() {
    let mut s = String::new();
    io::stdin().read_line(&mut s).expect("Parsing error");
    let inp = s.trim().parse::<i32>().unwrap();
    let mut a: [i32; 100_000] = [0; 100_000];
    let mut b: [i32; 100_000] = [0; 100_000];
    let mut c: [i32; 100_000] = [0; 100_000];
    let n = Instant::now();
    for i in 0..inp {
        a[i as usize] = i;
    }
    for i in 0..inp {
        b[i as usize] = i;
    }
    for i in 0..inp {
        c[i as usize] = a[i as usize] + b[i as usize];
    }
    let now = n.elapsed();
    println!("{:?} {}", now.as_nanos(), c[inp as usize - 1]);
}

Is there some reason why the alias version runs at the same or a similar speed as the noalias version? And why is C++ so much faster even without any aliasing disabled? The program fills 100,000-element integer arrays, and I used equivalent code in Rust and C++, so shouldn't the Rust noalias build at least be somewhat faster?

We can't really say anything without the code

But a few thoughts:

C++ has type-based alias analysis, so it's not technically correct to say it has "no aliasing disabled"

Whether noalias has any impact at all really depends on your program (which again we have no idea what yours is)

Did you use cargo run/build --release?

With no code whatsoever, it's impossible for anyone to say. My best guess would be that your two programs are not actually doing the same thing.


I used cargo run --release with noalias enabled for the rustnoalias build, and plain cargo run --release for the rustalias build


inp may be bigger than 100_000, meaning those array accesses have to be bounds-checked. Bounds checks prevent auto-vectorization, which can speed things up significantly. Auto-vectorization may also be disabled at -Os, which would explain why C++ with -Os performs like Rust.

Usually performance issues like this can be solved in 2 ways:

  • By moving the assertion before the loops and relying on LLVM to remove the bounds checks from the loop bodies. This approach, however, is not always the best and can often fail to trigger the optimization, so I wouldn't rely on it. In fact I would expect LLVM to be able to optimize away the bounds checks in this code, but it doesn't.
    let inp = inp as usize;
    assert!(inp <= 100_000);
    for i in 0..inp {
        a[i] = i as i32;
    }
    for i in 0..inp {
        b[i] = i as i32;
    }
    for i in 0..inp {
        c[i] = a[i] + b[i];
    }
  • By using iterators. This is the most idiomatic way and often they optimize better than plain indexing. Here I use for-loops because it doesn't make a difference, but in general internal iteration (e.g. for_each and fold) can often be optimized even better.
    let inp = inp as usize;
    for (i, ai) in a[0..inp].iter_mut().enumerate() {
        *ai = i as i32;
    }
    for (i, bi) in b[0..inp].iter_mut().enumerate() {
        *bi = i as i32;
    }
    for ((ci, ai), bi) in c[0..inp].iter_mut().zip(&a).zip(&b) {
        *ci = *ai + *bi;
    }

Don't use [] in Rust. It's generally slow. Use iterators instead.

  • [i] in Rust is the equivalent of vector.at(i) in C++.
  • .get_unchecked(i) is the equivalent of [i] in C++.

This code with the same benchmarks was posted to /r/cpp before it got removed, if you want a comparison. The reserves were recommended to speed it up.

#include <iostream>
#include <vector>
#include <chrono>
#include <unistd.h>
using namespace std;

int main() {
    vector<int> a, b, c;
    int n;
    cin >> n;
    //int n = 1000000000;
    //a.reserve(n);
    //b.reserve(n);
    //c.reserve(n);
    auto start = chrono::steady_clock::now();
    for (int i = 0; i < n; i++) {
        a.push_back(i);
    }
    for (int i = 0; i < n; i++) {
        b.push_back(i);
    }
    for (int i = 0; i < n; i++) {
        c.push_back(a[i] + b[i]);
    }
    auto end = chrono::steady_clock::now();
    cout << chrono::duration_cast<chrono::milliseconds>(end - start).count()
         << " ";
    cout << c[n - 1] << "\n";
    return 0;
}

This was a Rust version of the code that someone said should be fast, and it's pretty standard code.

use std::time::Instant;

fn main() {
    let now = Instant::now();
    let n = 1000000000;
    let a: Vec<_> = (0..n).collect();
    let b: Vec<_> = (0..n).collect();
    let c: Vec<_> = a.iter().zip(b.iter()).map(|(a, b)| *a + *b).collect();
    println!("{}", now.elapsed().as_millis());
    println!("{:?}",c.len());
}

On my machine the Rust code takes 2.2 seconds and the C++ code (using -O2) takes 6.7 seconds, where both are hardcoded with an input of 1000000000 and the C++ one reserves the vectors.

I also tried switching on noalias with this, but it's slower.

RUSTFLAGS="-Z mutable-noalias=yes" cargo +nightly run --release

Is there any reason why it is slower when noalias is enabled? I thought it should be at least a tiny bit faster, since it doesn't have the extra aliasing assumptions to work around.

Not 100% sure why, but I'm getting this:

cargo run --release = 2467ms
cargo +nightly run --release = 2372ms
RUSTFLAGS="-Z mutable-noalias=yes" cargo +nightly run --release = 2504ms