Comparing Raw Numbers for bulk addition in a vector (960000000 numbers)

I will start by saying that Idk a whole lot about parallelism/Rust/C++.
I wrote a couple of toy programs to test raw numbers for mass addition in Rust and C++:
C++:

vector<long long> x(960000000);
std::atomic<long long> res = 0ll;

void methodA() {
    res += accumulate(x.begin(), x.end(), 0ll);
    cout << res << '\n';
}

void add(long long a, long long b, const vector<long long> &x) {
    res += accumulate(x.begin() + a, x.begin() + b, 0ll) << '\n';
}

void methodC() {
    res = 0ll;
    long long d = 960000000 / 16; // (16 == thread count == num_cpus::get())
    vector<thread> threads(8);
    for (long long i = 0ll; i < 8; i++)
        threads[i] = thread(add, i * d, (i + 1) * d, std::ref(x));
    for (int i = 0; i < 8; i++)
        threads[i].join();
}

int main() {
    iota(x.begin(), x.end(), 1ll);
    auto start = std::chrono::high_resolution_clock::now();
    methodA();
    auto end = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    cout << "Method A took " << time << "ms" << endl;

    start = std::chrono::high_resolution_clock::now();
    methodC();
    end = std::chrono::high_resolution_clock::now();
    time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    cout << "Method C took " << time << "ms" << endl;
}

Rust:

use std::{sync::Arc, thread, time::Instant};

fn add(x: &[u64]) -> u64 {
    x.iter().sum()
}
fn main() {
    let ulimit = 960000000;
    let mut x = Vec::from_iter(1..=ulimit);

    // methodA
    println!("sequential addition started");
    let mut a = Instant::now();
    let b = add(&x);
    println!("time for sequential : {:#?}", a.elapsed());
    println!("from sequential addition : {}", b);

    let cpus = num_cpus::get();
    println!("number of cpus : {}", cpus);

    // method C
    println!("parallel addition started");
    a = Instant::now();
    let refr = Arc::new(x);
    let d = ulimit / (cpus as u64);
    let mut thread_handles = Vec::with_capacity(d as usize);
    for i in 0..cpus {
        let y = refr.clone();
        let d2 = d as usize;
        let x = thread::spawn(move || add(&y[(i * d2)..((i + 1) * d2)]));
        thread_handles.push(x);
    }
    let mut res = 0;
    for handle in thread_handles.into_iter() {
        res += handle.join().unwrap();
    }
    println!("time for parallel : {:#?}", a.elapsed());
    println!("from parallel addition : {}", res);
}

C++ was compiled with :
g++ -pthread -O3 -ffast-math x.cpp
Rust was compiled with:
cargo r -- --release
(with no profile in Cargo.toml)
I am aware of the two major differences between them:

  1. Use of atomic in C++ code.
  2. returning u64 in Rust.

Upon executing the binaries, consistently, I find that the C++ implementation is significantly faster (~4x in multi-threaded, ~11x in single).
Sample:
C++:

Method A took 264ms
Method C took 97ms

Rust:

time for sequential : 3.2898265s
time for parallel : 415.4457ms

Both are being run on WSL2. (g++and stable rust). Apart from the differences pointed out, is there anything else that could be contributing to the slowness of the rust program ?

cargo r -- everything is ignored here runs your program in unoptimized debug mode.

I see the extra ' -- ' I inserted in there. silly mistake. But after reducing it Rust while faster, is still taking around 2x for the same (only for multi-threaded, single thread is equivalent now). Removing the mistake definitely has helped.

In a situation like this, where the core of what you're comparing is a very simple function, a good way to figure out what is causing performance to be different is to study the machine code produced by both compilers. You don't have to be able to read the entire assembly and execute it in your head — just spot loops and function calls, and look up what sort of instructions are in each loop. Compiler Explorer is a useful tool for this — it lets you run Rust and C++ compilers and look at the disassembly of the output immediately.

There is no -ffast-math equivalent for Rust. (-ffast-math discards various sorts of correctness, so Rust probably won't ever have it — at least as a compilation option rather than something explicitly requested within code.) However, that doesn't make a difference here, because as far as I know, it only affects floating-point numbers, and you're using only integers.

You can use std::thread::available_parallelism() instead to reduce dependencies.

In order to gain a little performance and simplicity, you can avoid Arc and use scoped threads which can simply borrow the input data. This is unlikely to make a significant difference, since it only affects setup cost and not the addition loop, but it makes the code simpler to write.

4 Likes

It looks like you are dividing the work into 16 chunks, but then only spawning threads for the first 8 chunks.

3 Likes

The bitwise shift (<< '\n') on this line will also cause your C++ function to calculate the wrong result.

2 Likes

all in all a bunch of silly mistakes here. thanks for pointing all of them out.

this must have been some copy-paste mistake. But you are right, wasn't even looking at the results of the addition. Thanks for taking the time out to point it all out.

This is very informative! I had made a bunch of mistakes as pointed out by @mbrubeck that were the reason for slowness (AND CORRECTNESS) of the code. Thanks for a very detailed response.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.