Comparison of simple code between Rust and C!


#1

So, I have code in Rust and in C that finds the maximum number in a vector (in my case, the largest number is always in the last position). The thing is, I'm using Linux time to measure the performance of these programs, and C is giving its result 30x faster than Rust (real time). I'm quite sure that my C code is working correctly, but since Rust is taking so long, I think I'm doing something wrong.
I would like it if someone could take a look and see if I'm doing something wrong with my Rust code, please!

- RUST CODE

- C CODE

THANK YOU !


#2

Two comments:

  1. Did you build the Rust code in release mode? (--release parameter to cargo, "release" switch in the top-left corner of the playground)
  2. How would you feel about using Rayon and simplifying this whole code into something like vector.par_iter().max()?

#3
  1. I did not test the execution time in the playground; I compiled it on my Ubuntu (i5, 4 cores) machine.
  2. I tried to use Rayon, but I think I don't know how to use it properly. I tried to compile it normally like any Rust program, and it keeps giving an error that it can't find the crate.

#4

You are cloning the vector in the Rust version. Borrow a slice instead. There is no reason to clone the vector. That’s a lot of heap allocation that the C version is not doing.


#5

I thought that it was just cloning the reference to the vector. Isn't that what Arc does?


#6

It’s cloning the Arc holding the vector. But I’d change the code to use Arc::clone(&vetor) to be more explicit.
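To make the distinction concrete, here is a minimal sketch showing that cloning the Arc only bumps a reference count; the vector's heap buffer itself is shared, not duplicated:

```rust
use std::sync::Arc;

fn main() {
    let vetor = Arc::new(vec![0u8; 1_000_000]);

    // Arc::clone copies only the smart pointer and increments the
    // reference count; the million-element buffer is not duplicated.
    let alias = Arc::clone(&vetor);

    // Both handles point at the same allocation.
    assert!(Arc::ptr_eq(&vetor, &alias));
    println!("strong_count = {}", Arc::strong_count(&vetor));
}
```

Writing `Arc::clone(&vetor)` instead of `vetor.clone()` does the same thing, but makes it obvious at the call site that only the Arc, not the underlying data, is being cloned.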

I’d say make sure you’re building in release mode as @HadrienG said.


#7

How do I compile in release mode?


#8

cargo run --release ...


#9

Ohh, I think my problem is using cargo, then. I'm gonna have to look that up.

How can I borrow a slice of the vector?


#10

You can’t borrow a slice of the vector here because thread::spawn needs a 'static closure, meaning it’s not borrowing anything (other than static refs, but that’s inapplicable to this case). That’s why the Arc is there.
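A small sketch to illustrate the 'static requirement (the commented-out line is the pattern the compiler rejects):

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    let data = vec![1u8, 7, 2];

    // This does NOT compile: the closure would borrow `data`, but
    // thread::spawn requires a 'static closure because the new
    // thread may outlive the current function:
    //
    //     let handle = thread::spawn(|| data.iter().max());
    //
    // Moving an Arc into the closure satisfies 'static: the thread
    // owns its own handle to the shared vector.
    let shared = Arc::new(data);
    let v = Arc::clone(&shared);
    let handle = thread::spawn(move || *v.iter().max().unwrap());
    println!("max = {}", handle.join().unwrap());
}
```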


#11

No, you are correct. I missed that you renamed/re-wrapped vetor in an Arc. You'd do better to use split_at and borrow the subset of the vector that each thread needs. Less overhead. Using Arc like this seems counter-productive.


#12

You can’t with thread::spawn :slight_smile:, as mentioned above. Rayon would allow it or crossbeam, but those are specialized APIs.
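As a sketch of what those scoped APIs buy you: crossbeam's scoped threads (and, in newer Rust, std::thread::scope, stabilized in 1.63) guarantee that every thread is joined before the borrow ends, so each thread can borrow its chunk of the slice directly, with no Arc and no cloning. Using the std version here to keep it dependency-free:

```rust
use std::thread;

// Each scoped thread borrows its chunk of `data` directly; the scope
// joins every thread before the borrow of `data` can end.
fn scoped_max(data: &[u8], n_threads: usize) -> u8 {
    let chunk_len = (data.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len.max(1))
            .map(|chunk| s.spawn(move || chunk.iter().copied().max().unwrap_or(u8::MIN)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .max()
            .unwrap_or(u8::MIN)
    })
}

fn main() {
    let data: Vec<u8> = (0u8..=255).cycle().take(1_000_000).collect();
    println!("max = {}", scoped_max(&data, 4));
}
```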


#13

byaaahhh…missed that.


#14

Enabling use of rayon is a three-step process:

  1. Add rayon = "1" to the [dependencies] section of your Cargo.toml (tells Cargo to download, build, and link with the proper version of Rayon).
  2. Add extern crate rayon; to your main.rs (tells the rustc compiler that you are going to use Rayon).
  3. Add use rayon::prelude::*; to your main.rs (brings Rayon’s parallel iterators into scope).

#15

For better comparison, I modified the code slightly, and now have three methods doing the same, one single-threaded, one using rayon, one explicitly multi-threaded. The explicit version is by far the fastest!

fn single_threaded(data: Arc<Vec<u8>>) -> u8 {
    data.iter()
        .fold(std::u8::MIN, |m, elem| std::cmp::max(m, *elem))
}

fn rayon_based_mt(data: Arc<Vec<u8>>) -> u8 {
    data.par_iter()
        .reduce(|| &std::u8::MIN, |a, b| std::cmp::max(a, b))
        .clone()
}

fn improved_original(data: Arc<Vec<u8>>) -> u8 {
    //------------------------------------------------
    let mut threads = vec![];
    for i in 0..N_THREADS {
        let v = data.clone();
        threads.push(thread::spawn(move || {
            let low = i * (ELEMENTS / N_THREADS);
            let high = if (i == N_THREADS - 1) && ((ELEMENTS % N_THREADS) != 0) {
                ELEMENTS
            } else {
                low + ELEMENTS / N_THREADS
            };

            v[low..high]
                .iter()
                .fold(std::u8::MIN, |m, elem| std::cmp::max(m, *elem))
        }));
    }
    threads
        .into_iter()
        .map(|t| t.join().unwrap())
        .max()
        .unwrap()
}

On quite a strong laptop, with release build, I got these results (ELEMENTS = 15_999_999_999):
single_threaded version took 1840 ms
rayon_based_mt version took 1497 ms
improved_original version took 721 ms

Unless I chose a bad implementation for rayon, one can say that optimizing the array evaluation with iterators and adapters has the biggest effect. Distributing the work explicitly is awful, but it can pay off.


#16

For cases where the actual work (cmp::max) is very small, I would suggest telling Rayon not to divide it too far, something like data.par_iter().with_min_len(1_000_000).

My guess is that it’s currently getting lost in the overhead of trying to adaptively split the array way down. Rayon doesn’t know how “heavy” the work actually is, or whether each item has proportional amounts of work, etc. Maybe we could be a little smarter about this, but it’s hard to generalize.

Side note, Iterator and ParallelIterator both have max() methods already.


#17

Hooolllyyyy… I loved your code, @emabee. Thanks a lot! I'm pretty new to Rust and I learned a lot from your implementation.


#18

The tuning with with_min_len() only allowed a slight improvement.
The built-in max() functions are surprisingly much slower:
single_threaded version took 1900 ms
single_threaded_max version took 6336 ms
rayon_based version took 1423 ms
rayon_based_max version took 2282 ms
rayon_based_tuned version took 1312 ms
improved_original version took 652 ms
Original version took 2919 ms


#19

How did you measure the execution time of the program?


#20

Like that:

extern crate chrono;
// ...
use chrono::Local;
// ...

    let start = Local::now();
    println!("Max: {}", single_threaded_max(data.clone()));
    println!(
        "single_threaded_max version took {} ms",
        (Local::now() - start).num_milliseconds()
    );
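An alternative without an external crate is std::time::Instant from the standard library (a sketch; single_threaded_max here is a stand-in for whichever function is being timed):

```rust
use std::time::Instant;

// Stand-in for the function being measured.
fn single_threaded_max(data: &[u8]) -> u8 {
    data.iter().copied().max().unwrap_or(u8::MIN)
}

fn main() {
    let data: Vec<u8> = (0u8..=255).cycle().take(1_000_000).collect();

    let start = Instant::now();
    let max = single_threaded_max(&data);
    println!(
        "single_threaded_max version took {} ms (max = {})",
        start.elapsed().as_millis(),
        max
    );
}
```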