Optimizations in Rayon

I wrote a small program to see how can I myself implement parallel tasks and checked the execution time of 3 implementation of the same function which calculates the sum of a vector

use rand::Rng;
use rayon::iter::{IntoParallelRefIterator, ParallelIterator};
use std::{i32, sync::mpsc, thread, time::Instant};

fn find_sum_parallel(vec: &Vec<i32>) -> i32 {
    let start = Instant::now();

    let num_threads = 4;
    let len_per_thread = vec.len() / num_threads;
    let (sx, rx) = mpsc::channel();
    let mut final_sum = 0;

    thread::scope(|scope| {
        for i in 0..num_threads {
            let sender = sx.clone();
            scope.spawn(move || {
                let start_index = i * len_per_thread;
                let end_index = start_index + len_per_thread;
                let mut sum = 0;

                for i in start_index..end_index {
                    sum += vec[i];
                }

                sender.send(sum).unwrap();
            });
        }

        drop(sx);
    });

    while let Ok(val) = rx.recv() {
        final_sum += val;
    }

    let duration = start.elapsed(); // Measure the elapsed time
    println!("function took: {}", duration.as_secs_f32() * 10000.0);

    return final_sum;
}

fn find_sum(vec: &Vec<i32>) -> i32 {
    let start = Instant::now(); // Record the start time
    let mut sum = 0;

    for val in vec {
        sum += val;
    }

    let duration = start.elapsed(); // Measure the elapsed time
    println!("function took: {}", duration.as_secs_f32() * 10000.0);

    return sum;
}

fn find_sum_rayon(vec: &Vec<i32>) -> i32 {
    let start = Instant::now(); // Record the start time
    let sum_3: i32 = vec.par_iter().sum();
    let duration = start.elapsed(); // Measure the elapsed time

    println!("function took: {}", duration.as_secs_f32() * 10000.0);

    return sum_3;
}

fn main() {
    let max_nums = 10000000;
    let mut rng = rand::thread_rng();
    let vec: Vec<i32> = (0..max_nums).map(|_| rng.gen_range(1..10)).collect();

    let sum_1 = find_sum(&vec);
    let sum_2 = find_sum_parallel(&vec);
    let sum_3: i32 = find_sum_rayon(&vec);

    assert!(sum_1 == sum_2);
    assert!(sum_2 == sum_3)
}

The result for one of the runs is

my sequential function took: 462.91708
my parallel function took: 291.24417
rayon function took: 42.31666

My question is even for this simple program where I assumed custom implemented code would execute faster still rayon beats with a large difference. I don't understand what kind of optimizations I can put in my parallel implementation to reach rayon numbers ? Essentially what you guys feel can be improved in my parallel code to almost reach rayon performance ? Thanks

Try running in release mode. For me the differnces mostly disappear.
In debug mode i get:

sequential function took: 491.6196
parallel function took: 361.46378
rayon function took: 90.45666

and with release i get:

sequential function took: 15.30958
parallel function took: 10.075
rayon function took: 9.94167

Also note that there is a lot of variance (3-4 seconds in release mode) in between runs so sometimes rayon is faster and sometimes slower.

5 Likes

There might be a few different things that explain this.

First, rayon spawns threads early and lets them idle until there is work available for them. Spawning threads can be very expensive, you might not be measuring the thread startup time in the rayon function.

Second, rayon slices the vector, rather than using the index operator []. This has the benefit that bounds checks can be optimized out. I added a fourth case that shows the difference with slices:

And those numbers seem a bit high. Are you running a debug/non-optimized build?

edit: The Duration scaling is incorrect. It should multiply by 1_000 to change the unit to ms. You're multiplying by 10_000, which changes the unit to hundreds of microseconds. (I fixed this in my playground link.)

thanks @luca3s for the answer! I was indeed running in Debug mode. This makes sense now, thanks again

Thanks @parasyte for the response! I have added time measuring code wrapping the whole rayon function call which should also include the thread spawning execution time right ?

I don’t know the internals of rayon well enough to answer that. It will lazily start the threadpool. But I do not know how much overhead that has on the calling thread.