I wrote a small program to see how can I myself implement parallel tasks and checked the execution time of 3 implementation of the same function which calculates the sum of a vector
use rand::Rng;
use rayon::iter::{IntoParallelRefIterator, ParallelIterator};
use std::{i32, sync::mpsc, thread, time::Instant};
fn find_sum_parallel(vec: &Vec<i32>) -> i32 {
let start = Instant::now();
let num_threads = 4;
let len_per_thread = vec.len() / num_threads;
let (sx, rx) = mpsc::channel();
let mut final_sum = 0;
thread::scope(|scope| {
for i in 0..num_threads {
let sender = sx.clone();
scope.spawn(move || {
let start_index = i * len_per_thread;
let end_index = start_index + len_per_thread;
let mut sum = 0;
for i in start_index..end_index {
sum += vec[i];
}
sender.send(sum).unwrap();
});
}
drop(sx);
});
while let Ok(val) = rx.recv() {
final_sum += val;
}
let duration = start.elapsed(); // Measure the elapsed time
println!("function took: {}", duration.as_secs_f32() * 10000.0);
return final_sum;
}
fn find_sum(vec: &Vec<i32>) -> i32 {
let start = Instant::now(); // Record the start time
let mut sum = 0;
for val in vec {
sum += val;
}
let duration = start.elapsed(); // Measure the elapsed time
println!("function took: {}", duration.as_secs_f32() * 10000.0);
return sum;
}
fn find_sum_rayon(vec: &Vec<i32>) -> i32 {
let start = Instant::now(); // Record the start time
let sum_3: i32 = vec.par_iter().sum();
let duration = start.elapsed(); // Measure the elapsed time
println!("function took: {}", duration.as_secs_f32() * 10000.0);
return sum_3;
}
fn main() {
let max_nums = 10000000;
let mut rng = rand::thread_rng();
let vec: Vec<i32> = (0..max_nums).map(|_| rng.gen_range(1..10)).collect();
let sum_1 = find_sum(&vec);
let sum_2 = find_sum_parallel(&vec);
let sum_3: i32 = find_sum_rayon(&vec);
assert!(sum_1 == sum_2);
assert!(sum_2 == sum_3)
}
The result for one of the runs is
my sequential function took: 462.91708
my parallel function took: 291.24417
rayon function took: 42.31666
My question is even for this simple program where I assumed custom implemented code would execute faster still rayon beats with a large difference. I don't understand what kind of optimizations I can put in my parallel implementation to reach rayon numbers ? Essentially what you guys feel can be improved in my parallel code to almost reach rayon performance ? Thanks