Running some benchmarks, I realized that in some cases iterators ran faster than for loops when run in parallel. See the following example:
```rust
use rayon::prelude::*;
use std::time::Instant;

fn some_function(x: f64) -> f64 {
    x.powf(-3.1415)
}

fn main() {
    let nb_operations = 100_000_000;
    let nb_threads = 12;
    let op_per_thread = nb_operations / nb_threads;
    let input_list = vec![3.1415; nb_operations];

    // compute the result using a for loop
    let mut results = Vec::with_capacity(nb_operations);
    let t = Instant::now();
    for &x in &input_list {
        results.push(some_function(x))
    }
    println!(
        "[ For loop | single thread ] elapsed time: {}",
        t.elapsed().as_secs_f32()
    );

    // compute the result using an iterator + collect
    let t = Instant::now();
    let results: Vec<f64> = input_list.iter().map(|&x| some_function(x)).collect();
    println!(
        "[ Iterator | single thread ] elapsed time: {}",
        t.elapsed().as_secs_f32()
    );

    // compute the result in parallel using for loops (by chunks)
    let f_chunk = |(i, result): (usize, &mut Vec<f64>)| {
        for &x in &input_list[op_per_thread * i..op_per_thread * (i + 1)] {
            result.push(some_function(x))
        }
    };
    let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
    let t = Instant::now();
    thread_results
        .par_iter_mut()
        .enumerate()
        .for_each(f_chunk);
    println!(
        "[ For loop | nb_threads={} ] elapsed time: {}",
        nb_threads,
        t.elapsed().as_secs_f32()
    );

    // compute the result in parallel using iterators + collect (by chunks)
    let g_chunk = |i| {
        input_list[op_per_thread * i..op_per_thread * (i + 1)]
            .iter()
            .map(|&x| some_function(x))
    };
    let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
    let t = Instant::now();
    thread_results
        .par_iter_mut()
        .enumerate()
        .for_each(|(i, result)| *result = g_chunk(i).collect());
    println!(
        "[ Iterator | nb_threads={} ] elapsed time: {}",
        nb_threads,
        t.elapsed().as_secs_f32()
    );
}
```
I don't see a good reason for there to be a practical performance difference between the two.
Have you tried sticking this in a benchmarking framework, or at least doing some warmup iterations before running each of these? As written, your code could easily be biased: the CPU gets warmed up during the first parallel execution, so the second one runs faster right from the start, and similarly the caches on the different cores are already filled by the time the second test runs.
I can't test this on my computer; it doesn't have enough CPU cores to see the effect. But if you're interested, could you try running each of the tests 5 times before doing the one where you measure, just to ensure you're getting a fair benchmark? Something like this for each benchmark:
```rust
// warmup: run the same computation 5 times before measuring
for _ in 0..5 {
    let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
    thread_results
        .par_iter_mut()
        .enumerate()
        .for_each(|(i, result)| *result = g_chunk(i).collect());
}
// measured run
let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
let t = Instant::now();
thread_results
    .par_iter_mut()
    .enumerate()
    .for_each(|(i, result)| *result = g_chunk(i).collect());
```
It'd be better to use a framework like criterion, which can give real guarantees that the calculations aren't optimized out and produce statistics for the results, but migrating to it is extra work compared to just adding a warmup period.
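For reference, the migration is not huge. A minimal criterion bench for one of your variants might look roughly like this (a sketch, not tested here; it goes in `benches/pow_bench.rs` and needs criterion as a dev-dependency plus `harness = false` in the `[[bench]]` section of `Cargo.toml`):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_iterator_map(c: &mut Criterion) {
    let input = vec![3.1415_f64; 1_000_000];
    c.bench_function("iterator map + collect", |b| {
        // criterion handles warmup and runs the closure many times,
        // reporting mean/median and outliers; black_box prevents the
        // compiler from optimizing the computation away.
        b.iter(|| {
            let out: Vec<f64> = black_box(&input)
                .iter()
                .map(|&x| x.powf(-3.1415))
                .collect();
            black_box(out)
        })
    });
}

criterion_group!(benches, bench_iterator_map);
criterion_main!(benches);
```

You'd then run it with `cargo bench` and add one `bench_function` per variant so the four numbers come from the same statistical machinery.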