Iterators faster than for loops in parallel

Hello,

Running some benchmarks, I realized that in some cases iterators ran faster than for loops in parallel. See the following example:

use rayon::prelude::*;
use std::time::Instant;

fn some_function(x: f64) -> f64 {
    x.powf(-3.1415)
}

fn main() {
    let nb_operations = 100_000_000;
    let nb_threads = 12;
    let op_per_thread = nb_operations / nb_threads;

    let input_list = vec![3.1415; nb_operations];

    // compute the result using a for loop
    let mut results = Vec::with_capacity(nb_operations);
    let t = Instant::now();
    for &x in &input_list {
        results.push(some_function(x));
    }
    println!(
        "[ For loop | single thread ] elapsed time: {}",
        t.elapsed().as_secs_f32()
    );

    // compute the result using an iterator + collect
    let t = Instant::now();
    let results: Vec<f64> = input_list.iter().map(|&x| some_function(x)).collect();
    println!(
        "[ Iterator | single thread ] elapsed time: {}",
        t.elapsed().as_secs_f32()
    );

    // compute the result in parallel using for loops (by chunks)
    let f_chunk = |(i, result): (usize, &mut Vec<f64>)| {
        for &x in &input_list[op_per_thread * i..op_per_thread * (i + 1)] {
            result.push(some_function(x));
        }
    };

    let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
    let t = Instant::now();
    thread_results
        .par_iter_mut()
        .enumerate()
        .for_each(f_chunk);
    println!(
        "[ For loop | nb_threads={} ] elapsed time: {}",
        nb_threads,
        t.elapsed().as_secs_f32()
    );

    // compute the result in parallel using iterators + collect (by chunks)
    let g_chunk = |i| {
        input_list[op_per_thread * i..op_per_thread * (i + 1)]
            .iter()
            .map(|&x| some_function(x))
    };

    let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
    let t = Instant::now();
    thread_results
        .par_iter_mut()
        .enumerate()
        .for_each(|(i, result)| *result = g_chunk(i).collect());
    println!(
        "[ Iterator | nb_threads={} ] elapsed time: {}",
        nb_threads,
        t.elapsed().as_secs_f32()
    );
}

On my machine, the two versions take almost the same time to run single-threaded, but when run in parallel the iterator version is about twice as fast.

What's the reason for that? And is there a way to make the for-loop version run as fast as the iterator version in similar cases?

I think you are measuring allocation overhead.
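
My guess at the mechanism (an assumption, worth verifying): vec![Vec::with_capacity(op_per_thread); nb_threads] fills the outer vector with clones of the inner Vec, and cloning a Vec allocates room for its elements only, not its spare capacity, so each thread starts from a zero-capacity vector and reallocates repeatedly while pushing. collect(), in contrast, allocates once up front from the iterator's size hint. A quick check of the clone behavior:

let v: Vec<f64> = Vec::with_capacity(100);
assert_eq!(v.capacity(), 100);
// the clone allocates only what its elements need; spare capacity is not kept
assert_eq!(v.clone().capacity(), 0);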

Compare:

// compute the result in parallel using for loops (by chunks) (without allocation)
let f_chunk = |(i, result): (usize, &mut Vec<f64>)| {
    for (j, &x) in input_list[op_per_thread * i..op_per_thread * (i + 1)].iter().enumerate() {
        result[j] = some_function(x);
    }
};

let mut thread_results = vec![vec![0.0_f64; op_per_thread]; nb_threads];
let t = Instant::now();
thread_results
    .par_iter_mut()
    .enumerate()
    .for_each(f_chunk);

println!(
    "[ For loop/No-alloc | nb_threads={} ] elapsed time: {}",
    nb_threads,
    t.elapsed().as_secs_f32()
);

and ...

// compute the result in parallel using iterators (by chunks) (without allocation)
let g_chunk = |(i, result): (usize, &mut Vec<f64>)| {
    input_list[op_per_thread * i..op_per_thread * (i + 1)]
        .iter()
        .enumerate()
        .for_each(|(j, &x)| result[j] = some_function(x));
};

let mut thread_results = vec![vec![0.0_f64; op_per_thread]; nb_threads];
let t = Instant::now();
thread_results
    .par_iter_mut()
    .enumerate()
    .for_each(g_chunk);

println!(
    "[ Iterator/No-alloc | nb_threads={} ] elapsed time: {}",
    nb_threads,
    t.elapsed().as_secs_f32()
);

Totally unscientific results:

[ For loop | single thread ] elapsed time: 2.1558146
[ Iterator | single thread ] elapsed time: 2.0276945
[ For loop | nb_threads=8 ] elapsed time: 1.0456148
[ For loop/No-alloc | nb_threads=8 ] elapsed time: 0.39866054
[ Iterator | nb_threads=8 ] elapsed time: 0.4765062
[ Iterator/No-alloc | nb_threads=8 ] elapsed time: 0.40528947
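
For what it's worth, you don't have to chunk by hand at all: rayon's parallel iterators split the work themselves, and collect() on an indexed parallel iterator writes into a single exact-sized output allocation. A minimal sketch, reusing some_function and input_list from above:

use rayon::prelude::*;

let results: Vec<f64> = input_list
    .par_iter()                     // rayon decides how to split the work
    .map(|&x| some_function(x))
    .collect();                     // one exact-sized allocation for the output

This sidesteps the per-thread Vecs entirely.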

I don't see a good reason for a practical difference.

Have you tried sticking this in a benchmarking framework, or at least doing some warmup iterations before running each of these? I feel like your current code could easily be biased: the first parallel execution warms up the CPU, so the second one runs faster from the start, and the same goes for the per-core caches being filled.

I can't test this on my computer; it doesn't have enough CPU cores to see the effect. But if you're interested, could you try running each of the tests 5 times before doing the one where you measure, just to ensure you're getting a fair benchmark? Something like this for each benchmark:

// warm up: run the same workload a few times before measuring
for _ in 0..5 {
    let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
    thread_results
        .par_iter_mut()
        .enumerate()
        .for_each(|(i, result)| *result = g_chunk(i).collect());
}

// measured run
let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
let t = Instant::now();
thread_results
    .par_iter_mut()
    .enumerate()
    .for_each(|(i, result)| *result = g_chunk(i).collect());

It'd be better to use a framework like criterion, which can give real guarantees that calculations aren't optimized out and produce statistics for the results, but migrating to that is extra work compared to just adding a warmup period.
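
In case it helps, here's roughly what that could look like with criterion (a sketch, assuming its standard bench_function API; it would go in benches/bench.rs with criterion as a dev-dependency):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn some_function(x: f64) -> f64 {
    x.powf(-3.1415)
}

fn bench_single_thread(c: &mut Criterion) {
    // a smaller input than the original post, so each sample stays short
    let input_list = vec![3.1415_f64; 1_000_000];
    c.bench_function("iterator + collect", |b| {
        b.iter(|| {
            // black_box keeps the compiler from optimizing the work away
            let results: Vec<f64> = black_box(&input_list)
                .iter()
                .map(|&x| some_function(x))
                .collect();
            black_box(results)
        })
    });
}

criterion_group!(benches, bench_single_thread);
criterion_main!(benches);

criterion then handles warmup, repeated sampling, and outlier detection for you.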

