Iterators faster than for loops in parallel

Hello,

Running some benchmarks, I realized that in some cases iterators ran faster than for loops in parallel. See the following example:

use rayon::prelude::*;
use std::time::Instant;

fn some_function(x: f64) -> f64 {
    x.powf(-3.1415)
}

fn main() {
    let nb_operations = 100_000_000;
    let nb_threads = 12;
    let op_per_thread = nb_operations / nb_threads;

    let input_list = vec![3.1415; nb_operations];

    // compute the result using a for loop
    let mut results = Vec::with_capacity(nb_operations);
    let t = Instant::now();
    for &x in &input_list {
        results.push(some_function(x));
    }
    println!(
        "[ For loop | single thread ] elapsed time: {}",
        t.elapsed().as_secs_f32()
    );

    // compute the result using an iterator + collect
    let t = Instant::now();
    let results: Vec<f64> = input_list.iter().map(|&x| some_function(x)).collect();
    println!(
        "[ Iterator | single thread ] elapsed time: {}",
        t.elapsed().as_secs_f32()
    );

    // compute the result in parallel using for loops (by chunks)
    let f_chunk = |(i, result): (usize, &mut Vec<f64>)| {
        for &x in &input_list[op_per_thread * i..op_per_thread * (i + 1)] {
            result.push(some_function(x));
        }
    };

    let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
    let t = Instant::now();
    thread_results
        .par_iter_mut()
        .enumerate()
        .for_each(f_chunk);
    println!(
        "[ For loop | nb_threads={} ] elapsed time: {}",
        nb_threads,
        t.elapsed().as_secs_f32()
    );

    // compute the result in parallel using iterators + collect (by chunks)
    let g_chunk = |i| {
        input_list[op_per_thread * i..op_per_thread * (i + 1)]
            .iter()
            .map(|&x| some_function(x))
    };

    let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
    let t = Instant::now();
    thread_results
        .par_iter_mut()
        .enumerate()
        .for_each(|(i, result)| *result = g_chunk(i).collect());
    println!(
        "[ Iterator | nb_threads={} ] elapsed time: {}",
        nb_threads,
        t.elapsed().as_secs_f32()
    );
}

On my machine, the two versions take almost the same time to run single-threaded, but when run in parallel the iterator version is about twice as fast.

What's the reason for that? And is there a way to make the for-loop version run as fast as the iterator version in similar cases?

I think you are measuring allocation overhead.
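
My guess at the mechanism (an assumption, worth verifying): vec![Vec::with_capacity(op_per_thread); nb_threads] fills the outer vector with clones of the inner Vec, and cloning a Vec allocates room for its elements only, not its spare capacity, so each thread starts from a zero-capacity vector and reallocates repeatedly while pushing. collect(), in contrast, allocates once up front from the iterator's size hint. A quick check of the clone behavior:

let v: Vec<f64> = Vec::with_capacity(100);
assert_eq!(v.capacity(), 100);
// the clone allocates only what its elements need; spare capacity is not kept
assert_eq!(v.clone().capacity(), 0);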

Compare:

// compute the result in parallel using for loops (by chunks) (without allocation)
let f_chunk = |(i, result): (usize, &mut Vec<f64>)| {
    for (j, &x) in input_list[op_per_thread * i..op_per_thread * (i + 1)].iter().enumerate() {
        result[j] = some_function(x);
    }
};

let mut thread_results = vec![vec![0.0_f64; op_per_thread]; nb_threads];
let t = Instant::now();
thread_results
    .par_iter_mut()
    .enumerate()
    .for_each(f_chunk);

println!(
    "[ For loop/No-alloc | nb_threads={} ] elapsed time: {}",
    nb_threads,
    t.elapsed().as_secs_f32()
);

and ...

// compute the result in parallel using iterators (by chunks) (without allocation)
let g_chunk = |(i, result): (usize, &mut Vec<f64>)| {
    input_list[op_per_thread * i..op_per_thread * (i + 1)]
        .iter()
        .enumerate()
        .for_each(|(j, &x)| result[j] = some_function(x));
};

let mut thread_results = vec![vec![0.0_f64; op_per_thread]; nb_threads];
let t = Instant::now();
thread_results
    .par_iter_mut()
    .enumerate()
    .for_each(g_chunk);

println!(
    "[ Iterator/No-alloc | nb_threads={} ] elapsed time: {}",
    nb_threads,
    t.elapsed().as_secs_f32()
);

Totally unscientific results:

[ For loop | single thread ] elapsed time: 2.1558146
[ Iterator | single thread ] elapsed time: 2.0276945
[ For loop | nb_threads=8 ] elapsed time: 1.0456148
[ For loop/No-alloc | nb_threads=8 ] elapsed time: 0.39866054
[ Iterator | nb_threads=8 ] elapsed time: 0.4765062
[ Iterator/No-alloc | nb_threads=8 ] elapsed time: 0.40528947
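
For what it's worth, you don't have to chunk by hand at all: rayon's parallel iterators split the work themselves, and collect() on an indexed parallel iterator writes into a single exact-sized output allocation. A minimal sketch, reusing some_function and input_list from above:

use rayon::prelude::*;

let results: Vec<f64> = input_list
    .par_iter()                     // rayon decides how to split the work
    .map(|&x| some_function(x))
    .collect();                     // one exact-sized allocation for the output

This sidesteps the per-thread Vecs entirely.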

I don't see a good reason for a practical difference.

Have you tried sticking this in a benchmarking framework, or at least doing some warmup iterations before running each of these? I feel like your current code could easily be biased: the first parallel execution warms up the CPU, so the second one runs faster from the start, and the same goes for the per-core caches being filled.

I can't test this on my computer; it doesn't have enough CPU cores to see the effect. But if you're interested, could you try running each of the tests 5 times before doing the one where you measure, just to ensure you're getting a fair benchmark? Something like this for each benchmark:

// warm up: run the same workload a few times before measuring
for _ in 0..5 {
    let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
    thread_results
        .par_iter_mut()
        .enumerate()
        .for_each(|(i, result)| *result = g_chunk(i).collect());
}

// measured run
let mut thread_results = vec![Vec::with_capacity(op_per_thread); nb_threads];
let t = Instant::now();
thread_results
    .par_iter_mut()
    .enumerate()
    .for_each(|(i, result)| *result = g_chunk(i).collect());

It'd be better to use a framework like criterion, which can give real guarantees that calculations aren't optimized out and produce statistics for the results, but migrating to that is extra work compared to just adding a warmup period.
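
In case it helps, here's roughly what that could look like with criterion (a sketch, assuming its standard bench_function API; it would go in benches/bench.rs with criterion as a dev-dependency):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn some_function(x: f64) -> f64 {
    x.powf(-3.1415)
}

fn bench_single_thread(c: &mut Criterion) {
    // a smaller input than the original post, so each sample stays short
    let input_list = vec![3.1415_f64; 1_000_000];
    c.bench_function("iterator + collect", |b| {
        b.iter(|| {
            // black_box keeps the compiler from optimizing the work away
            let results: Vec<f64> = black_box(&input_list)
                .iter()
                .map(|&x| some_function(x))
                .collect();
            black_box(results)
        })
    });
}

criterion_group!(benches, bench_single_thread);
criterion_main!(benches);

criterion then handles warmup, repeated sampling, and outlier detection for you.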

