Workflow for profiling Rayon

Hi,

I'm trying to tune some Rayon code and getting a little stuck. I'm using perf + the Firefox Profiler, like this:


  1. I want debug symbols and minimal inlining, for better interpretability, so I set Cargo.toml to include

    [profile.release]
    opt-level = 1
    debug = true
    lto = false
    
  2. $ perf record --call-graph dwarf ./my_executable
    $ perf script -F +pid > perf.data.txt
    
  3. Open https://profiler.firefox.com/ in a browser

  4. Click "Load profile from file", then find perf.data.txt


This works great, and makes it easy to see intervals that are fully sequential. But I'd also like a way to spot places where the parallel implementation could be made more efficient, for example cases where the parallelism is too fine-grained and would be better served by par_chunks.

I was thinking of something like filtering for steal, or finding what proportion of the samples come from my module (as opposed to Rayon itself, which might indicate overhead). But filtering for steal doesn't seem very informative, and I'm not sure module filtering is even possible, or whether it would be "correct".
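
For example, here's the kind of thing I have in mind for the module-proportion idea. This is an untested sketch, and it assumes the textual layout of perf script output (one unindented header line per sample, followed by indented stack frames with the leaf frame first):

    // Untested sketch: classify each perf sample by its leaf frame, to
    // estimate how much CPU time lands inside rayon's own frames rather
    // than in my code. Assumes the `perf script -F +pid` text layout:
    // an unindented header line per sample, then indented frames, leaf first.
    use std::fs;

    fn main() {
        let text = fs::read_to_string("perf.data.txt").expect("run perf script first");
        let (mut total, mut in_rayon) = (0u64, 0u64);
        let mut expecting_leaf = false;
        for line in text.lines() {
            if line.trim().is_empty() {
                expecting_leaf = false;
            } else if !line.starts_with(char::is_whitespace) {
                // Unindented line = sample header; the next indented line is the leaf frame.
                total += 1;
                expecting_leaf = true;
            } else if expecting_leaf {
                if line.contains("rayon") {
                    in_rayon += 1;
                }
                expecting_leaf = false;
            }
        }
        println!(
            "{in_rayon}/{total} samples have a rayon leaf frame ({:.1}%)",
            100.0 * in_rayon as f64 / total.max(1) as f64
        );
    }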

Maybe there's a better approach - any suggestions?


I do not have an answer. I am trying to understand your question. Is the following accurate:

We have C cores, and N pieces of work. We want to minimize clock time.

We have to pick some K, such that the N pieces of work are divided into N/K chunks of size K each.

The total work done is N * w1 + (N/K) * w2, where w1 = the cost of processing one unit and w2 = the cost of starting a chunk.

At one extreme, if K = N, we have no parallelism.
At the other extreme, if K = 1, we do N * w2 units of wasted work (the cost of starting a chunk, paid once per element).

So now you are looking for a way to find a 'good' K, where the N * w1 part can be parallelized while the 'overhead' of (N/K) * w2 is minimized?
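
In symbols, with the same quantities (a restatement, not new analysis):

    T_{\text{total}}(K) = N\,w_1 + \frac{N}{K}\,w_2, \qquad 1 \le K \le N

At K = N there is a single chunk (no parallelism, minimal overhead w_2); at K = 1 the overhead term grows to N\,w_2.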

Kind of. It's great when you can just par_iter and move on, but sometimes that's much too granular. We could just always try par_chunks, but there's a fair amount of human overhead in that (at least for me - I'm still spinning up on Rust).

So I'm hoping there might be a way to analyze the perf results to get an idea of whether it's worth the trouble of trying par_chunks. Or even better, maybe there's some map-reduce abstraction I don't know about that makes it easy to refactor from par_iter while specifying the chunk size or the number of chunks.
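
To be concrete, the kind of refactor I mean is going from z.par_iter().sum() to something like this sketch, where the chunking and the re-reduction are spelled out by hand (the function name is mine):

    // Hand-rolled chunking: pick a chunk count, sum each chunk
    // sequentially, then reduce the per-chunk sums in parallel.
    use rayon::prelude::*;

    fn sum_chunked(z: &[f64], num_chunks: usize) -> f64 {
        let chunk_size = (z.len() + num_chunks - 1) / num_chunks; // ceiling division
        z.par_chunks(chunk_size)
            .map(|chunk| chunk.iter().sum::<f64>())
            .sum()
    }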

Again, I'm not sure if this answers your question or if I'm going off on a theoretical tangent. Suppose we pick K such that N/K = C * 100, i.e. the number of chunks is 100x the number of cores.

On the wasted-work side, we pay a cost of 100 * C * w2, which is not too much: it's the cost of starting 100 chunks per core.

On the parallelism side, we're within 1% of optimal: for "100 units", all the cores are blasting at near full speed. At the very end, in the most degenerate case, we might have a single core working alone on one last chunk while all the other cores idle. That would be a case where the optimal time is 100T and we take 101T, so we're within 1%.

If the above analysis is right, all you need is to get the number of cores and calculate the right K. I suspect Rayon probably already does something like this? [If they have the length of the list and the number of CPUs, it seems like they could already do this.]
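
As a sketch (the function is mine; rayon may well do something smarter internally):

    // Heuristic from above: aim for ~100 chunks per core,
    // i.e. K = ceil(N / (100 * C)).
    fn heuristic_chunk_size(n: usize) -> usize {
        let c = rayon::current_num_threads(); // C = number of worker threads
        let target_chunks = 100 * c;          // N/K = 100 * C
        (n + target_chunks - 1) / target_chunks
    }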

Have you tried with_min_len? (IndexedParallelIterator in rayon::iter - Rust)
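
It puts a floor on how finely rayon is allowed to split the iterator, so each stolen task handles at least that many elements. Something like this (the 64 * 1024 floor is just a number to tune, not a recommendation):

    use rayon::prelude::*;

    fn sum_with_floor(z: &[f64]) -> f64 {
        // Each split now handles at least 64 * 1024 elements, which
        // bounds the per-task scheduling overhead.
        z.par_iter().with_min_len(64 * 1024).sum()
    }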


Oh that's very nice!

I tried a little experiment, and the result surprised me. Summing 1 billion floating point values three different ways:

iter time                    = 774 ms
par_iter time                = 170 ms
par_chunks time (100 chunks) = 173 ms
used 32 threads
Code here
use rayon::prelude::*;
use rand_distr::{Distribution, Normal};
use rand_xoshiro::rand_core::SeedableRng;
use std::time::Instant;

fn main() {
    let mut rng = rand_xoshiro::Xoshiro256PlusPlus::seed_from_u64(0);
    let normal = Normal::new(0.0, 1.0).unwrap();
    let n = 1_000_000_000;
    let mut z = vec![0.0; n];
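
    // Fill z with standard-normal draws (sequentially; the RNG is stateful).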
    z.iter_mut().for_each(|zj| {
        *zj = normal.sample(&mut rng);
    });

    // sum using iter
    let t0 = Instant::now();
    let s = z.iter().sum::<f64>();
    let dt = t0.elapsed();
    assert!(s.is_finite());
    println!("iter time                    = {:?} ms", dt.as_millis());

    // sum using par_iter
    let t0 = Instant::now();
    let s = z.par_iter().sum::<f64>();
    let dt = t0.elapsed();
    assert!(s.is_finite());
    println!("par_iter time                = {:?} ms", dt.as_millis());

    // sum using par_chunks
    let chunk_size = n / 100;
    let t0 = Instant::now();
    let s = z.par_chunks(chunk_size).map(|zj| zj.iter().sum::<f64>()).sum::<f64>();
    let dt = t0.elapsed();
    assert!(s.is_finite());
    println!("par_chunks time (100 chunks) = {:?} ms", dt.as_millis());

    println!("used {} threads", rayon::current_num_threads());
}

Still, it would be nice to have a way to better understand the profile results.
