Speeding up parallel iteration over large data

Hello! I am currently hitting some performance limitations with parallelization and am seeking advice on how to improve parallel performance.

My use case is:
I iterate over large sequences (>1M elements, total size > 1GB) and currently parallelize this with Rayon.
The sequence comes from a (potentially infinite) Iterator that is produced by a parser, so I use par_bridge to turn it into a parallel iterator. The whole thing looks like: seq.par_bridge().try_for_each(handle).
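
Concretely, the shape of the code is roughly as follows; parse_input and handle here are just placeholders for my actual parser and processing function:

use rayon::prelude::*;
use std::io;

// Placeholder for the parser: yields a long (possibly unbounded) stream of items.
fn parse_input() -> impl Iterator<Item = Vec<u8>> + Send {
    std::iter::repeat_with(|| vec![0u8; 1024]).take(1_000_000)
}

// Placeholder for the per-element processing; it can fail, hence try_for_each.
fn handle(item: Vec<u8>) -> io::Result<()> {
    let _ = item.len();
    Ok(())
}

fn main() -> io::Result<()> {
    parse_input()
        .par_bridge()          // turn the sequential iterator into a parallel one
        .try_for_each(handle)  // process items in parallel, stopping at the first error
}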

I noticed that parallel iteration incurs a significant overhead; this seems to come from the total size of the data (and not from the number of elements). To demonstrate this, I created a small test that builds some large data and processes it in parallel (Rust Playground):

#[derive(Clone)]
struct Tree(Vec<Tree>);

impl Tree {
    /// Builds a complete tree with branching factor `w` and height `h`.
    fn new(w: usize, h: usize) -> Self {
        if h == 0 {
            Self(Vec::new())
        } else {
            Self(vec![Tree::new(w, h - 1); w])
        }
    }
}

fn main() {
    use rayon::prelude::*;
    rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build_global()
        .unwrap();
    // 25 trees of exponentially growing size; the largest (i = 24) has 2^25 - 1 nodes.
    let trees = (0..25).map(|i| (i, Tree::new(2, i)));
    //let trees = trees.par_bridge(); // <-- uncomment this for benchmark #1
    //let trees = trees.collect::<Vec<_>>().into_par_iter(); // <-- uncomment this for benchmark #2
    trees.for_each(|(i, _tree)| println!("i: {}", i)); // <-- comment this for benchmark #3
    //chan_test(trees) // <-- uncomment this for benchmark #3
}

fn chan_test(iter: impl Iterator<Item = (usize, Tree)>) {
    // Bounded channel used to hand the trees off to two worker threads.
    let (s, r) = crossbeam_channel::bounded(2);
    crossbeam_utils::thread::scope(|scope| {
        scope.spawn(|_| r.iter().for_each(|_| println!("Thread 1")));
        scope.spawn(|_| r.iter().for_each(|_| println!("Thread 2")));
        iter.for_each(|(i, tree)| {
            println!("i: {}", i);
            s.send(tree).unwrap();
        });
        // Close the channel so that the worker threads terminate.
        drop(s);
    })
    .unwrap();
}

The results are (evaluated with hyperfine, compiled with cargo build --release):

Benchmark #0: sequential
  Time (mean ± σ):      5.194 s ±  0.051 s    [User: 4.744 s, System: 0.443 s]
  Range (min … max):    5.133 s …  5.289 s    10 runs

Benchmark #1: par_bridge()
  Time (mean ± σ):      6.628 s ±  0.660 s    [User: 9.968 s, System: 2.521 s]
  Range (min … max):    5.843 s …  7.177 s    10 runs

Benchmark #2: collect().into_par_iter()
  Time (mean ± σ):      6.006 s ±  0.162 s    [User: 6.916 s, System: 0.907 s]
  Range (min … max):    5.735 s …  6.217 s    10 runs

Benchmark #3: chan_test()
  Time (mean ± σ):      6.747 s ±  0.134 s    [User: 8.452 s, System: 0.433 s]
  Range (min … max):    6.468 s …  6.953 s    10 runs

We can see that all the parallel benchmarks (#1, #2, #3) are significantly slower than the sequential one (#0). Furthermore, the performance of benchmark #1 fluctuates strongly.
Is there a faster way to process my data concurrently, potentially without channels? Note that when processing an element of my sequence, I only need a reference to the element (here, &Tree).

In this small test, you're mostly benchmarking the contended stdout lock. There's not really any parallel work available here.

Thanks for your reply.
The benchmark (except for chan_test) prints only 25 lines, so I do not believe that the stdout lock notably influences performance.
Nonetheless, I redid the evaluation with println!(...) replaced by just (), and I obtained similar numbers to before.

And yes, this toy example does not perform any real parallel work; its point is only to show that there is significant overhead when transferring large data over channels, and to ask whether some more efficient technique can be used to process large data in parallel.

Hmm, I see, the actual work here is in building and then dropping each Tree. For example, when I run perf record on #2, perf report shows that more than half of the time is spent in libc.so's _int_free, and most of that on a single lock cmpxchg instruction. So parallelism is mostly hurting the allocator in that case.

       │2e8:   mov    %rdx,%rax
  0.11 │       xor    %rsi,%rax
  0.45 │       mov    %rax,0x10(%r12)
  0.49 │       mov    %rdx,%rax
  0.03 │       cmpl   $0x0,%fs:0x18
  0.55 │     ↓ je     302
  0.40 │       lock   cmpxchg %r12,(%rcx)
 67.95 │       cmp    %rax,%rdx
  0.21 │     ↑ je     130
       │       mov    %rax,%rdx
  2.18 │312:   cmp    %r12,%rdx
  0.08 │     ↑ jne    2e8

(It shows on the cmp because the Ryzen cycles event isn't precise, so the interrupt skids.)

That's the general advice I would offer -- get familiar with your system profiler so you can find where you are actually spending time, and how that differs between your parallel and serial code. Then you can think about how to mitigate those hotspots.

Thanks for looking so deeply into this!

The way I now understand it is: when running in parallel, the main thread basically clones the data in order to send it to the other thread, then frees its copy. The receiving thread then frees the data a second time after having processed it.
Do you also understand it that way?
In that case, is there a way to avoid the duplicated freeing?

No, there's no implicit clone; the tree and all of its heap ownership is sent as-is to the other threads. But the allocator has internal synchronization for its bookkeeping, so rapidly freeing many things from multiple threads can be slower than freeing the same amount from one thread.

So if that's the case in your real code as well, then maybe you can try to avoid multi-threading the allocation and freeing. You could construct them serially, then do your interesting work in parallel on references (like par_iter() or par_iter_mut()), and finally clean up again serially.
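
For the toy example above, that pattern could look roughly like this (just a sketch: it reuses the Tree type from the playground code, and process stands in for whatever per-element work you actually do):

use rayon::prelude::*;

// Stand-in for the real per-element work; note it only needs &Tree, not ownership.
fn process(tree: &Tree) -> usize {
    tree.0.len()
}

fn main() {
    // 1. Construct serially: all allocation happens on one thread.
    let trees: Vec<(usize, Tree)> = (0..25).map(|i| (i, Tree::new(2, i))).collect();

    // 2. Do the interesting work in parallel, borrowing the elements via par_iter().
    let total: usize = trees.par_iter().map(|(_i, tree)| process(tree)).sum();
    println!("total: {}", total);

    // 3. trees is dropped at the end of main, so all the freeing happens serially again.
}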

But you should measure perf, because the allocator might not be the real bottleneck in your workload. If there's enough "regular" parallel work to be done, then freeing from multiple threads might be spread out enough that they're not all storming the allocator at the same time.

Also, if you are hitting bottlenecks in the allocator, you might try a non-default one like MiMalloc (or one of the many other crates in the #allocator category on Lib.rs).
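
Swapping the global allocator is only a few lines; here is a minimal sketch with the mimalloc crate (assuming it has been added as a dependency in Cargo.toml):

// Route every allocation and free in the program through mimalloc.
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Everything below (and everything else in the program) now uses mimalloc.
    let v: Vec<u64> = (0..1_000_000).collect();
    println!("len: {}", v.len());
}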

Thank you for your interesting explanation about allocation!
I believe that, indeed, I have some perf measuring to do ...

@scottmcm, thank you so much for suggesting a different allocator! I just tried mimalloc, and it drastically improves performance even in the single-threaded case.
