`par_iter` not running in parallel

KillTheMule · November 18, 2021, 4:06pm

I thought this would be an easy case, and indeed I got my loop converted to par_iter pretty easily. However, it does not seem to do a lot really... so here's the loop:

    let ress: Vec<_> = map.par_iter()
      .map(|(name, data)| {
        eprintln!("{} now", name);
        self
          .vorlage
          .render(&data, dir)
      })
      .collect();

It compiles and runs. However, while I do see rayon spawn as many threads as I have CPUs, only ever one loop body is executed at any time (the eprintln gives strong hints, as the function called runs several seconds for each). It's kinda easy to see this because render results in an external command (pdflatex) and one can see there's never more than one external process running.

So, how do I diagnose this, what could be the error? The full code is a bit involved and closed, but since the compiler accepts this, I don't see what could keep things from being parallelized. I do not have any synchronization in place (this was serial code after all, I did not rework a lot).

Some speculation on my side, which feels wrong, but I don't know how to investigate:

Each thread creates its own tempfile::tempdir to work in. Does that imply some sort of block?
Each thread copies some files from the main directory into the tmpdir via fs_extra::copy_items before doing work. Afterwards, it copies some results back out into the main directory. Could consecutive calls of copy_items somehow lock a directory or something?
data is a hashbrown::HashMap with the rayon and serde features. It contains shared references into two other such HashMaps that live in a much wider scope. Seeing that this was made specifically for use with rayon's ParallelIterator, I don't see how that would hurt, but I didn't really find much mention of pitfalls in the hashbrown docs.

I'd be glad for any pointers. I can of course show some more type definitions or functions, but this is long enough as-is, so maybe someone just has an idea to throw out. Thanks for reading, anyways.

steffahn · November 18, 2021, 4:32pm

AFAIK, rayon is more meant for CPU-bound tasks; for parallelizing IO-stuff, using async could be another option. Nonetheless, your parallel iterator should work in principle.

Looking at its documentation, I don’t feel like tempdir is a problem.

Skimming through the source of copy_items I don’t see a problem either.

Maybe you could debug this, e.g. by using a global variable to single out the first invocation of render and make its implementation block indefinitely at some place. If you find a place where such an indefinitely blocking operation manages to stop the whole program (including the other threads), then you know that there really is a problem, and you can move around the place where the block is introduced to single out the operation that is responsible.

steffahn · November 18, 2021, 4:37pm

Something like

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
fn block_on_first_call() {
    static FIRST_CALL: AtomicBool = AtomicBool::new(true);
    let first_call = FIRST_CALL.swap(false, Ordering::SeqCst);
    if first_call {
        loop {
            thread::park()
        }
    }
}

then insert a call to block_on_first_call() somewhere in your code and see if it locks up the whole thread pool; preferably in the part of the render operation that takes the longest time.
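A complementary way to measure the same thing, without freezing the program, is to count how many closure bodies are running at once. The following is only a minimal sketch using the standard library: `ConcurrencyProbe`, `ProbeGuard` and `probe_demo` are hypothetical names that do not appear in this thread, and the demo uses plain `std::thread` threads rather than rayon so that it has no external dependencies.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Barrier;
use std::thread;

/// Hypothetical helper (not from the thread): counts how many callers are
/// inside a guarded region at once and remembers the highest count seen.
pub struct ConcurrencyProbe {
    current: AtomicUsize,
    max_seen: AtomicUsize,
}

impl ConcurrencyProbe {
    pub const fn new() -> Self {
        Self {
            current: AtomicUsize::new(0),
            max_seen: AtomicUsize::new(0),
        }
    }

    /// Call at the top of the closure body; the region ends when the
    /// returned guard is dropped.
    pub fn enter(&self) -> ProbeGuard<'_> {
        let now = self.current.fetch_add(1, Ordering::SeqCst) + 1;
        self.max_seen.fetch_max(now, Ordering::SeqCst);
        ProbeGuard { probe: self }
    }

    /// Highest number of simultaneously active guards observed so far.
    pub fn max_seen(&self) -> usize {
        self.max_seen.load(Ordering::SeqCst)
    }
}

pub struct ProbeGuard<'a> {
    probe: &'a ConcurrencyProbe,
}

impl Drop for ProbeGuard<'_> {
    fn drop(&mut self) {
        self.probe.current.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Demo with plain std threads: the barrier forces all `threads` workers to
/// be inside the guarded region at the same time.
pub fn probe_demo(threads: usize) -> usize {
    let probe = ConcurrencyProbe::new();
    let barrier = Barrier::new(threads);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                let _guard = probe.enter();
                barrier.wait();
            });
        }
    });
    probe.max_seen()
}
```

In a loop like the one in the question, one could keep a `static PROBE: ConcurrencyProbe = ConcurrencyProbe::new();`, call `let _guard = PROBE.enter();` as the first statement of the closure, and print `PROBE.max_seen()` after `collect()`. A value of 1 would indicate that the closure bodies never overlapped.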

cole-miller · November 18, 2021, 4:45pm

I don't know whether this is the issue, but eprintln! does involve locking a mutex.

KillTheMule · November 18, 2021, 5:07pm

Very good idea. But the result is kinda what I expected: The closure body of the map above is called sequentially (i.e. when I put a call to block_on_first_call there instead of the eprintln!, it just immediately locks up).

But this all happens without the eprintln too; it was just a convenient way to show the sequentiality (is that a word?).

steffahn · November 18, 2021, 5:20pm

It doesn’t even work when the block_on_first_call() is the first thing in the closure? Weird... for me, locally, something like

    let _: Vec<_> = [1, 2].par_iter().map(|i| {
        block_on_first_call();
        println!("{}", i);
    }).collect();

does manage to print a number.

KillTheMule · November 18, 2021, 5:31pm

Well, that did it. I replaced hashbrown::HashMap by Vec as the type for map, and now everything's fine! Runtime of my example test went from 18s to 7s, which is about what I'd expect in this case.

Not sure why it's the case though, I looked through some comments in hashbrown, but there was no mention of this behavior.

steffahn · November 18, 2021, 5:54pm

Interesting. Seems like

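// Assumption: HashSet is hashbrown's, built with its `rayon` feature.
use hashbrown::HashSet;
use rayon::prelude::*;

fn main() {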
    let m = <HashSet<_>>::from_iter(0..15);
    let _: Vec<_> = m.par_iter().map(|i| {
        block_on_first_call();
        println!("{}", i);
    }).collect();
}

uses at least 2 threads in parallel, whereas

fn main() {
    let m = <HashSet<_>>::from_iter(0..14);
    let _: Vec<_> = m.par_iter().map(|i| {
        block_on_first_call();
        println!("{}", i);
    }).collect();
}

does not. So a hash map with at most 14 elements doesn’t use more than one thread. Perhaps the ParallelIterator implementation of HashMap/HashSet somehow sets a lower bound on how small the parts are that the map can be split into?

steffahn · November 18, 2021, 6:04pm

Skimming through the code, I arrive at this part: mod.rs - source

which does indeed suggest a presence of a minimal size. Probably has some valid reasons behind it.

Anyways, as I mentioned, rayon is probably the wrong tool. If you want to try, consider using async fns, e.g. with tokio. You could use StreamExt::buffered to determine up to how many operations you want to do in parallel.

KillTheMule · November 18, 2021, 6:34pm

Hmm, are you sure? I mean it's pretty light on the IO, what's really eating time is the external process (mostly CPU bound, though I don't think that matters). I mean I guess async works as well since the OS will schedule the processes appropriately...

Anyways, seeing I've got it working thanks to you, and I really don't need to optimize the rust code itself very much, I don't think I want to spend time converting to async here. Needs other work, too, and I'm on a deadline

Thanks a lot for your help!

cuviper · November 18, 2021, 6:59pm

If you want more granular parallelism for smaller maps and sets, you could collect item references to a Vec first. That's what rayon does anyway for the std types, since it can't do splits on the internals -- that's painful for large lengths but probably fine in small cases.
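As a rough illustration of the "collect references first" idea, here is a std-only sketch. With rayon one would presumably write `map.iter().collect::<Vec<_>>()` and then call `par_iter()` on that Vec; the version below swaps rayon for `std::thread::scope` purely so that it runs without extra crates. The function name `render_all`, the `String` key and value types, and the `workers` parameter are assumptions made for the example, not part of the original code.

```rust
use std::collections::HashMap;
use std::thread;

/// Collects `(key, value)` references into a Vec, splits that Vec into
/// contiguous chunks, and processes each chunk on its own scoped thread.
/// `render` stands in for whatever per-entry work is needed.
pub fn render_all<F>(
    map: &HashMap<String, String>,
    workers: usize,
    render: F,
) -> Vec<String>
where
    F: Fn(&str, &str) -> String + Sync,
{
    // A Vec of references can be split at any index, unlike the map itself.
    let items: Vec<(&String, &String)> = map.iter().collect();
    if items.is_empty() {
        return Vec::new();
    }

    // Ceiling division, so at most `workers` chunks are created.
    let workers = workers.max(1);
    let chunk_len = (items.len() + workers - 1) / workers;
    let render = &render;

    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|(k, v)| render(k.as_str(), v.as_str()))
                        .collect::<Vec<String>>()
                })
            })
            .collect();

        // Join in spawn order and flatten the per-chunk results.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

The results come back in the map's iteration order, which is unspecified, so callers that need a stable order would have to sort afterwards. The extra Vec costs one pointer pair per entry, which matches the remark above that this is fine for small maps but painful for large ones.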

