Dynamically using (or not using) rayon


#1

So here’s a dumb thing I just wrote:

use rayon::prelude::*;

impl GammaUnfolder {
    pub fn from_config(
        config: &Config,
        use_rayon: bool,
        superstructure: &Coords,
        sc_matrix: &ScMatrix,
        // eigenvector_q: &V3, // reduced by sc lattice
    ) -> GammaUnfolder
    {
        ...
        let func = |&quotient_q: &V3| {
            q_ket(
                &superstructure.to_carts(),
                &(eigenvector_q_cart + sample_q + quotient_q),
            )
        };
        let q_kets_by_sc = if use_rayon {
            quotient_vecs.par_iter().map(func).collect()
        } else {
            quotient_vecs.iter().map(func).collect()
        };
        ...
    }
}

Somewhere down the line in the same file I have a similar thing doing

    if use_rayon {
        a_iter.into_par_iter()
            .zip(b_iter)
            .for_each(|(a, b)| callback(a, b))
    } else {
        a_iter.into_iter()
            .zip(b_iter)
            .for_each(|(a, b)| callback(a, b))
    }

If these iterator chains got any larger, I’d have to start “factoring” them into macros instead!
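
(For the record, I imagine that macro “factoring” would look something like this sketch, which is hardly an improvement:)

    macro_rules! zip_for_each {
        ($a:expr, $b:expr, $cb:expr) => {
            $a.zip($b).for_each(|(a, b)| $cb(a, b))
        };
    }

    if use_rayon {
        zip_for_each!(a_iter.into_par_iter(), b_iter, callback)
    } else {
        zip_for_each!(a_iter.into_iter(), b_iter, callback)
    }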

Is there anything like a magic

vec.into_par_iter()
    .use_threads(false)

? Does with_min_len(usize::MAX) do the trick, or is that just going to make a bunch of extra spinning threads? (or maybe it does spawn a bunch of extra threads, but the unused ones will play nice and avoid fighting for CPU time?)
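
(i.e. so that the first example could collapse into a single chain, something like this sketch:)

    let q_kets_by_sc: Vec<_> = quotient_vecs
        .par_iter()
        .with_min_len(if use_rayon { 1 } else { usize::MAX })
        .map(func)
        .collect();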


My aim with these flags is really just to help control where threads are created. That is, an expensive function might be called from several places, some of which already have threads (or, e.g., call some C function that uses OpenMP); in those places, I think I should probably avoid creating another layer of threads, lest I end up with an exponential number of threads all fighting each other.


#2

Hmm, that’s an interesting dilemma.

This will mostly work, and the unused threads should go completely idle. There are exceptions, though: Chain will unconditionally join its two sides in parallel, and for FlatMap you’d have to apply with_min_len to the mapped iterator too.
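
For example, a FlatMap needs the with_min_len applied on both levels, something like this sketch:

    use rayon::prelude::*;

    let nested: Vec<Vec<u64>> = vec![vec![1, 2], vec![3, 4]];
    let total: u64 = nested
        .par_iter()
        .with_min_len(usize::MAX) // keep the outer iterator in one piece
        .flat_map(|inner| inner.par_iter().with_min_len(usize::MAX)) // and each inner one too
        .sum();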

The surest way to serialize it is to create a private single-thread “pool”. This will also have the effect of serializing any other rayon access within your iterator, for example calling rayon::join directly or nested parallel iterators. Something like:

fn maybe_parallel<T, F>(use_threads: bool, f: F) -> T
    where T: Send,
        F: FnOnce() -> T + Send,
{
    if use_threads {
        // running directly will let it use the current thread pool
        f()
    } else {
        // run it in a single-threaded "pool"
        let pool = rayon::ThreadPoolBuilder::new().num_threads(1).build().unwrap();
        pool.install(f)
    }
}
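
Then your first example can write the par_iter unconditionally and wrap the whole thing; a sketch, assuming the captured values are Send/Sync as needed:

    let q_kets_by_sc: Vec<_> = maybe_parallel(use_rayon, || {
        quotient_vecs.par_iter().map(func).collect()
    });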

Or if you don’t mind having a global effect on the program, you could just call this once:

rayon::ThreadPoolBuilder::new().num_threads(1).build_global().unwrap();

#3

I thought rayon just creates a number of threads depending on your CPU, and then executes its work in a work-stealing fashion using threads from the global pool? So deciding between non-parallel and parallel iterators shouldn’t be required.


#4

Oh, for this, Rayon’s global thread pool will only be created once for the life of the process, the first time it’s used. So there aren’t really any thread “layers” as far as that goes. And if you’re already in a Rayon threadpool, global or not, then all Rayon calls will continue using that, which is how the single-threaded hack I gave would work.
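
You can see the “current pool” behavior directly, for instance (a quick sketch):

    let pool = rayon::ThreadPoolBuilder::new().num_threads(1).build().unwrap();
    // rayon calls made inside install() see the installed pool, not the global one
    assert_eq!(pool.install(|| rayon::current_num_threads()), 1);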

This doesn’t help when competing with other thread pools like OpenMP, though.


#5

Ah, of course! I vaguely remembered having read about something like that before, but when it comes to my memory, my preconceptions seem to have a tendency to win out in the long run.

That is nice because, in general, most opportunities for parallelism in my code are of the “just toss in rayon” sort. And for instance, I can be certain (at least for now) that the particular example I posted above does not have to compete with OpenMP, because the control flow is “self-contained” (in that it takes no callbacks, and does the whole computation strictly to completion), and because I know my program only ever calls the C library from single-threaded contexts. (The library is not thread-safe!)

So it would seem that technically, in this circumstance, I could probably just write par_iter, forget about configurability, and call it a day.

…but I don’t know. Something feels unclean about having parallelism hidden behind a function call. I feel like there’s something valuable about being able to explicitly see a “threading” argument in high-level code like

let splines = reticulate_splines(
    &frogs,
    &knobs,
    Threading::Parallel,
)?;

But if explicitness is all I want in such cases, then… I think I could probably just pass around a threadpool. Which makes this trick:

The surest way to serialize it is to create a private single-thread “pool”. […]

awesomely convenient.


(or on second thought, combining the solutions might result in situations where I am both installing a threadpool globally and passing it down to other functions for them to install, which is weird. So I think I’ll drop the “pass around a threadpool” bit and just use your initial suggestion in earnest)
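
(Concretely, I picture the high-level flag dispatching to that trick, something like this sketch — Threading and its install method are names I’m making up, not rayon API:)

    pub enum Threading { Serial, Parallel }

    impl Threading {
        pub fn install<T, F>(&self, f: F) -> T
        where
            T: Send,
            F: FnOnce() -> T + Send,
        {
            match self {
                // use whatever pool we're already in
                Threading::Parallel => f(),
                // serialize by installing a private single-thread pool
                Threading::Serial => rayon::ThreadPoolBuilder::new()
                    .num_threads(1)
                    .build()
                    .unwrap()
                    .install(f),
            }
        }
    }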


#6

I’m still wondering whether there’s any way for Rayon to provide an adapter like this for an arbitrary ParallelIterator.

In the meantime, here’s an experimental wrapper that lets you be conditionally serial or parallel up front, then provides an API of some methods that are common between Iterator and ParallelIterator. It feels like this might be overkill, and I don’t even know if it works, but it at least builds… :laughing:
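
Roughly, it’s an enum that commits to serial or parallel up front and forwards a few adapters that both flavors share; a simplified sketch of the shape (not the actual experimental code):

    use rayon::prelude::*;

    // either a plain Iterator or a ParallelIterator, chosen up front
    enum MaybeParIter<S, P> {
        Serial(S),
        Parallel(P),
    }

    impl<T, S, P> MaybeParIter<S, P>
    where
        T: Send,
        S: Iterator<Item = T>,
        P: ParallelIterator<Item = T>,
    {
        // forward map to whichever flavor is inside
        fn map<U, F>(self, f: F) -> MaybeParIter<std::iter::Map<S, F>, rayon::iter::Map<P, F>>
        where
            U: Send,
            F: Fn(T) -> U + Sync + Send,
        {
            match self {
                MaybeParIter::Serial(it) => MaybeParIter::Serial(it.map(f)),
                MaybeParIter::Parallel(it) => MaybeParIter::Parallel(it.map(f)),
            }
        }

        // one terminal operation; others would follow the same pattern
        fn collect_vec(self) -> Vec<T> {
            match self {
                MaybeParIter::Serial(it) => it.collect(),
                MaybeParIter::Parallel(it) => it.collect(),
            }
        }
    }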


#7

That’s pretty clever! I didn’t think it would be possible to reduce it down to O(1) overhead for a serial iterator like that. (though in my case, I think the work items are always large enough that I am not too concerned; and besides, rayon has adaptive chunk sizes).

The level of necessary boilerplate is also surprisingly small for an abstraction over different breeds of iterators!