Calling Rayon Users: Opinions Wanted

Background

I'm contemplating a breaking change to the Rayon parallel iterator API and would appreciate some input. Currently, the parallel iterator API assumes that each individual work item is cheap, and therefore it tries to group together multiple work items onto one thread to reduce overhead. The weight() method can be used to control this, by making Rayon use a multiplier to consider some items as more expensive.

However, I've noticed in practice that most parallel iterators seem to just call weight_max(), which basically forces all items to be (potentially) launched onto their own threads. Moreover, a lot of people come to Rayon with a small list of expensive tasks and then get surprised when they don't see any parallelization (because they are not calling weight_max()).

To try and address this problem, I am contemplating changing the default so that Rayon assumes tasks are expensive by default. PR 81 deprecates weight() and weight_max() and introduces a new method, sequential_threshold(), which can be used instead. sequential_threshold() defaults to 1 but can be set higher to try -- when the number of items falls below the threshold, Rayon will try not to parallelize.

The questions at hand

  • Which do you think is the right default?
  • Can you think of a more elegant way to control the sequential cutoff point?

A couple of further thoughts

One problem with sequential threshold is that it does not compose as cleanly as weight did. For example, what should the threshold be if you chain two iterators together, and one has a threshold of 32 and the other 64? Really, the threshold makes the most sense as something you set as the last step only.

In theory, using Rayon with a threshold of 1 (that is, assuming expensive) is what you ought to do. We ought to tune the runtime so that this is low overhead and/or dynamically adapt to workloads. But in practice it doesn't always work out that way and I'm not sure that we'll be able to achieve it, so for the time being I'd like to keep the manual knob.

cc @cuviper

1 Like

Turns out that at least for me the biggest problem was documentation!

A story: after the simple_parallel crate had stopped compiling with stable rust, I spent some time (couple of hours at most) trying to replace a single parallel for in my ray tracer with par_iter(). But the performance was much worse than with simple_parallel. So after reading README.MD, glancing through API docs and trying a couple of random tricks I gave up. Needless to say .weight_max() was the thing I needed (I've learned about it just now from this post :slight_smile: )!

So, whatever the solution is, I think the problem itself should be described in readme :wink:

6 Likes

Yes, I fully agree with that. Just put two examples on the front page (README.MD):

  • one where the calculation is cheap, to describe the default behaviour.
  • a second one where the calculation is expensive where .weight() or weight_max() is used.

I think that makes it clear for everyone. So in my opinion no need to change / break compatibility.

Interestingly enough, I was experimenting with huge tasks / small tasks just a few hours ago with rayon. And I was quite surprised that the huge tasks were a lot slower than the smaller fine grained tasks. Pretty interesting coincidence that you posted this just now, which explains why the large tasks were so slow - I was not aware that the weights are that important. So I just tested my code again, and with weight_max() the large tasks are faster than the small tasks, while with weight_max() the smaller tasks are a lot slower than before. I'm not quite sure what the best solution to this is though.

I should have added one further thought, though: although the seq. threshold doesn't compose as cleanly, it has the advantage of being less abstract. The weight is kind of a cool notion but also a bit random, whereas the seq. threshold is very concrete.

1 Like

I use an adaptive partitioning algorithm in Async++, which is based on the one in TBB.

The basic idea is that you start off by splitting your initial range into N sub-ranges, where N is the number of CPUs in the system. After that, each time you are about to process a sub-range, you check if the thread ID is different from when you split the range. If it is different then it means that this subrange was stolen from another thread, so you split it another N times.

See the C++ code for details of the implementation.

5 Likes

Thanks for the link. I was planning on experimenting with some algorithms along these lines -- one other example I recall reading about is a heuristic that checks for whether the current task queue is empty, and splits if so, but otherwise avoids splitting. The idea being: if other threads are hungry and stealing work, then you should produce more work for them to take. =)

IMO what we need is a new abstraction, something like a Scheduler, such that:

  • one always needs to specify the scheduler of an invocation
  • one can provide a different scheduler to each invokation
  • one can provide user-defined schedulers (e.g. one that uses tokio, another one that uses some thread pool with work stealing, one that uses green-threads/coroutines, ...)
  • a default scheduler, on which tasks can easily be scheduled by just calling a single method like .schedule_default(), or maybe even implicitly (that is, nothing changes)
  • the ability to change the default scheduler at will for the whole program, and use a user-defined default scheduler instead.

Ideally, the library would uses futures, and the event loop in which the futures are schedulers should be configurable.

Note that other parallelism libraries like OpenMP already allow choosing different scheduling mechanism both globally and locally (e.g. static/dynamic scheduling can be chosen both for the whole program as a default, as well as differently for each parallel loop/task).

I doubt that something like weight/sequential_threshold is going to be enough, OpenMP has these and also has clauses like teams, distribute, ... that allow more fine-grained control. Since it is hard to say how far one should go, decoupling the scheduling mechanism from the library, would allow implementing and experimenting with different scheduling mechanisms that offer different levels of control.

I mean, for a default scheduler, does it really make sense to talk about sequential_threshold at all? I don't think that users using the default scheduler care, and writing a good default scheduler is hard. I think a good default scheduler should not assume / require any information like weight or sequential_threshold. A good default scheduler should time on program start the cost of a context switch between threads, the cost of spawning a thread, ... and have some internal heuristic based on a cost function for deciding how many tasks a thread gets based on how much time does a recent single task take.

Experimenting with things like these would be easier if we had some Scheduler trait, and users could implement their own.

1 Like

Strongly disagree. The primary purpose of rayon is to make it easy to convert sequential code to parallel code, by changing e.g. .iter() to .par_iter(). Adding more friction and overhead to this would remove much of the benefit of rayon.

1 Like

Quoting myself:

Why doesn't this address your concerns? As mentioned there the default scheduler could be used even implicitly if the user doesn't call some specific .schedule(scheduler) method.

What do you think about the friction introduced by weight/weight_max/sequential_threshold ?

I agree with gnzlbg's concept of a scheduler, although I think their proposal is a little too dramatic and ambitious. You could just have it control the way that items get sent to threads (and have no control over the threads themselves) and have two that emulate the current default and weight_max respectively, plus an adaptive scheduler, and use the latter by default and one of the first two via a method like cheap and expensive (or something) on the iterator. Also, I'd avoid setting a global default scheduler. Didn't your mother tell you about global mutable state?

I've been thinking about that, yes, but I don't think it's the full answer. What these thresholds are controlling is at a higher-level than the scheduler, effectively.

In any case, separating out the "backend" of a Rayon into a "semi-safe" thread scheduler is a goal of mine. I envision it being something like crossbeam: not a fully safe interface, but with the unsafety minimized as much as possible. This seems necessary to get the full advantages of Rayon (in particular, using exclusively stack allocations).

I opened PR106 based on this adaptive idea. It's looking good so far, but I'd love more folks to try it out. Note that it only uses the new way if there are no weights specified, so we could use feedback both on whether weighted code regresses and on how unweighted code now performs.