[SOLVED] Using rayon is 7x slower


When I tried to make the logic faster by introducing rayon, the result turned out to be quite the opposite: the rayon version is much slower.

I just switched:

for i in y..y_end {
    for j in x..x_end {
        vector[i][j] = (vector[i][j] << 1) + 1;
    }
}

to:

vector.par_iter_mut().skip(y).take(h).for_each(|row| {
    for j in x..x_end {
        row[j] = (row[j] << 1) + 1;
    }
});

Is there a good way to inspect what went wrong?

!!! Spoiler Alert !!!
The link contains an answer to Advent of Code.
Full working code in gist!

did some perf sampling:


How big are the patches? If the work to be done is simple enough, or there isn’t enough of it, rayon’s small amount of overhead will absolutely make things slower.

Edit: You might also be interested in http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206, but I don’t think you’re hitting that, if my reading of your indexing is correct.


It was ~4ms vs ~32ms.

solution2 time: [4.5103 ms 4.5263 ms 4.5443 ms]
solution3 time: [32.312 ms 32.501 ms 32.730 ms]


I think he meant: how big is the range x..x_end?

But if your solution is taking 4ms overall, then I think it makes sense that introducing threading slows it down. Any threading, even highly tuned stuff like rayon, has overhead. If you aren’t doing enough computation, that overhead won’t be worth it.

4 milliseconds is an extremely short time, especially considering that you appear to do IO within your benchmark.

Splitting the work up into y_end - y different tasks is only a valuable thing to do if each of those tasks takes at least a few milliseconds to complete, as I understand it.
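
For what it's worth, the task-granularity point can be sketched without rayon at all. The following is a minimal illustration using only std::thread::scope (a stand-in, not how rayon schedules work internally); the function names and the n_threads parameter are hypothetical. It hands each thread one large chunk of rows instead of creating one task per row, and the sequential version is there as the baseline to compare against.

```rust
use std::thread;

/// Sequential baseline: the same update as the original nested loop.
fn bump_rows_seq(grid: &mut [Vec<u32>], y: usize, y_end: usize, x: usize, x_end: usize) {
    for row in &mut grid[y..y_end] {
        for cell in &mut row[x..x_end] {
            *cell = (*cell << 1) + 1;
        }
    }
}

/// Splits rows y..y_end into at most `n_threads` contiguous chunks and
/// updates each chunk on its own scoped thread (hypothetical helper).
fn bump_rows_chunked(
    grid: &mut [Vec<u32>],
    y: usize,
    y_end: usize,
    x: usize,
    x_end: usize,
    n_threads: usize,
) {
    let rows = &mut grid[y..y_end];
    if rows.is_empty() {
        return;
    }
    // Ceiling division; treating 0 threads as 1 keeps chunk_len >= 1.
    let n = n_threads.max(1);
    let chunk_len = (rows.len() + n - 1) / n;
    thread::scope(|s| {
        for chunk in rows.chunks_mut(chunk_len) {
            // Each thread owns a disjoint set of rows, so no locking is needed.
            s.spawn(move || {
                for row in chunk {
                    for cell in &mut row[x..x_end] {
                        *cell = (*cell << 1) + 1;
                    }
                }
            });
        }
        // All spawned threads are joined when the scope ends.
    });
}
```

Timing both on the real input should show where the crossover is; for small regions I would expect the sequential version to still win, since starting threads is not free.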

If you want to pursue it more though, could you possibly include your input in the gist? I haven’t done advent of code and thus don’t have my own input to put in.

I think the benchmark could probably be made more accurate if we do the file reading outside of the criterion bench_function call.
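
A rough sketch of that idea, using std::time::Instant in place of criterion (so the helper name time_work and its signature are made up here, not criterion's API): the input is read and parsed once by the caller, and only the closure runs inside the timed region.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `work` `iters` times on a fresh copy of the already-parsed input
/// and returns the average time spent inside `work` only.
/// Hypothetical helper, standing in for a real benchmark harness.
fn time_work<F>(input: &[Vec<u32>], iters: u32, mut work: F) -> Duration
where
    F: FnMut(&mut Vec<Vec<u32>>),
{
    let iters = iters.max(1); // avoid dividing by zero below
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        // The copy is made before the clock starts, so it is not measured.
        let mut grid = input.to_vec();
        let start = Instant::now();
        work(black_box(&mut grid));
        total += start.elapsed();
        // Keep the result observable so the work is not optimised away.
        black_box(&grid);
    }
    total / iters
}
```

With criterion itself, the equivalent would be to read the file before calling bench_function and run only the solution inside b.iter; I believe iter_batched is the intended tool when each iteration needs a fresh copy of the input.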

The (x << 1) + 1 work inside the inner loop is also something I suspect can be auto-vectorised, which will further exacerbate the relative overhead. The compiler might even be able to optimise around the outer loop too in the original case, given the contiguous layout.
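
As an illustration of the kind of loop a compiler tends to handle well, here is a sketch that assumes the grid is stored as one flat row-major buffer (the original uses a Vec of rows, so this layout, the function name and the width parameter are all assumptions). Iterating over sub-slices instead of indexing per element avoids the per-element bounds checks; whether it actually gets vectorised depends on the target and optimisation level.

```rust
/// Applies `v = (v << 1) + 1` to the rectangle rows y..y_end, columns x..x_end
/// of a row-major grid stored in a single contiguous slice.
/// Panics if `width` is 0 or if x..x_end does not fit inside a row.
fn bump_region_flat(
    cells: &mut [u32],
    width: usize,
    y: usize,
    y_end: usize,
    x: usize,
    x_end: usize,
) {
    let height = y_end.saturating_sub(y);
    for row in cells.chunks_exact_mut(width).skip(y).take(height) {
        // One bounds check per row, then a plain pass over a contiguous slice.
        for cell in &mut row[x..x_end] {
            *cell = (*cell << 1) + 1;
        }
    }
}
```

Checking the generated assembly for a release build is probably the only reliable way to confirm what the optimiser did with it.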

Regardless, unless x..x_end is reasonably large, and y..y_end is rather large, setup overhead is going to dominate.

I tried making x..x_end 100 larger, and now it is faster with rayon.