[SOLVED] Using rayon is 7x slower

#1

While trying to speed up the logic by introducing rayon, I got the opposite result: the rayon version is much slower.

I just switched:

``````
for i in y..y_end {
    for j in x..x_end {
        vector[i][j] = (vector[i][j] << 1) + 1;
    }
}
``````

to:

``````
vector.par_iter_mut().skip(y).take(h).for_each(|row| {
    for j in x..x_end {
        row[j] = (row[j] << 1) + 1;
    }
});
``````

Is there a good way to inspect what went wrong?

!!! Spoiler Alert !!!
full working code in gist!

did some perf sampling:

#2

How big are the patches? If the work to be done is simple enough, or there isn't enough of it, rayon's small amount of overhead will absolutely make things slower.

Edit: You might also be interested in http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206, but I don't think you're hitting that, if my reading of your indexing is correct.

#3

It was ~4ms vs ~32ms.

solution2 time: [4.5103 ms 4.5263 ms 4.5443 ms]
solution3 time: [32.312 ms 32.501 ms 32.730 ms]

#4

I think he meant how big is the range `x..x_end`.

But if your solution is taking 4ms overall, then I think it makes sense that introducing threading slows it down. Any threading, even highly tuned stuff like rayon, has overhead. If you aren't doing enough computation, that overhead won't be worth it.

4 milliseconds is an extremely short time, especially considering that you do IO within your benchmark?

Splitting the work up into `y_end - y` different tasks is only a valuable thing to do if each of those tasks takes at least a few milliseconds to complete, as I understand it.
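To make the "fewer, bigger tasks" idea concrete, here is a rough sketch that hands each thread a contiguous chunk of rows instead of one row per task. It uses `std::thread::scope` as a stand-in so it has no dependencies; it is not how rayon schedules work internally. The `u32` element type and the function names `update_sequential` and `update_chunked` are assumptions for illustration, not taken from the gist.

```rust
use std::thread;

/// Applies `v = (v << 1) + 1` to columns `x..x_end` of every row, on one thread.
fn update_sequential(grid: &mut [Vec<u32>], x: usize, x_end: usize) {
    for row in grid.iter_mut() {
        for v in &mut row[x..x_end] {
            *v = (*v << 1) + 1;
        }
    }
}

/// Same update, but the rows are split into at most `workers` contiguous
/// chunks and each chunk is processed by one scoped thread.
/// (Hypothetical helper; the element type `u32` is an assumption.)
fn update_chunked(grid: &mut [Vec<u32>], x: usize, x_end: usize, workers: usize) {
    if grid.is_empty() {
        return;
    }
    let workers = workers.max(1);
    // Round up so that no more than `workers` chunks are produced.
    let chunk_len = (grid.len() + workers - 1) / workers;
    thread::scope(|s| {
        for chunk in grid.chunks_mut(chunk_len) {
            s.spawn(move || update_sequential(chunk, x, x_end));
        }
    });
}
```

Even with only a handful of chunks, spawning threads has a fixed cost, so for a job that takes about 4 ms in total the sequential version may well still win. Measuring both is the only way to know.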

If you want to pursue it more though, could you possibly include your input in the gist? I haven't done advent of code and thus don't have my own input to put in.

I think the benchmark could probably be made more accurate if we do the file reading outside of the criterion bench_function call.
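As a rough illustration of keeping setup out of the measurement, the sketch below prepares the input once, clones it before each run, and starts the clock only around the work itself. It uses `std::time::Instant` rather than criterion so that it stands alone; `solve` and `time_runs` are hypothetical names, and `solve` is only a placeholder for the real solution.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real work: updates every cell of the grid.
fn solve(grid: &mut [Vec<u32>]) {
    for row in grid.iter_mut() {
        for v in row.iter_mut() {
            *v = (*v << 1) + 1;
        }
    }
}

/// Runs `work` on a fresh clone of `input` `runs` times and returns the
/// average time spent inside `work` only. Cloning is deliberately untimed.
fn time_runs<T: Clone>(input: &T, runs: u32, mut work: impl FnMut(&mut T)) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..runs {
        let mut data = input.clone(); // setup, outside the timed region
        let start = Instant::now();
        work(&mut data);
        total += start.elapsed();
    }
    total / runs.max(1)
}
```

In the real benchmark the file would be read and parsed once, before any timing starts, and something like `time_runs(&parsed, 100, |g| solve(g))` would then measure only the computation.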

#5

The `(x << 1) + 1` work inside the inner loop is also something I suspect can be auto-vectorised, and that will further exacerbate the relative overhead. It might even be able to do further optimisation around the outer loop too in the original case given contiguous layout.
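For anyone who wants to check this, a sketch of the two ways to write the inner loop follows. The slice form does one range check up front and then iterates without indexing, which tends to be easier for the compiler to vectorise, though whether it actually does depends on the compiler version and flags, so inspecting the generated assembly is the only sure test. The `u32` element type and both function names are assumptions.

```rust
/// Indexed form, as in the original post: each `row[j]` is bounds-checked
/// unless the compiler can prove the index is in range.
fn update_indexed(row: &mut [u32], x: usize, x_end: usize) {
    for j in x..x_end {
        row[j] = (row[j] << 1) + 1;
    }
}

/// Slice form: a single range check when taking `row[x..x_end]`,
/// followed by a plain iterator loop over the elements.
fn update_slice(row: &mut [u32], x: usize, x_end: usize) {
    for v in &mut row[x..x_end] {
        *v = (*v << 1) + 1;
    }
}
```

Both functions produce the same result; the difference is only in how much the optimiser has to prove before it can emit SIMD code.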

Regardless, unless `x .. x_end` is reasonably small, and `y .. y_end` is rather large, setup overhead is going to dominate.

#6

I tried making `x..x_end` 100 larger, and now it is faster with rayon.
