[SOLVED] Using rayon is 7x slower

rockmen1 · February 12, 2019, 3:55am

When I tried to make the logic faster by introducing rayon, the result turned out to be quite the opposite: rayon is much slower.

I just switched:

for i in y..y_end {
    for j in x..x_end {
        vector[i][j] = (vector[i][j] << 1) + 1;
    }
}

to:

vector.par_iter_mut().skip(y).take(h).for_each(|row| {
    for j in x..x_end {
        row[j] = (row[j] << 1) + 1
    }
});

Is there a good way to inspect what went wrong?

!!! Spoiler Alert !!!
link contains an answer to advent of code
full working code in gist!

did some perf sampling:

scottmcm · February 12, 2019, 4:15am

How big are the patches? If the work to be done is simple enough, or there isn't enough of it, rayon's small amount of overhead will absolutely make things slower.

Edit: You might also be interested in Eliminate False Sharing | Dr Dobb's, but I don't think you're hitting that, if my reading of your indexing is correct.

rockmen1 · February 12, 2019, 4:19am

It was ~4ms vs ~32ms.

solution2 time: [4.5103 ms 4.5263 ms 4.5443 ms]
solution3 time: [32.312 ms 32.501 ms 32.730 ms]

daboross · February 12, 2019, 5:10am

I think he meant how big is the range x..x_end.

But if your solution is taking 4ms over all, then I think it makes sense that introducing threading slows it down. Any threading, even highly tuned stuff like rayon, has overhead. If you aren't doing enough computation, that overhead won't be worth it.

4 milliseconds is an extremely short time, especially considering that you do IO within your benchmark?

Splitting the work up into y_end - y different takes is only a valuable thing to do if each of those tasks takes at least a few milliseconds to complete, as I understand it.
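To illustrate the granularity point, here is a minimal sketch that splits the rows into a few contiguous chunks, one per thread, instead of one task per row. It uses only std::thread::scope rather than rayon, and the function name `update_patch`, the `n_threads` parameter, and the `u32` element type are assumptions for illustration, not taken from the gist.

```rust
use std::thread;

/// Applies `v = (v << 1) + 1` to rows y..y_end, columns x..x_end,
/// splitting the rows into at most `n_threads` contiguous chunks.
/// (Hypothetical helper; element type u32 is an assumption.)
fn update_patch(
    vector: &mut [Vec<u32>],
    (x, x_end): (usize, usize),
    (y, y_end): (usize, usize),
    n_threads: usize,
) {
    let rows = &mut vector[y..y_end];
    if rows.is_empty() {
        return;
    }
    // Ceiling division, so we spawn at most `n_threads` threads.
    let n = n_threads.max(1);
    let chunk_len = (rows.len() + n - 1) / n;
    thread::scope(|s| {
        for chunk in rows.chunks_mut(chunk_len) {
            // Each thread owns a disjoint block of rows.
            s.spawn(move || {
                for row in chunk {
                    for v in &mut row[x..x_end] {
                        *v = (*v << 1) + 1;
                    }
                }
            });
        }
    });
}
```

Even with this coarser split, spawning threads costs far more than a handful of shifts and adds, so for a patch that finishes in a few milliseconds the serial loop may well still win.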

If you want to pursue it more though, could you possibly include your input in the gist? I haven't done advent of code and thus don't have my own input to put in.

I think the benchmark could probably be made more accurate if we do the file reading outside of the criterion bench_function call.
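As a rough sketch of that idea without criterion, the input can be loaded once up front and only the computation timed. Here `solve` is a hypothetical stand-in for the real solution and `mean_time` is an assumed helper name; neither comes from the gist.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the real solution: it works on text that
// has already been loaded, so it performs no IO itself.
fn solve(input: &str) -> usize {
    input.lines().filter(|line| !line.trim().is_empty()).count()
}

/// Runs `solve` `iters` times and returns the last result together with
/// the mean time per call. The caller reads the file beforehand, so no
/// IO falls inside the timed region.
fn mean_time(input: &str, iters: u32) -> (usize, Duration) {
    assert!(iters > 0, "need at least one iteration");
    let mut result = 0;
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimiser from removing the repeated calls.
        result = black_box(solve(black_box(input)));
    }
    (result, start.elapsed() / iters)
}
```

With criterion the equivalent change would be to read the file before calling bench_function and to call only the solution inside the closure passed to b.iter.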

dcarosone · February 12, 2019, 5:18am

The (x << 1) + 1 work inside the inner loop is also something I suspect can be auto-vectorised, and that will further exacerbate the relative overhead. It might even be able to do further optimisation around the outer loop too in the original case given contiguous layout.
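A sketch of the serial loop written over sub-slices, which checks the bounds once per row and leaves a simple inner loop for the optimiser. The name `update_rows` and the `u32` element type are assumptions, and whether the loop is actually vectorised depends on the compiler and target, so checking the generated assembly is the only way to be sure.

```rust
/// Serial version: applies `v = (v << 1) + 1` to rows y..y_end,
/// columns x..x_end. (Hypothetical helper; u32 is an assumption.)
fn update_rows(vector: &mut [Vec<u32>], x: usize, x_end: usize, y: usize, y_end: usize) {
    for row in &mut vector[y..y_end] {
        // Slicing once means the range is bounds-checked once per row,
        // not once per element.
        for v in &mut row[x..x_end] {
            *v = (*v << 1) + 1;
        }
    }
}
```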

Regardless, unless x .. x_end is reasonably small, and y .. y_end is rather large, setup overhead is going to dominate.

rockmen1 · February 12, 2019, 6:32am

I tried making x..x_end 100 larger, and now it is faster with rayon.
