Parallelism, choosing between or combining Rayon and SIMD

Playground

Hello,
I like how Rust nicely exposes these parallelism options, and I'm trying to learn more about this. The playground link shows 3 versions; I was expecting the one with SIMD parallelism over Rayon chunks to be the most optimized. It seems to me (from system monitoring and the asm) that this version keeps the SIMD optimization while using all my laptop's CPUs. However, a simple for_each gives the same times. I think it is relevant that the Rayon version with chunks doesn't use the CPUs at 100%, unlike the pure SIMD version with a single auto-vectorized iterator. But I would appreciate your feedback on this, as I'm really not an expert; I'm moving from Python to Rust and lower-level programming. For example, here are some questions that I hope may be of general interest. Is it because the CPU isn't the actual bottleneck? How would it change if, rather than chunks over a single vector, there were separate vectors? In general, do we expect different behavior with slices, because of possibly faster access?
Note that the allocation is about 3/4 of the entire execution time (see playground).

Thank you for the numerous threads and support. If you find I missed something, please simply reference it here.

Other relevant links
No speedup for parallel loop with rayon
Youtube Rayon
Rayon slower than serial algorithm

1 Like

Rayon vs SIMD is not an exclusive choice. For fast algorithms you're going to need both at the same time.

Splitting work given to Rayon into large chunks is the correct approach. This is necessary for autovectorization to be effective.

Keep in mind that in modern computers RAM is the slowest component (RAM is the new disk). Simple operations are going to be limited by the speed of RAM, so adding more cores doesn't help until you perform enough computation for the CPU to become the bottleneck again. In your case, initialization of the vector with 1. and the multiplication are both limited by the speed of RAM.

You're going to need at least dozens of operations per element to notice a difference.
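
For example, something along these lines (not from your playground; the chunk size and the polynomial per-element work are arbitrary, just there to illustrate "large chunks for Rayon outside, plain auto-vectorizable loop inside"):

use rayon::prelude::*;

// Hypothetical example: enough arithmetic per element that the CPUs,
// not RAM, become the bottleneck.
pub fn heavy_map(input: &[f32], output: &mut [f32]) {
    // Large chunks: Rayon schedules a handful of big tasks, and the
    // inner iterator stays a plain loop LLVM can auto-vectorize.
    output
        .par_chunks_mut(64 * 1024)
        .zip(input.par_chunks(64 * 1024))
        .for_each(|(out_chunk, in_chunk)| {
            for (o, &x) in out_chunk.iter_mut().zip(in_chunk) {
                // Dozens of operations per element (a small polynomial
                // evaluation here) so the work dominates the memory traffic.
                let mut acc = x;
                for _ in 0..16 {
                    acc = acc * x + 1.0;
                }
                *o = acc;
            }
        });
}

With a trivial body instead (say, a single multiply), the serial and the Rayon versions end up at roughly the same, memory-bound, speed.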

7 Likes

Well. If I take this code:

use fast_floats::*;

pub fn dot_product(xs: &[f32], ys: &[f32]) -> f32 {
    xs.iter()
        .zip(ys)
        .fold(Fast(0.), |acc, (&x, &y)| acc + Fast(x) * Fast(y))
        .get()
}

pub fn convolution_serial(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    sample
        // Get sequence of windows that "slide" over the sample data
        .windows(coeff.len())
        // Form the dot product of every window in the sequence
        .map(|window| {
            window
                .iter()
                .zip(coeff)
                .fold(Fast(0.), |acc, (&x, &y)| acc + Fast(x) * Fast(y))
                .get()
        })
        // Map produces an iterator so we have to assemble a vector from that.
        .collect()
}

It runs as fast as anything you can do in C, making use of the available vector instructions of your machine.

By simply changing ".windows" to ".par_windows" it spreads the work over multiple cores, and in my tests it scales linearly to at least 8 cores (wish I had more).

And yes, it maxes all cores out to 100% whilst working.

use rayon::prelude::*;

pub fn convolution_parallel(sample: &[f32], coeff: &[f32]) -> Vec<f32> {
    sample
        // Get sequence of windows that "slide" over the sample data
        .par_windows(coeff.len())
        // Form the dot product of every window in the sequence
        .map(|window| dot_product(window, coeff))
        // Map produces an iterator so we have to assemble a vector from that.
        .collect()
}

Note: Don't be confused by that "dot_product" function above; it all gets inlined to effectively the same code.

Of course this all depends on the size of the data set you are working on. I time the above with 20 million element arrays. If your data is small enough the overheads of scheduling cores to do the work take more time than doing the work!
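
In case it's useful, this is roughly the kind of harness I time it with (it assumes the two functions above are in scope; the sizes here are just what I happened to use, and of course build with --release):

use std::time::Instant;

fn main() {
    // 20 million samples, small coefficient kernel.
    let sample: Vec<f32> = (0..20_000_000).map(|i| (i % 100) as f32).collect();
    let coeff: Vec<f32> = vec![0.5; 16];

    let t = Instant::now();
    let serial = convolution_serial(&sample, &coeff);
    println!("serial:   {:?}", t.elapsed());

    let t = Instant::now();
    let parallel = convolution_parallel(&sample, &coeff);
    println!("parallel: {:?}", t.elapsed());

    assert_eq!(serial.len(), parallel.len());
}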

2 Likes

Thank you kornel for linking that interesting read; it addressed my doubt about the CPU not being the actual bottleneck. I fully agree about the possible benefits of combining SIMD and Rayon, but your comment on needing more operations per element seems to support my other doubts.

Rust/LLVM can easily auto-vectorize when there are few/simple operations per element, but then you hit the RAM limit, so Rayon would not give much additional benefit in that case. On the other hand, Rayon becomes more and more beneficial as the number and complexity of operations (per element) increases, but this also makes it harder and harder for Rust/LLVM to auto-vectorize those same operations. If the operations are complex enough, though, they will likely keep the CPUs busy even without SIMD parallelism, so "choosing" to focus on Rayon seems OK.

ZiCog shows that there are cases where the operation(s) are actually CPU-intensive enough and still auto-vectorized. In the other cases, with complex and intensive operations, the SIMD parallelism could be manually ensured with SIMD intrinsics. But this is beyond the scope of the thread, I guess.
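
Just to make that concrete, something like this rough SSE sketch is what I have in mind (x86_64 only; dot_product_sse is a made-up name, I haven't benchmarked it, and the auto-vectorized version may well be as fast or faster):

#[cfg(target_arch = "x86_64")]
pub fn dot_product_sse(xs: &[f32], ys: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    assert_eq!(xs.len(), ys.len());
    let chunks = xs.len() / 4;
    // SAFETY: SSE is part of the x86_64 baseline, and every load stays
    // within the bounds asserted above.
    unsafe {
        let mut acc = _mm_setzero_ps();
        for i in 0..chunks {
            let a = _mm_loadu_ps(xs.as_ptr().add(i * 4));
            let b = _mm_loadu_ps(ys.as_ptr().add(i * 4));
            // Four multiply-adds per iteration, one per SIMD lane.
            acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
        }
        // Horizontal sum of the four lanes.
        let mut lanes = [0.0f32; 4];
        _mm_storeu_ps(lanes.as_mut_ptr(), acc);
        let mut sum: f32 = lanes.iter().sum();
        // Scalar tail for lengths that are not a multiple of 4.
        for i in chunks * 4..xs.len() {
            sum += xs[i] * ys[i];
        }
        sum
    }
}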
Thank you ZiCog, I'm actually a geophysicist and this is quite relevant.

Thank you both for your help! While I recognize that the above is just a first generalization, it is a good starting point for me.

1 Like

I am no expert in "go faster" numeric computations. But it was interesting to find that the suggestions for accelerating that convolution manually, with intrinsics or whatever (I forget the details), turned out to be slower than convincing the compiler to sort it out. The bonus being that if the compiler can do it, the code still works on other architectures that don't support the same SIMD instructions.

1 Like

One thing to watch out for is that a CPU can be listed at 100% usage but still not be running at max capacity because it's waiting on memory. No idea if that's happening in your case, but here's a blog post discussing it a bit.

CPU Utilization is Wrong

2 Likes

Interesting. I don't normally pay much attention to CPU usage, like the output of 'top', for things like that convolution, except to check that it is actually using as many cores as I think it should.

At the end of the day I time it. Faster is better!

Perhaps getting down to more detail might be useful if you suspect your code is not cache friendly.

Thanks for the clarifications on the specific case. I agree with the general idea as well. I meant that intrinsics could be used to guarantee that the operations of each (Rayon) loop take advantage of SIMD parallelism when their complexity increases (i.e., when convincing the compiler becomes harder but using Rayon makes more sense, as the CPUs are busy). You likely saw this too:
auto-vectorization
cheers

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.