Rayon is slower than serial algorithm

Hello to everybody !

I need some help with this. I have the following code that takes 1s:

fn floyd_version(graph: &mut Array2<f32>) {
// let mut sum;
for k in 0..MAX {
    for i in 0..MAX {
        for j in 0..MAX {
            let sum = graph[[i, k]] + graph[[k, j]];
            if sum < graph[[i, j]] {
                graph[[i, j]] = sum;
            }
        }
    }
}

}

And i have this parallel code that takes 3s:

fn floyd_version_parallel(graph: &mut Array2<f32>) {
let num_threads = rayon::current_num_threads();
let s = MAX / num_threads;
let graph_mutex = Mutex::new(graph);
for k in 0..MAX {
    (0..num_threads).into_par_iter().for_each(|id| {
        let mut graph = graph_mutex.lock().unwrap();
        let init = s * id;
        let end = s * (id + 1);
        for i in init..end {
            for j in 0..MAX {
                let sum = graph[[i, k]] + graph[[k, j]];
                if sum < graph[[i, j]] {
                    graph[[i, j]] = sum;
                }
            }
        }
    });
}

}

What i'm doing bad ? In C using openmp and the same matrix, it takes 500ms and my parallel version it takes 3s. Any ideas ? I spent two days with this and i don't know what i have to do. Thank you !

Well, here is your problem. You have N threads waiting on the same lock to do some work.

5 Likes

Yep - in effect your threaded code is still running serially, just potentially not in the same order. So you're getting all of the overhead of threading with none of the benefits!

Taking a step back - are you running your original single-threaded code in release mode?

3 Likes

As mentioned in your other thread; you can't just blindly force algorithms to be run in multiple threads. They don't give consistent results or run times if reads and writes are dependent on some order.

There is no point comparing with serial, openmp or anything else.

Thanks for your answer !

Yes, i'm building with --release and running release. And yes, the problem is the lock, but i don't know how to avoid them, because it's the only way i could to modify data. What could i do ?

@jonh You are right but in this case Floyd can be parallelize, an example of openmp in c: method floyd warshall

In the openmp example you can see that the inner loops are parallelized and the outer one is not (the outer loop is inherently serial – its iteration depends on the previous step being finished).

You should take a step back – remove all the mutexes from your solution. Then, as a first step, try to convert the inner loop to use iterators instead of indexing. (I think the for j in 0..MAX loop is a good candidate). I have to admit I'm not sure how to achieve it, since you're using some library that provides Array2. Perhaps you should take a look at it docs – perhaps there's a rayon support built in? Anyway, if you manage to write the inner loop in more functional way, parallelizing with rayon should be a matter of changing iter_mut to par_iter_mut.

1 Like

ndarray does have rayon iterators, but you'll need to separate the mutations from those random-access reads. Perhaps you can "double-buffer" the iterations, so you read from a separate copy of the data from the previous k iteration.

The C version looks like it would have a data race in this regard. Rust won't let you do that.

1 Like

Hi, thanks again. This code works for me, but is approximately twice worse than openmp version:

fn floyd_version_parallel(graph_aux: &mut Array2<f32>) {
let graph_arc = Arc::new(&graph_aux);
let num_threads = rayon::current_num_threads();
let s = MAX / num_threads;
for k in 0..MAX {
    (0..num_threads).into_par_iter().for_each(|id| unsafe {
        let graph = graph_arc.clone();
        let init = s * id;
        let end = s * (id + 1);
        for i in init..end {
            for j in 0..MAX {
                let row = graph.uget((i, k));
                let col = graph.uget((k, j));
                let ij = graph.uget((i, j));
                let sum = *row + *col;
                if sum < *ij {
                    *(graph.as_ptr() as *mut f32).offset((i * MAX + j) as isize) = sum;
                }
            }
        }
    });
}

}

@cuviper but doing that i need to clone the matrix in each iteration, or not ?

This is a data race with undefined behavior. In practice, it probably just means you can't rely on whether or when exactly other threads will see the update, but this shouldn't be relied on.

It needs two copies of the data, but not necessarily reallocated each time. You can read the data from the previous iteration, and then write min(sum, ij) into the current iteration. I guess this can still be a conditional write if the "new" matrix already contains a value that's still minimal (from two iterations ago). On the next k, just mem::swap which matrix is considered old/new.

1 Like

Maybe in this specific scenario it would be more beneficial to use SIMD instead of regular looping?
This should be =< then C code...
rust SIMD

2 Likes

Thanks a lot !! I could improve the code. In this case, i don't use ndarray but a custom matrix. In this playground i share my code.

Comparing with openmp using this it taks around 10s more than openmp with n= 8192ñ

I really feel that this obviously could be improved. Any recommendations ?

@itemchenko i want to use SIMD but i don't understand how can i modify my code with that. Do you have some idea ?

Thank you to all !!

You cannot just trivially parallelize any algorithm. You need to use a parallel version of the algorithm. Wikipedia has an article https://en.wikipedia.org/wiki/Parallel_all-pairs_shortest_path_algorithm though it doesn't mention practical considerations and thus more research is needed if you want to have a practically fast implementation.

2 Likes

For SIMD you need to prepare data.
Rust's wrapper provides _mm256_add_ps function, which works with 256bit registries.
You have 32 bit floats, so 256 / 32 is you "Multiple Data" size.
Split your data into two input arrays, which will be used as addends and returns sum of those in same format.
Then you can do additional steps as usual flow.

Something like this:

use std::arch::x86_64::_mm256_add_ps;
for i in 0..incomming_data/8 {
    let a[f32; 8] = ...
    let b[f32; 8] = ...
    let c = unsafe {
        _mm256_add_ps(a, b)
    };
}
if{}else{} logic
1 Like

Dear all,

I would like to ask your opinion on a similar and possibly related doubt, about SIMD and Rayon.
I think Rust nicely exposes these parallelism options, and I'm trying to learn more on this.
I have three versions and I was expecting the one with SIMD parallelism over Rayon chunks to be the most optimized. It seems to me (system monitoring and asm) that this version keeps the SIMD optimization while using all my laptop CPUs. However, a simple for_each gives the same times. A key point I think is that the Rayon version with chunks doesn't use the CPUs at 100%. But I would appreciate your feedback on this as I'm really not an expert, moving from Python to Rust and lower-level programming aspects.
Below the playground link. Thank you for the helpful discussion in this forum.
https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=772d6ce32ea91b501c66f6c99991d2ce

It's best to create a new topic for a new issue. You can, of course, link back to this thread for reference.

1 Like