Hello, I'm having a problem with the following code:

fn floyd_version_parallel(graph: &mut Array2<f32>) {
    for k in 0..MAX {
        (0..NUM_THREADS).into_par_iter().for_each(|id| {
            for i in
                (id * (MAX / NUM_THREADS))..(id * (MAX / NUM_THREADS) + (MAX / NUM_THREADS) - 1)
            {
                for j in 0..MAX {
                    unsafe {
                        let sum = *graph.uget((i, k)) + *graph.uget((k, j));
                        if sum < *graph.uget_mut((i, j)) {
                            *graph.uget_mut((i, j)) = sum;
                        }
                    }
                }
            }
        });
    }
}

In this code I'm using ndarray and rayon. I want to parallelize the computation, so I used into_par_iter. My problem is with *graph.uget_mut((i, j)), because I need a mutable cell; the error is the following:

cannot borrow *graph as mutable, as Fn closures cannot mutate their captured variables

I could wrap the whole graph in a Mutex, but I only need to modify one cell at a time, (i, j). The other cells, (i, k) and (k, j), are only read.

To safely share data with rayon, any region of memory that you wish to mutate from multiple threads must be shared by having rayon's par_iter construct (or similar) split it, which makes it available as an argument to the closure. You can't just move the &mut T into the closure, because a mutable reference requires exclusive access to what it points at, and moving it to multiple threads normally would violate that.

For example, to add five to every element in an array, you must do this:

use rayon::prelude::*;

let mut v = vec![1, 2, 3, 4];
v.par_iter_mut().for_each(|x| {
    *x += 5;
});

The difference is that par_iter_mut ensures each call to the closure has access only to that one specific item in the vector. This means there are never several threads holding &mut T references that overlap, which is the issue in your version, where every spawned task holds a mutable reference to the whole array, and those references all overlap with each other.

I understand what you say. My problem is that I need mutable access to the same matrix from several threads at once. Wrapping the whole graph in a Mutex worked, but it runs about 4 times worse than the sequential algorithm. How can I remove the overhead of the locks?

Parallelizing algorithms with data dependencies criss-crossing like that can be very difficult. Take a look at the wiki link above, which talks about some approaches to do it.

@alice Yes, you are right, but for example: my sequential version takes 600 ms and my parallel version 3 seconds. And that is caused by the .lock().

I know that I will not get a correct result, but it's just an example:

pub struct MyData {
    data: Array2<f32>,
}

impl MyData {
    // Constructor taking ownership of the matrix:
    fn new(data: Array2<f32>) -> MyData {
        MyData { data }
    }
}

fn floyd_version_parallel(graph_aux: &mut Array2<f32>) {
    let graph = Mutex::new(MyData::new(graph_aux.clone()));
    for k in 0..MAX {
        (0..NUM_THREADS).into_par_iter().for_each(|id| {
            let mut graph = graph.lock().unwrap();
            for i in
                (id * (MAX / NUM_THREADS))..(id * (MAX / NUM_THREADS) + (MAX / NUM_THREADS) - 1)
            {
                for j in 0..MAX {
                    unsafe {
                        let sum = *graph.data.uget((i, k)) + *graph.data.uget((k, j));
                        if sum < *graph.data.uget((i, j)) {
                            *graph.data.uget_mut((i, j)) = sum;
                        }
                    }
                }
            }
        });
    }
}

This version takes at least 2 seconds longer than the following sequential version:

fn floyd_version(graph: &mut Array2<f32>) {
    let mut sum;
    for k in 0..MAX {
        for i in 0..MAX {
            for j in 0..MAX {
                unsafe {
                    sum = *graph.uget((i, k)) + *graph.uget((k, j));
                    if sum < *graph.uget((i, j)) {
                        *graph.uget_mut((i, j)) = sum;
                    }
                }
            }
        }
    }
}