Hello everybody!

I need some help with this. I have the following code that takes 1s:

```
fn floyd_version(graph: &mut Array2<f32>) {
    // let mut sum;
    for k in 0..MAX {
        for i in 0..MAX {
            for j in 0..MAX {
                let sum = graph[[i, k]] + graph[[k, j]];
                if sum < graph[[i, j]] {
                    graph[[i, j]] = sum;
                }
            }
        }
    }
}
```

And I have this parallel code, which takes 3s:

```
fn floyd_version_parallel(graph: &mut Array2<f32>) {
    let num_threads = rayon::current_num_threads();
    let s = MAX / num_threads;
    let graph_mutex = Mutex::new(graph);
    for k in 0..MAX {
        (0..num_threads).into_par_iter().for_each(|id| {
            let mut graph = graph_mutex.lock().unwrap();
            let init = s * id;
            let end = s * (id + 1);
            for i in init..end {
                for j in 0..MAX {
                    let sum = graph[[i, k]] + graph[[k, j]];
                    if sum < graph[[i, j]] {
                        graph[[i, j]] = sum;
                    }
                }
            }
        });
    }
}
```

What am I doing wrong? In C, using OpenMP and the same matrix, it takes 500ms, while my parallel version takes 3s. Any ideas? I have spent two days on this and I don't know what else to try. Thank you!
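
One possible direction, as a minimal sketch under stated assumptions: in the parallel version above, each task locks the `Mutex` for its whole body, so the tasks run one after another and the locking is pure overhead. Also, when `MAX` is not a multiple of `num_threads`, the last `MAX % num_threads` rows are never processed. The sketch below avoids the lock by giving every thread its own disjoint block of rows. To stay self-contained it swaps in standard-library scoped threads and a flat row-major `Vec<f32>` in place of rayon and `Array2`; the name `floyd_rows_parallel` and its parameters are hypothetical. It assumes the diagonal is non-negative (no negative cycles), so that working from a copy of row `k` gives the same result as the sequential loop.

```rust
use std::thread;

/// Floyd-Warshall on an n x n matrix stored row-major in a flat slice.
/// Hypothetical helper: rows are split into disjoint chunks, one per thread,
/// so no lock is needed.
fn floyd_rows_parallel(dist: &mut [f32], n: usize, num_threads: usize) {
    assert_eq!(dist.len(), n * n);
    if n == 0 {
        return;
    }
    let threads = num_threads.max(1);
    // Round up so that every row belongs to some chunk.
    let rows_per_chunk = ((n + threads - 1) / threads).max(1);

    for k in 0..n {
        // Snapshot of row k: workers read it while they mutate their own rows.
        let row_k: Vec<f32> = dist[k * n..(k + 1) * n].to_vec();
        let row_k = &row_k;

        thread::scope(|s| {
            // chunks_mut hands out non-overlapping mutable blocks of rows.
            for chunk in dist.chunks_mut(rows_per_chunk * n) {
                s.spawn(move || {
                    for row in chunk.chunks_mut(n) {
                        for j in 0..n {
                            let sum = row[k] + row_k[j];
                            if sum < row[j] {
                                row[j] = sum;
                            }
                        }
                    }
                });
            }
        });
    }
}
```

Calling it would look like `floyd_rows_parallel(&mut dist, MAX, 8)`, with `dist` holding `MAX * MAX` values. This sketch spawns fresh threads for every `k`, which has a cost of its own; a thread pool such as rayon's would avoid that, but the idea of splitting the matrix into disjoint mutable row blocks in place of a shared lock stays the same.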