Hey there, this sounds like a nifty little puzzle to solve, so let's try to tackle it real quick.
OK, so the first thing I tried is to pull Rayon in here, because it's probably the easiest way out there to parallelize iterators. After adding Rayon and sprinkling in the into_par_iter() calls, we now have:
use rayon::prelude::*;

type Datatype = f32;
type Vector = [Datatype; N];

const N: usize = 1024;
const BLOCK_SIZE: usize = 16;
const DT: Datatype = 1_f32;

// Allocate the array directly on the heap so we don't blow the stack for large N.
fn get_vector() -> Box<Vector> {
    unsafe {
        let layout = std::alloc::Layout::new::<Vector>();
        let ptr = std::alloc::alloc_zeroed(layout) as *mut Vector;
        if ptr.is_null() {
            std::alloc::handle_alloc_error(layout);
        }
        Box::from_raw(ptr)
    }
}

fn main() {
    let mut xpos = get_vector();
    let mut ypos = get_vector();
    let mut zpos = get_vector();
    let mut forcesx = get_vector();
    let mut forcesy = get_vector();
    let mut forcesz = get_vector();

    (0..N).into_par_iter().step_by(BLOCK_SIZE).for_each(|ii| { // I need to parallelize this loop
        (0..N).into_par_iter().for_each(|j| {
            (ii..ii + BLOCK_SIZE).for_each(|i| {
                let dx = xpos[j] - xpos[i];
                let dy = ypos[j] - ypos[i];
                let dz = zpos[j] - zpos[i];
                forcesx[i - ii] += 5_f32;
                forcesy[i - ii] += 5_f32;
                forcesz[i - ii] += 5_f32;
            });
        });
    });
}
But now we run into some compiler errors:
error[E0596]: cannot borrow `forcesx` as mutable, as it is a captured variable in a `Fn` closure
--> src/main.rs:30:44
|
30 | (ii..ii + BLOCK_SIZE).for_each(|i| {
| ^^^ cannot borrow as mutable
...
35 | forcesx[i - ii] += 5_f32;
| ------- mutable borrow occurs due to use of `forcesx` in closure
What it's telling us here is that we can't hold a mutable reference to forcesx in all of the different parallel iterations at the same time, because that would mean multiple threads mutating the same data at once, which can cause data races. Beautiful: Rust and Rayon won't let us cause data races.
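(Quick side note: mutation itself is fine in Rayon as long as every parallel task gets exclusive access to a disjoint piece of the data. This little sketch compiles happily, because par_iter_mut hands each closure its own &mut element. It just doesn't fit our access pattern above, where, if I read the i - ii indexing right, every ii block writes the same 0..BLOCK_SIZE slots, so we need something else.)

use rayon::prelude::*;

fn main() {
    let mut forcesx = vec![0_f32; 1024];
    // Fine: each parallel task gets a disjoint `&mut f32`, so no two
    // threads ever touch the same element at the same time.
    forcesx.par_iter_mut().for_each(|f| *f += 5_f32);
}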
So we have to provide a way to have only one mutable reference to our forces. A good way (I think) to do this is with channels: we spawn one thread that we give ownership of our forces, and then, from the parallel loops that are doing the computations, we send that owning thread commands that mutate the forces. So the forces get mutated sequentially (by necessity, to prevent data races and undefined behavior), but the data is computed in parallel.
We're going to use crossbeam channels because they are popular, fast, and well featured, but you could probably use other channels as well. This takes a little bit more work, but it's not too bad:
use rayon::prelude::*;
use crossbeam_channel::{bounded, unbounded};

type Datatype = f32;
type Vector = [Datatype; N];

const N: usize = 1024;
const BLOCK_SIZE: usize = 16;
const DT: Datatype = 1_f32;

// Allocate the array directly on the heap so we don't blow the stack for large N.
fn get_vector() -> Box<Vector> {
    unsafe {
        let layout = std::alloc::Layout::new::<Vector>();
        let ptr = std::alloc::alloc_zeroed(layout) as *mut Vector;
        if ptr.is_null() {
            std::alloc::handle_alloc_error(layout);
        }
        Box::from_raw(ptr)
    }
}

#[derive(Debug)]
enum ForceCommand {
    AddX(usize, f32),
    AddY(usize, f32),
    AddZ(usize, f32),
}

fn main() {
    let mut xpos = get_vector();
    let mut ypos = get_vector();
    let mut zpos = get_vector();
    let mut forcesx = get_vector();
    let mut forcesy = get_vector();
    let mut forcesz = get_vector();

    // Create a channel for sending force mutations
    let (cmd_sender, cmd_receiver) = unbounded();
    // Create a channel for getting the resultant forces back out
    let (res_sender, res_receiver) = bounded(1);

    // Spawn the one thread that owns the forces and applies every mutation
    std::thread::spawn(move || {
        for message in cmd_receiver {
            match message {
                ForceCommand::AddX(index, value) => forcesx[index] += value,
                ForceCommand::AddY(index, value) => forcesy[index] += value,
                ForceCommand::AddZ(index, value) => forcesz[index] += value,
            }
        }
        // The loop above only ends once every sender is gone, so at this
        // point the forces are final: ship them back to the main thread.
        res_sender.send((forcesx, forcesy, forcesz)).unwrap();
    });

    (0..N).into_par_iter().step_by(BLOCK_SIZE).for_each(|ii| {
        // Clone the sender once per parallel task so each task gets its own
        // handle (cloning in the innermost loop works too, but this is cheaper)
        let cmd_sender = cmd_sender.clone();
        (0..N).into_par_iter().for_each(|j| {
            (ii..ii + BLOCK_SIZE).for_each(|i| {
                let dx = xpos[j] - xpos[i];
                let dy = ypos[j] - ypos[i];
                let dz = zpos[j] - zpos[i];
                // (the real force calculation would use dx, dy, and dz here;
                // 5_f32 is just a stand-in)
                cmd_sender.send(ForceCommand::AddX(i - ii, 5_f32)).unwrap();
                cmd_sender.send(ForceCommand::AddY(i - ii, 5_f32)).unwrap();
                cmd_sender.send(ForceCommand::AddZ(i - ii, 5_f32)).unwrap();
            });
        });
    });

    // Drop the sender: this makes sure the thread that is waiting for force
    // mutation commands stops looping once it finds out that all senders
    // have been dropped and there are no more commands to wait for.
    drop(cmd_sender);

    // Collect the finished forces back in the main thread.
    let (forcesx, forcesy, forcesz) = res_receiver.recv().unwrap();
}
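For reference, the only two dependencies this needs in Cargo.toml look something like the following (the version numbers are my guess at the time of writing, so pin whatever is current for you):

[dependencies]
rayon = "1"               # parallel iterators
crossbeam-channel = "0.5" # the channels used above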
Now we've got it parallel! Take note that I'm not sure all of this parallelism will actually make things faster; you'll have to benchmark it in your application. For instance, I put two nested parallel iterators in that example, but it might be more efficient to make just the outer one parallel so there is less scheduling and context-switching overhead, as in the sketch below.
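Here's that variant: the same loop nest as above, just with the inner into_par_iter() swapped for a plain sequential iterator. Whether it actually wins is a guess on my part, so measure both:

    (0..N).into_par_iter().step_by(BLOCK_SIZE).for_each(|ii| {
        // One sender clone per parallel task, same as before
        let cmd_sender = cmd_sender.clone();
        // The j loop is now sequential: each parallel task walks every j
        // for its own block of i values on a single thread.
        (0..N).for_each(|j| {
            (ii..ii + BLOCK_SIZE).for_each(|i| {
                let dx = xpos[j] - xpos[i];
                let dy = ypos[j] - ypos[i];
                let dz = zpos[j] - zpos[i];
                cmd_sender.send(ForceCommand::AddX(i - ii, 5_f32)).unwrap();
                cmd_sender.send(ForceCommand::AddY(i - ii, 5_f32)).unwrap();
                cmd_sender.send(ForceCommand::AddZ(i - ii, 5_f32)).unwrap();
            });
        });
    });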
Hope this helps!