Help with parallelizing a nested loop

Hi all. I am implementing a simple n-body simulator using the discrete element method (DEM). Here is a sample template of it:

Source

Most of my code is very similar to what is in that file. Most of the runtime is spent in the contact_force function, which is an O(n^2) algorithm. I have seen some crates for parallelization but had a little difficulty applying them to nested loops.

Since my software only has this type of structure, I just want to know a single way of parallelizing it, so that I can apply it everywhere.

Thank you.

struct DEM {
    pub x: Vec<f32>,
    pub y: Vec<f32>,
    pub fx: Vec<f32>,
    pub fy: Vec<f32>,
}

fn contact_force(d_x: &[f32], d_y: &[f32], d_fx: &mut [f32], d_fy: &mut [f32], s_x: &[f32], s_y: &[f32]) {
    for i in 0..d_x.len() {
        for j in 0..s_x.len() {
            let dx = d_x[i] - s_x[j];
            let dy = d_y[i] - s_y[j];
            d_fx[i] += 1e5 * dx;
            d_fy[i] += 1e5 * dy;
        }
    }
}

fn main() {
    let mut dem = DEM{
        x: vec![0.; 4],
        y: vec![0.; 4],
        fx: vec![0.; 4],
        fy: vec![0.; 4],
    };
    contact_force(&dem.x, &dem.y, &mut dem.fx, &mut dem.fy, &dem.x, &dem.y);
}

It looks like the loop over the i index is readily parallelizable: all of your writes into the vectors are indexed by i, so if you split d_fx and d_fy into one chunk per job, no additional synchronization is needed.

I think just using the regular parallel iterators directly on d_fx and d_fy should work well.


Thank you, Bluss. Do you have any good references for parallel programming in Rust, specifically for this type of thing?

Thank you.

Not much more than the rayon docs; maybe see if there are good introductions to rayon out there.

This video is just an overview of this particular library. https://www.youtube.com/watch?v=gof_OEv71Aw


Something like:

d_fx.par_iter_mut().zip(&mut *d_fy)
    .zip(d_x.par_iter().zip(& *d_y))
    .for_each(|((d_fx_i, d_fy_i), (d_x_i, d_y_i))| {
        // Note, the `_i` parameters are references to what would be `[i]` indexes.
        // You can `.enumerate()` the iterator if you also want that index.

        // Now write your inner loop -- could also use `par_iter()`, but it's probably fine to keep serial.
    });

Hi Cuviper. Thank you very much for the code (very happy to take advice from a rayon developer). I need one final correction to my code.

I added a nearest-neighbour search algorithm, which I need to call before the inner for loop. It looks like:

fn contact_force(
    d_x: &[f32],
    d_y: &[f32],
    d_fx: &mut [f32],
    d_fy: &mut [f32],
    s_x: &[f32],
    s_y: &[f32],
    s_nnps_id: usize,
    nnps: &NNPS,
) {
    for i in 0..d_x.len() {
        // get the neighbours of particle i
        let nbrs = get_neighbours(d_x[i], d_y[i], s_nnps_id, &nnps);
        for &j in nbrs.iter(){
            let dx = d_x[i] - s_x[j];
            let dy = d_y[i] - s_y[j];
            d_fx[i] += 1e5 * dx;
            d_fy[i] += 1e5 * dy;
        }
    }
}

The updated parallel version is:

fn contact_force_par(
    d_x: &[f32],
    d_y: &[f32],
    d_fx: &mut [f32],
    d_fy: &mut [f32],
    s_x: &[f32],
    s_y: &[f32],
) {
    d_fx.par_iter_mut()
        .zip(&mut *d_fy)
        .zip(d_x.par_iter().zip(&*d_y))
        .for_each(|((d_fxi, d_fyi), (d_xi, d_yi))| {
            // Note, the `_i` parameters are references to what would be `[i]` indexes.
            // You can `.enumerate()` the iterator if you also want that index.

            // Now write your inner loop -- could also use `par_iter()`, but
            // it's probably fine to keep serial.
            s_x.iter().zip(s_y.iter()).for_each(|(s_xj, s_yj)| {
                let dx = d_xi - s_xj;
                let dy = d_yi - s_yj;
                *d_fxi += 1e5 * dx;
                *d_fyi += 1e5 * dy;
            });
        });
}

How do I use the neighbour indices in the inner loop?

The full version can be found at source.


How can I add more arrays to the zip iterator? Say I have two additional arrays, d_m and d_r. The following doesn't compile, since d_mi and d_ri are never bound in the closure:

fn contact_force_par(
    d_x: &[f32],
    d_y: &[f32],
    d_m: &[f32],
    d_r: &[f32],
    d_fx: &mut [f32],
    d_fy: &mut [f32],
    s_x: &[f32],
    s_y: &[f32],
) {
    d_fx.par_iter_mut()
        .zip(&mut *d_fy)
        .zip(d_x.par_iter().zip(&*d_y))
        .for_each(|((d_fxi, d_fyi), (d_xi, d_yi))| {
            // Note, the `_i` parameters are references to what would be `[i]` indexes.
            // You can `.enumerate()` the iterator if you also want that index.

            // Now write your inner loop -- could also use `par_iter()`, but
            // it's probably fine to keep serial.
            s_x.iter().zip(s_y.iter()).for_each(|(s_xj, s_yj)| {
                let dx = d_xi - s_xj;
                let dy = d_yi - s_yj;
                *d_fxi += 1e5 * dx * d_mi * d_ri;
                *d_fyi += 1e5 * dy * d_mi * d_ri;
            });
        });
}

You can continue using for &j in nbrs.iter() in the inner loop if you like.

Each zip creates a pair with the inner items, but you can keep nesting that as far as you need. You just have to keep track of where the tuple parentheses go to match your zips, and while that can get awkward, the compiler won't let you get it wrong. :slightly_smiling_face:

If you'd rather, only the mutable parts really need to be explicitly parallel. The immutable parts can already be shared across the threads, so you could just zip the mutable parts and use the index for the rest.

d_fx.par_iter_mut().zip(d_fy).enumerate()
    .for_each(|(i, (d_fxi, d_fyi))| {
        // Use with `d_x[i]` etc.
    });

Side note: the explicit dereferencing I was doing like &mut * isn't necessary when the argument is already a mutable slice.


Hello. I implemented the parallel for loop using neighbouring_indices and did some performance checks; they are here (mobile users: see readme.org). For some reason I don't think I am getting full parallel efficiency.

Watching htop while running long simulations, my program doesn't fully use all the available cores; only half of each core is used. I compared it with another package, pysph, which does use all the cores for the same computational effort.

What am I doing wrong? Can I make my parallel loop any better, or is this all I can get? Or, in this particular simulation, is the work done by each thread too small to keep the cores fully busy?

If this is all I can get, is there a better way of doing it? I think I am losing about 2x speed here.

Thank you.

I profiled your serial example with perf record --call-graph=lbr, and I see:

-   99.97%     3.67%  serial   serial            [.] serial::main
   - 96.29% serial::main
      - 50.49% prestige::contact_search::stash
         + 48.54% <T as alloc::vec::SpecFromElem>::from_elem
         + 0.70% malloc
      - 23.23% serial::contact_force
         + 17.69% prestige::contact_search::get_neighbours
           0.86% __rust_dealloc
        14.18% __rust_dealloc
      + 7.16% _int_free
   + 3.67% _start

So the contact_force you’re parallelizing only accounts for 23% of the runtime. By Amdahl’s law, even if you perfectly optimized/parallelized that part to take no time at all, you’d only see a 1/(1-0.23) ≈ 1.3x overall speedup.
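As a quick sanity check of those numbers, Amdahl's law gives the overall speedup as 1 / ((1 - p) + p/s), where p is the fraction of runtime that benefits and s is the speedup of that fraction:

```rust
/// Overall speedup from Amdahl's law: `p` is the fraction of the
/// runtime being sped up, `s` the speedup factor of that fraction.
fn amdahl(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // Speeding up 23% of the runtime "infinitely" still caps out near 1.3x.
    println!("{:.2}", amdahl(0.23, f64::INFINITY)); // prints 1.30
}
```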

In the 50% stash, most of that time is under <T as alloc::vec::SpecFromElem>::from_elem, which I believe is from the line:

    let mut cells = vec![Cell::new(world.len()); no_x_cells as usize * no_y_cells as usize];

Maybe you would benefit more from reusing and clearing the prior nnps.cells allocation?
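One possible shape for that reuse, as a sketch only: I'm inventing a minimal Cell here since I don't have the real definition from the crate. Clearing and refilling the existing Vec keeps its heap allocation alive across steps, instead of paying for a fresh vec![...] allocation every time:

```rust
// Hypothetical stand-in for the crate's Cell type.
struct Cell {
    indices: Vec<usize>,
}

/// Refill `cells` for the next step, reusing the existing allocation.
fn rebuild_cells(cells: &mut Vec<Cell>, n_cells: usize) {
    // Drop old contents but keep the Vec's capacity.
    cells.clear();
    // resize_with only reallocates if n_cells exceeds the previous peak.
    cells.resize_with(n_cells, || Cell { indices: Vec::new() });
}
```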
