Help with parallelizing a nested loop

Hi all. I am implementing a simple n-body simulator using the discrete element method (DEM). Here is a sample template of it:

Source

Most of my code is very similar to what is in that file. Most of the runtime is spent in the contact_force function, which is an O(n^2) algorithm. I have seen some crates for parallelization but had a little difficulty applying them to nested loops.

Since my software only has this type of structure, I just want to know a single way of parallelizing it, so that I can apply it everywhere.

Thank you.

struct DEM {
    pub x: Vec<f32>,
    pub y: Vec<f32>,
    pub fx: Vec<f32>,
    pub fy: Vec<f32>,
}

fn contact_force(d_x: &[f32], d_y: &[f32], d_fx: &mut [f32], d_fy: &mut [f32], s_x: &[f32], s_y: &[f32]) {
    for i in 0..d_x.len() {
        for j in 0..s_x.len() {
            let dx = d_x[i] - s_x[j];
            let dy = d_y[i] - s_y[j];
            d_fx[i] += 1e5 * dx;
            d_fy[i] += 1e5 * dy;
        }
    }
}

fn main() {
    let mut dem = DEM{
        x: vec![0.; 4],
        y: vec![0.; 4],
        fx: vec![0.; 4],
        fy: vec![0.; 4],
    };
    contact_force(&dem.x, &dem.y, &mut dem.fx, &mut dem.fy, &dem.x, &dem.y);
}

It looks like the loop over the i index is readily parallelizable: all of your writes into the vectors are indexed by i, so if you split d_fx and d_fy into one chunk per job, no additional synchronization is needed.

I think just using the regular parallel iterators directly on d_fx and d_fy should work well.


Thank you, Bluss. Do you have any good references for parallel programming in Rust, specifically for this type of thing?

Thank you.

Not much more than the rayon docs; maybe see if there are good introductions to rayon out there.

This video is just an overview of this particular library. https://www.youtube.com/watch?v=gof_OEv71Aw


Something like:

d_fx.par_iter_mut().zip(&mut *d_fy)
    .zip(d_x.par_iter().zip(& *d_y))
    .for_each(|((d_fx_i, d_fy_i), (d_x_i, d_y_i))| {
        // Note, the `_i` parameters are references to what would be `[i]` indexes.
        // You can `.enumerate()` the iterator if you also want that index.

        // Now write your inner loop -- could also use `par_iter()`, but it's probably fine to keep serial.
    });

Hi Cuviper. Thank you very much for the code (very happy to take advice from a rayon developer). I need one final correction to my code.

I added a nearest-neighbour search algorithm, which I need to call before the inner for loop. It looks like:

fn contact_force(
    d_x: &[f32],
    d_y: &[f32],
    d_fx: &mut [f32],
    d_fy: &mut [f32],
    s_x: &[f32],
    s_y: &[f32],
    s_nnps_id: usize,
    nnps: &NNPS,
) {
    for i in 0..d_x.len() {
        // get the neighbours of particle i
        let nbrs = get_neighbours(d_x[i], d_y[i], s_nnps_id, &nnps);
        for &j in nbrs.iter(){
            let dx = d_x[i] - s_x[j];
            let dy = d_y[i] - s_y[j];
            d_fx[i] += 1e5 * dx;
            d_fy[i] += 1e5 * dy;
        }
    }
}

The updated parallel version is:

fn contact_force_par(
    d_x: &[f32],
    d_y: &[f32],
    d_fx: &mut [f32],
    d_fy: &mut [f32],
    s_x: &[f32],
    s_y: &[f32],
) {
    d_fx.par_iter_mut()
        .zip(&mut *d_fy)
        .zip(d_x.par_iter().zip(&*d_y))
        .for_each(|((d_fxi, d_fyi), (d_xi, d_yi))| {
            // Note, the `_i` parameters are references to what would be `[i]` indexes.
            // You can `.enumerate()` the iterator if you also want that index.

            // Now write your inner loop -- could also use `par_iter()`, but
            // it's probably fine to keep serial.
            s_x.iter().zip(s_y.iter()).for_each(|(s_xj, s_yj)| {
                let dx = d_xi - s_xj;
                let dy = d_yi - s_yj;
                *d_fxi += 1e5 * dx;
                *d_fyi += 1e5 * dy;
            });
        });
}

How do I use the neighbour indices in the inner loop?

The full version can be found at source.


How can I add more arrays to the zip iterator? Say I have two additional arrays, d_m and d_r. The following doesn't compile, since d_mi and d_ri are never bound in the closure:

fn contact_force_par(
    d_x: &[f32],
    d_y: &[f32],
    d_m: &[f32],
    d_r: &[f32],
    d_fx: &mut [f32],
    d_fy: &mut [f32],
    s_x: &[f32],
    s_y: &[f32],
) {
    d_fx.par_iter_mut()
        .zip(&mut *d_fy)
        .zip(d_x.par_iter().zip(&*d_y))
        .for_each(|((d_fxi, d_fyi), (d_xi, d_yi))| {
            // Note, the `_i` parameters are references to what would be `[i]` indexes.
            // You can `.enumerate()` the iterator if you also want that index.

            // Now write your inner loop -- could also use `par_iter()`, but
            // it's probably fine to keep serial.
            s_x.iter().zip(s_y.iter()).for_each(|(s_xj, s_yj)| {
                let dx = d_xi - s_xj;
                let dy = d_yi - s_yj;
                *d_fxi += 1e5 * dx * d_mi * d_ri;
                *d_fyi += 1e5 * dy * d_mi * d_ri;
            });
        });
}

You can continue using for &j in nbrs.iter() in the inner loop if you like.

Each zip creates a pair with the inner items, but you can keep nesting that as far as you need. You just have to keep track of where the tuple parentheses go to match your zips, and while that can get awkward, the compiler won't let you get it wrong. :slightly_smiling_face:

If you'd rather, only the mutable parts really need to be explicitly parallel. The immutable parts can already be shared across the threads, so you could just zip the mutable parts and use the index for the rest.

d_fx.par_iter_mut().zip(d_fy).enumerate()
    .for_each(|(i, (d_fxi, d_fyi))| {
        // Use with `d_x[i]` etc.
    });

Side note: the explicit dereferencing I was doing like &mut * isn't necessary when the argument is already a mutable slice.


Hello. I implemented the parallel for loop using neighbouring_indices and did some performance checks; they are here (mobile users: see readme.org). For some reason I don't think I am getting full parallel efficiency.

Watching htop while running long simulations, my program doesn't fully use all the available cores; only half of each core is used. I compared it with another package, pysph, which does use all the cores for the same computational effort.

What am I doing wrong? Can I make my parallel loop any better, or is this all I can get? Or, in this particular simulation, is the work done by each thread too small to keep the cores fully busy?

If this is all I can get, is there a better way of doing it? I think I am losing about 2x speed here.

Thank you.

I profiled your serial example with perf record --call-graph=lbr, and I see:

-   99.97%     3.67%  serial   serial            [.] serial::main
   - 96.29% serial::main
      - 50.49% prestige::contact_search::stash
         + 48.54% <T as alloc::vec::SpecFromElem>::from_elem
         + 0.70% malloc
      - 23.23% serial::contact_force
         + 17.69% prestige::contact_search::get_neighbours
           0.86% __rust_dealloc
        14.18% __rust_dealloc
      + 7.16% _int_free
   + 3.67% _start

So the contact_force you’re parallelizing only accounts for 23% of the runtime. By Amdahl’s law, even if you perfectly optimized/parallelized that part to take no time at all, you’d only see a 1/(1-0.23) ≈ 1.3x overall speedup.
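As a quick sanity check of those numbers, Amdahl's law gives the overall speedup as 1 / ((1 - p) + p/s), where p is the fraction of runtime that benefits and s is the speedup of that fraction:

```rust
/// Overall speedup from Amdahl's law: `p` is the fraction of the
/// runtime being sped up, `s` the speedup factor of that fraction.
fn amdahl(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // Speeding up 23% of the runtime "infinitely" still caps out near 1.3x.
    println!("{:.2}", amdahl(0.23, f64::INFINITY)); // prints 1.30
}
```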

In the 50% stash, most of that time is under <T as alloc::vec::SpecFromElem>::from_elem, which I believe is from the line:

    let mut cells = vec![Cell::new(world.len()); no_x_cells as usize * no_y_cells as usize];

Maybe you would benefit more from reusing and clearing the prior nnps.cells allocation?
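One possible shape for that reuse, as a sketch only: I'm inventing a minimal Cell here since I don't have the real definition from the crate. Clearing and refilling the existing Vec keeps its heap allocation alive across steps, instead of paying for a fresh vec![...] allocation every time:

```rust
// Hypothetical stand-in for the crate's Cell type.
struct Cell {
    indices: Vec<usize>,
}

/// Refill `cells` for the next step, reusing the existing allocation.
fn rebuild_cells(cells: &mut Vec<Cell>, n_cells: usize) {
    // Drop old contents but keep the Vec's capacity.
    cells.clear();
    // resize_with only reallocates if n_cells exceeds the previous peak.
    cells.resize_with(n_cells, || Cell { indices: Vec::new() });
}
```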
