Speeding up parallel SIMD simulation

Hi folks

I managed to parallelize code that runs a simulation, but there are clones and allocations everywhere. Would any of you mind giving some tips? I'll paste the pieces of code needed to understand the basics.

use std::path::Path;
use std::sync::mpsc::channel;

use threadpool::ThreadPool;
use wide::f64x4;

#[derive(Clone)]
pub struct Particle {
    position: f64x4,
    velocity: f64x4,
    acceleration: [f64x4; 2],
    mass: f64,
}

pub struct Bodies {
    particles: Vec<Particle>,
    pool: ThreadPool,
    accs: Vec<Vec<f64x4>>,
    splits: Vec<(usize, usize)>,
}

/// Make splits that partition the particles into per-thread slices
fn make_splits(n: usize, k: usize) -> Vec<(usize, usize)> {
    let iter = (0..1).chain(
        (0..k)
            .map(|i| f64::ceil((n as f64) * (1. - f64::sqrt((i as f64) / (k as f64)))) as usize)
            .rev(),
    );

    iter.clone().zip(iter.skip(1)).collect()
}

impl Bodies {
    pub fn new(path_name: &str) -> Bodies {
        let file_path = Path::new(path_name);
        let particles: Vec<Particle> = ...;

        let k = num_cpus::get();
        let n = particles.len();
        let splits = make_splits(n, k);
        let accs = vec![vec![f64x4::splat(0.0); n]; k];

        Bodies {
            particles,
            pool: ThreadPool::new(k),
            accs,
            splits,
        }
    }

    pub fn accelerate(&mut self) {
        let (tx, rx) = channel();

        // I wonder if I can get rid of these allocations
        let positions: Vec<f64x4> = self.particles.iter().map(|p| p.position).collect();
        let masses: Vec<f64> = self.particles.iter().map(|p| p.mass).collect();

        for (i, split) in self.splits.iter().enumerate() {
            let tx = tx.clone();
            let split = *split;
            // Do these clones hurt performance?
            let positions = positions.clone();
            let masses = masses.clone();

            self.pool.execute(move || {
                let acc_i = accelerated_chunked(split, &positions, &masses);
                tx.send((i, acc_i)).unwrap();
            });
        }

        let k = self.splits.len();

        // Waiting for the values and replacing them without allocating a new Vec
        for (i, acc_k) in rx.iter().take(k) {
            self.accs[i]
                .iter_mut()
                .zip(acc_k.iter())
                .for_each(|(e, a)| *e = *a);
        }

        // Transposing Vec<Vec<f64x4>> from [K;N] to [N;K] and summing over K for each N
        let accs = &self.accs;
        let accs_par = (0..accs[0].len())
            .map(|i| accs.iter().map(|row| row[i]).reduce(|a, b| a + b));

        // Replacing new value of acc and storing the previous one
        // This is the aggregation step.
        self.particles
            .iter_mut()
            .zip(accs_par)
            .for_each(|(p, a)| {
                if let Some(a) = a {
                    p.acceleration[1] = p.acceleration[0];
                    p.acceleration[0] = a;
                }
            });
    }
}

fn accelerated_chunked(split: (usize, usize), pos: &[f64x4], mass: &[f64]) -> Vec<f64x4> {
    let n = pos.len();
    // Can we manage to do it without allocating more?
    let mut acc = vec![f64x4::splat(0.0); n];

    for i in split.0..split.1 {
        let pos_i = pos[i];
        let m_i = mass[i];

        for j in i + 1..n {
            let mut dr = pos_i - pos[j];
            let rinv3 = (1. / (dr * dr).reduce_add().sqrt()).powi(3);
            dr = dr * rinv3;
            acc[i] -= mass[j] * dr;
            acc[j] += m_i * dr;
        }
    }
    acc
}

I'm thinking of removing the Particle struct and keeping 3 Vec<f64x4> and 1 Vec<f64>, so I can avoid some allocations by cloning an iterator over them. But I don't know if it's a good idea.

Thanks in advance!

Consider using scoped threads to avoid some clones in accelerate. You can then also avoid the outer allocations by having accelerated_chunked take a &[Particle], probably.
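
A minimal sketch of that with std's scoped threads (stable since Rust 1.63; crossbeam::scope works the same way on older compilers), assuming the types from the first post and an accelerated_chunked variant that takes &[Particle]:

use std::thread;

// Sketch only: workers borrow `particles` directly, so the positions/masses
// clones go away, and results land straight in `accs` without the channel.
fn accelerate_scoped(
    particles: &[Particle],
    splits: &[(usize, usize)],
    accs: &mut [Vec<f64x4>],
) {
    thread::scope(|s| {
        for (split, acc) in splits.iter().zip(accs.iter_mut()) {
            s.spawn(move || {
                // assumed variant of accelerated_chunked taking &[Particle]
                *acc = accelerated_chunked(*split, particles);
            });
        }
    });
}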

1 Like

Not an expert, but I think the common advice is "don't use SIMD to speed up a single operation, use SIMD to do many operations at once", and the starting point there is usually an SoA data layout. That is, rather than using f64x4 for position, unzip all positions into x: Vec<f64>, y: Vec<f64>, z: Vec<f64>.

This post might come in useful: Optimising path tracing with SIMD · shifting bits
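
A rough sketch of that layout (the names are illustrative, not from the linked post):

// Structure-of-arrays: one flat Vec per coordinate, so a loop over particles
// walks contiguous memory and each SIMD op covers several *different*
// particles instead of one particle's x/y/z.
struct ParticlesSoA {
    x: Vec<f64>,
    y: Vec<f64>,
    z: Vec<f64>,
    vx: Vec<f64>,
    vy: Vec<f64>,
    vz: Vec<f64>,
}

impl ParticlesSoA {
    fn advance_positions(&mut self, dt: f64) {
        // each of these loops is trivially auto-vectorizable
        for (x, vx) in self.x.iter_mut().zip(&self.vx) {
            *x += vx * dt;
        }
        for (y, vy) in self.y.iter_mut().zip(&self.vy) {
            *y += vy * dt;
        }
        for (z, vz) in self.z.iter_mut().zip(&self.vz) {
            *z += vz * dt;
        }
    }
}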

3 Likes

Another thing to try is to remove all the bounds checks in the loops. You can check the bounds once before the loop, so the compiler won't check them at each iteration (I see the problem for pos, mass and acc in accelerated_chunked). When I did this in my simulation, it became 2× faster. (It also enables vectorisation in some cases, but as said in the previous comment, with SIMD you have to vectorize yourself, or else SIMD will not help at all.)
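
Concretely, something like this (my sketch of that suggestion; the loops are as in the first post):

use wide::f64x4;

// With the assert hoisted above the loops, every pos[_], mass[_] and acc[_]
// index below is provably in bounds (acc.len() == n is already known because
// acc is allocated locally), so the per-iteration checks can be elided.
fn accelerated_chunked(split: (usize, usize), pos: &[f64x4], mass: &[f64]) -> Vec<f64x4> {
    let n = pos.len();
    let mut acc = vec![f64x4::splat(0.0); n];
    assert!(split.1 <= n && mass.len() == n);

    for i in split.0..split.1 {
        let pos_i = pos[i];
        let m_i = mass[i];
        for j in i + 1..n {
            let mut dr = pos_i - pos[j];
            let rinv3 = (1. / (dr * dr).reduce_add().sqrt()).powi(3);
            dr = dr * rinv3;
            acc[i] -= mass[j] * dr;
            acc[j] += m_i * dr;
        }
    }
    acc
}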

You can also use f32 instead of f64 if that's enough precision for you; that way it will be up to 2× faster.

Here is my simulation in case it helps, but the code is much dirtier and unsafer than yours.

1 Like

Generally speaking, how would I avoid bounds checking?

Before all the fancy things, you need to improve the basic code. For example, you don't need to generate the splits up front; you could do that with a lazy iterator, as sketched below.
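
A sketch of what that could look like (it yields the same (start, end) pairs as make_splits, computed on the fly):

// Lazily produce the split boundaries instead of collecting them into a Vec.
fn split_bounds(n: usize, k: usize) -> impl Iterator<Item = (usize, usize)> + Clone {
    // boundary j runs from 0 (at j = 0) up to n (at j = k)
    let bound = move |j: usize| {
        ((n as f64) * (1.0 - (((k - j) as f64) / (k as f64)).sqrt())).ceil() as usize
    };
    (0..k).map(move |j| (bound(j), bound(j + 1)))
}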

this is also redundant:

 let positions: Vec<f64x4> = self.particles.iter().map(|p| p.position).collect();
 let masses: Vec<f64> = self.particles.iter().map(|p| p.mass).collect();

Could you provide working code without threads or SIMD?

1 Like

Of course, I have it here.

EDIT: Ignore the reduce_add calls; I copy/pasted them wrongly from other pieces of code and have already removed them in my local repo.

If you use my_slice[i] when iterating over i in 0..my_length, you can add this before the loop:

assert!(my_length <= my_slice.len());

The compiler should then understand it's safe to use my_slice[i].

In general, for everything that may fail (slice indexing, arithmetic ops), find the redundant assertions and try replacing them with fewer, broader ones.

// This is redundant:
my_slice[0]; // here we assert 0 < my_slice.len()
my_slice[1]; // here we assert 1 < my_slice.len()

// This should be fine:
my_slice[1]; // here we assert 1 < my_slice.len()
my_slice[0]; // nothing to assert, since we know 0 <= 1 < my_slice.len()
// which should be equivalent to:
assert!(1 < my_slice.len());
my_slice[0];
my_slice[1];

If the indices are something like n and n+1 instead of 0 and 1, then you need one more assertion (like n < usize::MAX), because n < n+1 is not true when n+1 overflows.
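
For example (my illustration, not from the linked code):

// n < usize::MAX guarantees n + 1 can't overflow, so this single assert
// covers both indexings and lets the compiler drop both bounds checks.
assert!(n < usize::MAX && n + 1 < my_slice.len());
my_slice[n];
my_slice[n + 1];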

(However, for some reason I don't understand, it doesn't always work. Compiler Explorer can be useful for checking.)

1 Like

Wow, I thought putting assert!(my_length <= my_slice.len()); would degrade performance a bit. That's great to know, thanks!

you should do these two methods in one pass:

    pub fn advance_velocities(&mut self, dt: f64) {
        self.particles
            .iter_mut()
            .for_each(|p| p.velocity += 0.5 * dt * (p.acceleration[0] + p.acceleration[1]));
    }

    pub fn advance_positions(&mut self, dt: f64) {
        self.particles
            .iter_mut()
            .for_each(|p| p.position += dt * p.velocity + 0.5 * dt.powi(2) * p.acceleration[0]);
    }

like so:

    pub fn advance_positions_and_velocities(&mut self, dt: f64) {
        self.particles.iter_mut().for_each(|p| {
            p.velocity += 0.5 * dt * (p.acceleration[0] + p.acceleration[1]);
            p.position += dt * p.velocity + 0.5 * dt * dt * p.acceleration[0];
        });
    }

and the parse function could be written:

use std::str::FromStr;

pub fn parse_row(line: &str) -> Particle {
    let row_vec: Vec<f64> = line
        .split_whitespace()
        .skip(1)
        .filter_map(|s| f64::from_str(s).ok())
        .collect();

    Particle {
        position: Vec3D(row_vec[1], row_vec[2], row_vec[3]),
        velocity: Vec3D(row_vec[4], row_vec[5], row_vec[6]),
        acceleration: Default::default(),
        mass: row_vec[0],
    }
}
1 Like

Actually, I have to update the positions first, then the accelerations, and then the velocities, so these two cannot be updated in the same pass.

Hi,

My understanding is the same as what @matklad wrote: using a single f64x4 to represent a velocity vector does not offer SIMD benefits, because Vec<Particle> has no SIMD-friendly memory layout.

What I would do is lay out the velocities of all particles in a single memory region, so that the operation pos += vel * dt can be vectorized over all particles (or split into chunks for multi-threading, but it's the same idea).

I.e. something like

let mut positions = vec![0.0f64; n_particles * 3];
let mut velocities = vec![1.0f64; n_particles * 3];
let acceleration = vec![1.0f64; n_particles * 3];

// Euler method, but you get the point
velocities.iter_mut().zip(acceleration.iter()).for_each(|(v, a)| *v += a * dt);
positions.iter_mut().zip(velocities.iter()).for_each(|(p, v)| *p += v * dt);

I think the culprit in this code is the Particle struct, which destroys the memory layout that lets SIMD work across particles.

The other optimization is to not allocate a Vec on every iteration, but to keep one around for the intermediary terms and pass it to the advance functions as &mut [f64].

I.e. I would write the advance as something like

fn advance_euler(positions: &mut [f64], velocities: &mut [f64], acceleration: &[f64]);
fn update_acceleration(positions: &[f64], velocities: &[f64], acceleration: &mut [f64]);

(note where the mut are).

For multi-threading, split these vectors on a particles-per-core basis and apply the same principle. Just be mindful that update_acceleration needs some thread synchronization.
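
For the position/velocity updates that could look like this (a sketch using std's scoped threads; each worker gets its own contiguous chunk of the flat arrays, so no synchronization is needed there):

use std::thread;

// Advance positions in parallel, one contiguous chunk of the flat
// [x1, y1, z1, x2, ...] arrays per worker.
fn advance_positions_parallel(positions: &mut [f64], velocities: &[f64], dt: f64, n_threads: usize) {
    let chunk_len = ((positions.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        for (ps, vs) in positions
            .chunks_mut(chunk_len)
            .zip(velocities.chunks(chunk_len))
        {
            s.spawn(move || {
                for (p, v) in ps.iter_mut().zip(vs) {
                    *p += v * dt;
                }
            });
        }
    });
}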

1 Like

Thank you for your thorough response!

Doing a vec![default; n_particles * 3] would imply using chunks and eventually iterating over them in update_acceleration. Would the bounds checks be optimized away, or would the compiler still perform them?

In Arrow we have been using this, e.g. here, and basic operations like + are optimized without additional effort (i.e. without using chunks_exact). But if in doubt, check the compiled code: e.g. Compiler Explorer shows it compiling to XMM instructions, which as far as I know are SSE (a SIMD extension).

There are probably further gains via chunks_exact + packed_simd_2, but the overall idea remains: align the memory so that a loop over it can be mapped to a loop over SIMD instructions.

1 Like

I think I am missing something, because I can't simply iterate over them in the case of update_acceleration:

let mut particles = self.particles.as_mut_slice();

while let Some((p1, rest)) = particles.split_first_mut() {
    particles = rest;

    for p2 in particles.iter_mut() {
        let vector = &p1.position - &p2.position; // vector is f64x4/Vec3
        let distance = vector.sum_squares().sqrt(); // distance is f64
        let distance_cube = distance.powi(3);

        p1.acceleration[0] -= (p2.mass / distance_cube) * vector; // acceleration is f64x4/Vec3
        p2.acceleration[0] += (p1.mass / distance_cube) * vector;
    }
}

There's a mix of scalar and vector operations. Looking at the docs, the function that would be perfect here is as_chunks_mut, but it is unstable as of now.

That code continues to use the particles as structs, which do not offer the memory layout necessary for SIMD when updating positions and velocities. I think you may be trying to impose an object-oriented model on the problem.

Conceptually they are particles, and objects are good to reason about; but numerically (in how they are represented in memory), it is usually more performant to lay them out for vectorised operations.

If I understand correctly, this is the N-body problem. This is how I would do it:

fn update_acc(positions: &[f64], velocities: &[f64], m: &[f64], acceleration: &mut [f64]) {
    let grav_constant: f64 = 6.67408e-11;

    let pos = positions.chunks_exact(3);
    let vel = velocities.chunks_exact(3);
    // a key point here is that accelerations do not depend on each other; thus, Rust works
    // because we only need to borrow mutable (afaik this invariant holds in classical mechanics)
    let acc = acceleration.chunks_exact_mut(3);

    // loop over particle `i`
    acc.zip(vel)
        .zip(pos)
        .enumerate()
        .for_each(|(i, ((a, _), pos))| {
            // loop over particle `j > i`
            let pos_others = positions.chunks_exact(3).skip(1 + i);
            let vel_others = velocities.chunks_exact(3).skip(1 + i);
            let mass_others = m.iter().skip(1 + i);

            let new_acc = vel_others.zip(pos_others).zip(mass_others).fold(
                [0.0f64; 3],
                |mut acc, ((_, other_pos), other_mass)| {
                    let r0 = other_pos[0] - pos[0];
                    let r1 = other_pos[1] - pos[1];
                    let r2 = other_pos[2] - pos[2];
                    let distance_3_over_2 = (r0.powi(2) + r1.powi(2) + r2.powi(2)).sqrt().powi(3);
                    let constant = other_mass * grav_constant / distance_3_over_2;
                    acc[0] -= r0 * constant;
                    acc[1] -= r1 * constant;
                    acc[2] -= r2 * constant;
                    acc
                },
            );
            // update acc of particle i
            a[0] = new_acc[0];
            a[1] = new_acc[1];
            a[2] = new_acc[2];
        })
}

fn main() {
    let n = 2;

    let mut p = vec![0.0f64; 3 * n];
    let v = vec![0.0f64; 3 * n];
    let mut a = vec![0.0f64; 3 * n];
    let mut mass = vec![1.0f64; n];

    // particle 0 is a ~1 kg test mass at the origin; particle 1 is the earth,
    // placed one earth-radius away along the x axis
    p[3] = 6_371_000.0f64; // ~the earth's radius
    mass[1] = 5.972e24; // ~the earth's mass

    update_acc(&p, &v, &mass, &mut a);

    // ~9.8 m/s² at the earth's surface is about right
    assert!((a[0] - (-9.8)).abs() < 0.1);
}

We could use an f64x3-style type instead of the indices above; still, note that there are no bounds checks on these indices, because either chunks_exact proves the bounds or the indexing is into a fixed-size stack array.

The relevant part in all of this is not the formula to update the acc, but the fact that acc, vel and pos are all laid out as [x1, y1, z1, x2, y2, z2, ...], which makes it very easy to apply SIMD.

I left the iterator over the velocities to keep it general, but in energy-conserving systems like the N-body problem the acceleration does not depend on them, so we can ignore that part here.

1 Like

That's definitely far beyond what I was imagining! Thank you so much! I will put some thinking into this. Once more, thanks!

Hmm, I don't think they are doing the same thing: Rust playground.

The acc_me version updates two particles at a time. Perhaps you were thinking of a slightly different problem; this one is called NBabel.

Generally what you need is to make it as clear as possible to LLVM that all the slices are the same length, and then it'll remove the bounds checks.

Here's a quick example:
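
It goes along these lines (a representative sketch, not the exact linked code):

// Assert the length relationship once; LLVM then drops the per-iteration
// bounds checks and can auto-vectorize the loop.
pub fn axpy(y: &mut [f64], x: &[f64], a: f64) {
    assert_eq!(x.len(), y.len());
    for i in 0..y.len() {
        y[i] += a * x[i];
    }
}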

Conveniently asserting preconditions before a loop is helpful both for humans and the optimizer.

Here's my go-to example of that, replacing three bounds checks with a single length check:
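
Roughly this shape (again a representative sketch, not the exact linked code):

// One length check up front replaces three bounds checks per iteration.
pub fn sum_products(a: &[f64], b: &[f64], c: &[f64], n: usize) -> f64 {
    assert!(n <= a.len() && n <= b.len() && n <= c.len());
    let mut total = 0.0;
    for i in 0..n {
        total += a[i] * b[i] * c[i];
    }
    total
}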

2 Likes

Yeah, I saw a big change in the asm there! Good to know about this. Thank you!