Optimization of Algorithm to Fill Empty Space

Hello!

GOAL: Code review of single core implementation of algorithm currently running 100 time steps in 1 s. Goal is to if possible reach 100 time steps in 0.2 s.

I am a pretty new user, started about 3 days ago and would very much like a Code Review of the first thing I have made in Rust. It is an algorithm based on a paper found here:

https://www.researchgate.net/publication/256688486_Particle_packing_algorithm_for_SPH_schemes

The specific algorithm it self is not interesting, just wanted to leave the reference for those with a deeper interest. The output of the algoritm will be a filled square as such:

Initially the particles are distributed as.. (was only allowed 1 picture in the post, but imagine uniform circular distribution of particles inside the square)

So basically the algorithm fills out empty space.

I have uploaded my code: GitHub - AhmedSalih3d/particlepacking-rust at initial

Please clone from the initial branch since it holds an unchanged version. The code consists of roughly five files:

  • algorithm.rs - The most important one in this case, it implements the algorithm I want to increase performance of
  • dataio.rs - Simple way to read the data points in "RustCircleUniformGridSQUARE.txt"
  • main.rs - Runs the algorithm
  • plotting.rs - Simple way to plot the result as svg, which is saved as "Points.svg". Can be visualized using Chrome etc.
  • point.rs - Implementation of a simple "Point" structure, consisting 3 f32. Overloading of operators for easier math and println! functionality performed in there.

Now I will show the most relevant parts of the project i.e. the parts I want specifically to be optimized.

main.rs
In here I initialize four vectors of my custom struct "Point". The data is based on the text file mentioned in FILENAME. The reason I have to have two of each i.e. by cloning, will be shown in the next code snippet. For easing readability I pass these four vectors by reference to a struct. NCUT is used to make it so that the particles forming the square are not computed (reference first picture). I then use a timing functionality to time the performance of my loop. THIS is the main part I want to optimize.

fn main() {
    const FILENAME:&str = "RustCircleUniformGridSQUARE.txt";
    let mut pini = dataio::readpoints(FILENAME);
    let mut uini = genvelocity(&pini);
    let mut ptmp = pini.clone();
    let mut utmp = uini.clone();

    let mut data = point::Particles { pvec: &mut pini, 
                                      uvec: &mut uini,  
                                      ptmp: &mut ptmp,
                                      utmp: &mut utmp
                                };

    const NCUT: usize = 800;
    let now = Instant::now();
    for _ in 0..100 {
        algorithm::packstep_s(&mut data, NCUT);
    }
    println!("{}", now.elapsed().as_millis());

    const NAME:&str = "Points.svg";
    // How to transfer ownership correctly..
    plotting::plot_points(&pini, NAME);
}

algorithm::packstep_s(&mut data, NCUT);
This function takes the "data" struct which holds references to all the vectors and ncut, with the purpose on running the algorithm on every particles from index 0 to n - ncut. When a calculation for all particles are done, it finally clones the tmp results into the correct vector. This is to ensure that all calculations are performed at same time step of data. The function packstep_single(data,i) is shown in next snippet below this one.

pub fn packstep_s(data: &mut point::Particles<'_>,
                  ncut: usize) {
    let n = data.pvec.len();

    for i in 0..(n - ncut) {
        packstep_single(data,i);
    }

    data.pvec[..(n - ncut)].clone_from_slice(&data.ptmp[..(n - ncut)]);
    data.uvec[..(n - ncut)].clone_from_slice(&data.utmp[..(n - ncut)]);
}

packstep_single(data: &mut point::Particles<'_>, iter: usize)
This is the algorithm where the heavy calculation happens - and this is my primary focus for optimization. I have defined a constant struct "AC", which holds constants to ease implementation and then the algorithm is carried out as shown below.

fn packstep_single(data: &mut point::Particles<'_>, iter: usize){
    let mut wgx = 0.0;
    let mut wgy = 0.0;
    let mut wgz = 0.0;
    let p_i = data.pvec[iter];

    for (j, p_j) in data.pvec.iter().enumerate() {
        if j != iter {
            let rij = p_i - *p_j;
            let mut rij2 = (rij * rij).sum();
            if rij2 <= AC::H2 {
                rij2 = rij2.sqrt();
                let rij1 = 1.0 / (rij2);
                let q = rij2 * AC::H1;
                let q1 = q - 2.0;
                let qq3 = q * q1 * q1 * q1;
                let wq = AC::AD * AC::FAC * qq3;

                wgx += wq * (rij.0 * rij1) * AC::H1;
                wgy += wq * (rij.1 * rij1) * AC::H1;
                wgz += wq * (rij.2 * rij1) * AC::H1;
            }
        }
    }

    let u_i = data.uvec[iter];

    let dux = (-AC::BETA * wgx * AC::V - AC::ZETA * u_i.0) * AC::DT;
    let duy = (-AC::BETA * wgy * AC::V - AC::ZETA * u_i.1) * AC::DT;
    let duz = (-AC::BETA * wgz * AC::V - AC::ZETA * u_i.2) * AC::DT;

    data.utmp[iter] = u_i + point::Point(dux, duy, duz);
    data.ptmp[iter] = p_i + point::Point(dux * AC::DT, duy, duz * AC::DT);
}

So this was a fast overview of the most important functionality, now my hope is that I have intrigued some of you enough that you would like to give it a look. I basically compile it using:

RUSTFLAGS="-Ctarget-cpu=native" cargo run --release

Which when the program is run gives a performance of ~1 second, ie. 1000 ms which is printed out in the terminal. Since I run 100 time steps, this gives me a "FPS" ratio of 100. I would like this to be around 500 and I am sure some more advanced Rust users can help me achieve this.

What Have I done Already

  1. I have run cargo clippy and implemented all suggestions.
  2. Spent time to implement it to the best of my ability. I have already implemented it in Matlab, Python, C++, Julia - especially Julia shows promising results but Rust is also very good already with my naive implementation.
  3. Implemented a basic Cargo.toml based on default suggestions for release

What can be done

Other than improving single core performance it is possible to implement for example threading. This would be done by calling packstep_single on different threads using different values of i, in other words particle index. I was not succesful in doing this since I met a lot of problems with shared reference in data struct etc. An other approach would be async, since the order of calculation does not matter in my setup.

Other ideas would be to implement neighbour search algorithms so I only have to look at few particles at a time - if you know any such libraries in Rust do please share.

Finally one could turn to the GPU, but I think that it is a bit too advanced for now. Currently I am primarily interested in ensuring that I have pushed single core performance to the max and then implement threading - would love to have some help in regards to that.

Kind regards, thanks for reading if you made it all the way through

For threading, what libraries in rust did you try to use? If you've just used std, I highly recommend trying out rayon.

It adds some abstractions which make it pretty easy to work with shared data structures, like par_iter(). I don't know how much of a speedup it would bring, it can't hurt to try.

The biggest difference in using rayon vs. a regular for loop is that you can't modify the shared variables, like wgx/y/z. Since it's a simple sum of the different iterations, though, I think you could do a functional style map then reduce to sum them. Something like this:

use rayon::prelude::*;

let (wgx, wgy, wgz) = data
    .pvec
    .par_iter()
    .enumerate()
    .map(|(j, p_j)| {
        if j != iter {
            // ... do calculation ...
            // return values to sum from closure
            return (
              /* wgx computation */,
              /* wgy computation */,
              /* wgy computation */,
            );
        }
        // if we don't do the calculation, just contribute '0' to the sum
        (0.0, 0.0, 0.0)
    })
    .reduce(|| (0.0, 0.0, 0.0), |a, b| (a.0 + b.0, a.1 + b.1, a.2 + b.2));

Let me know if that gives a reasonable speedup? If other people find other things, they could probably be combined as well.

Also worth noting that this will change the order in which the f32 values are summed up, so that could change the result in small ways because of non-associativity. If you need to add them in a particular order, then there should be something that can be used instead of reduce to get the results in the right order before summing them one by one.

1 Like

EDIT: I was being a dummy got it to work, with following code statement:

fn packstep_single(data: &mut point::Particles<'_>, iter: usize){
    
    let p_i = data.pvec[iter];

    let (wgx, wgy, wgz) = data
    .pvec
    .par_iter()
    .enumerate()
    .map(|(j, p_j)| {
        let rij = p_i - *p_j;
        let mut rij2 = (rij * rij).sum();
        if j != iter {
            if rij2 <= AC::H2 {
                rij2 = rij2.sqrt();
                let rij1 = 1.0 / (rij2);
                let q = rij2 * AC::H1;
                let q1 = q - 2.0;
                let qq3 = q * q1 * q1 * q1;
                let wq = AC::AD * AC::FAC * qq3;
                return
                (wq * (rij.0 * rij1) * AC::H1,wq * (rij.1 * rij1) * AC::H1,wq * (rij.2 * rij1) * AC::H1)
            }
        }
        // if we don't do the calculation, just contribute '0' to the sum
        (0.0, 0.0, 0.0)
    })
    .reduce(|| (0.0, 0.0, 0.0), |a, b| (a.0 + b.0, a.1 + b.1, a.2 + b.2));


    let u_i = data.uvec[iter];

    let dux = (-AC::BETA * wgx * AC::V - AC::ZETA * u_i.0) * AC::DT;
    let duy = (-AC::BETA * wgy * AC::V - AC::ZETA * u_i.1) * AC::DT;
    let duz = (-AC::BETA * wgz * AC::V - AC::ZETA * u_i.2) * AC::DT;

    data.utmp[iter] = u_i + point::Point(dux, duy, duz);
    data.ptmp[iter] = p_i + point::Point(dux * AC::DT, duy, duz * AC::DT);
}

Unfortunately the performance is not very good currently since it takes 6000 ms.

--OLD MESSAGE--
Hello @daboross!

Thank you very much for your interest and example code. I have tried to implement it:

let p_i = data.pvec[iter];

    let (wgx, wgy, wgz) = data
    .pvec
    .par_iter()
    .enumerate()
    .map(|(j, p_j)| {
        if j != iter {
            let rij = p_i - *p_j;
            let mut rij2 = (rij * rij).sum();
            if rij2 <= AC::H2 {
                rij2 = rij2.sqrt();
                let rij1 = 1.0 / (rij2);
                let q = rij2 * AC::H1;
                let q1 = q - 2.0;
                let qq3 = q * q1 * q1 * q1;
                let wq = AC::AD * AC::FAC * qq3;

                wgx = 2.0;
                wgy = 2.0;
                wgz = 2.0;
            }
            return
            (wgx,wgy,wgz)
            ;
        }
        // if we don't do the calculation, just contribute '0' to the sum
        (0.0, 0.0, 0.0)
    })
    .reduce(|| (0.0, 0.0, 0.0), |a, b| (a.0 + b.0, a.1 + b.1, a.2 + b.2));

But I get the following error:

error[E0425]: cannot find value `wgz` in this scope
  --> src/algorithm.rs:55:21
   |
55 |                     wgz += wq * (rij.2 * rij1) * AC::H1;
   |                     ^^^ not found in this scope

error: aborting due to 3 previous errors

For more information about this error, try `rustc --explain E0425`.
error: could not compile `a`.

And the same for all wgx/y/z. Am a bit new to Rayon so I struggle a bit with what to change to fix it. It is a scoping problem

Kind regards

1 Like

Not a performance-related point but I think your Particles<'a> struct is unidiomatic as you are introducing double indirection (&mut Particles<'a>) to use it. I would make it an owned struct and rewrite main as follows:

// point.rs
pub struct Particles {
    pub pvec: Vec<Point>,
    pub uvec: Vec<Point>,
    pub ptmp: Vec<Point>,
    pub utmp: Vec<Point>
}

// main.rs
    let points = dataio::readpoints(FILENAME);
    let velocities = genvelocity(&points);

    let mut data = point::Particles { pvec: points.clone(), 
                                      uvec: velocities.clone(),
                                      ptmp: points
                                      utmp: velocities
                                };
...

    plotting::plot_points(&data.pvec, NAME);

I'll investigate more.

1 Like

You can experiment with different values for RAYON_NUM_THREADS environment variable to see if that affects performance. I'd start with comparing the single-threaded implementation, and threaded implementation with 1 and 2 threads

I've been thinking about the algorithm more, and have a possible suggestion.

It looks like in every iteration inside packstep_single, the computation you're doing is 100% determined by the value in data.pvec. But this never changes iteration-to-iteration.

What if you did all that math once, stored it in something like a Vec<(f32, f32, f32)>, then did those last computations n times rather than duplicating the whole calculation only to exclude a single item?

Going even further, I think you could just sum them up with all the items once, and then for each packstep_single, simply subtract the calculated wgx,y,z for data.pvec[j] and do those final computations.

Thoughts?

Hello @pcpthm!

Thank you for your suggestion I really like it. I ended up doing as such:

let pini = dataio::readpoints(FILENAME);
let uini = genvelocity(&pini);
let ptmp = pini.clone();
let utmp = uini.clone();

let mut data = point::Particles { pvec: pini, 
                                  uvec: uini,  
                                  ptmp: ptmp,
                                  utmp: utmp
                            };

Where I transformed the struct as you suggested - made it a lot more readable and now I cannot use "pinI" anymore when doing the plot which I like since it is properly owned by the struct "data" now.

Kind regards

1 Like

Hello!

I used this command in main:

rayon::ThreadPoolBuilder::new().num_threads(8).build_global().unwrap();

And unfortunately it didn't affect performance in any good way - using less threads would make result better. This indicates that the calculation are "too fast" to benefit from threading right now is my assumption. I tested on a data set with 50000 points, compared to 3500 in original post, and still did not make a difference.

Kind regards

Hmm, I have tried to stirr away from arrays and use for loops since I have found the code easier to maintain, but I guess it is worth a try.

Basically you are right that for every call in packstep_single(data,iter), then I would look at particle, p[iter], and compute how it behaves when compared with all other particles, in pvec.

I guess one could then as you suggest instead of using a loop, use principles of linear algebra to say this rij = p[iter] - pvec (tuple), say, rij2 = (rij*rij).sum() (r^2 - f32) set every value which does not survive, if rij2 <= H2, to zero, then take sqrt over every element, multiply with AC::H1 to get q and then do the rest..

The reason I have avoided arrays/vectors as the "plague" is due to my bad experience in Julia and Matlab. I would love to see an example of how to use arrays/vectors/slices in Rust for this case, but I think the performance would be worse and I would have to apply "map" in some smart way to get the same performance. Would love to be proven wrong though.

Kind regards

I think this is probably a good idea.

I tried to implement my suggestion, and found that I had misread the code, and it was not as simple as I thought! Sorry for making a bad recommendation.

Thanks for the well thought out response, though!

I now understand your code.

The performance critical loop is

    for (j, p_j) in data.pvec.iter().enumerate() {
        if j != iter { // (1)
            let rij = p_i - *p_j;
            let mut rij2 = (rij * rij).sum();
            if rij2 <= AC::H2 { ...  } // (2)
       }
    }

I counted how much proportion the if (2) taken. And the result is the if is not taken order of magnitude often. Thus, the strategy of optimizing inside the if block is an order of magnitude meaningless and rather we want to reduce discarded loop as much as possible.

I recognize this problem as Nearest neighbor search (NNS) (in the 3D euclidean metrics). In particular, your points are static within a packstep_s call and thus we can build a static data structure of fast "find all points within distance r of the query point". Furthermore, the problem is more restricted as all radii rs are the same AC::H2 and also all your query points are given in advance (this is called an offline problem) though I don't know the offline NNS can be solved faster than online NNS.

The NNS problem can be solved by various algorithms. Space partitioning methods such as k-d-tree is one of the most famous ones. I'm pretty sure there are existing implementation crates in Rust you can use. I found kdtree, fux_kdtree. You can research more (edit: kdtree crate is not fast as only ~1.3x spedup. fux_kdtree crate doesn't provide "get all close points" API).

I implemented a simple pruning method and I can already see 5x speed up if the exact floating-point sum order is not preserved or 3x speed up to output the same result https://gist.github.com/pcpthm/fa72d57d33e3691c8a81aba9542eb400.

No problem! Thank you for spending time and trying to come with suggestions.

Kind regards

I see, so basically there is not much more to do to optimize other than using NB search etc.

Could you hint me to where to read a bit more about your simple pruning method? In the end I will probably go with one of the packages, would just like to learn. Thanks for helping me out with a functional example.

I saw 70% performance increase using your algorithm.rs.

Kind regards

Sorry, the pruning method (sort the points by a coordinate (or a random direction) and upper bound the distance by 1D distance) should be well-known to the computational geometry people but I cannot find the reference to the method. Maybe it can be called just "the divide and conquer algorithm of the closest pair problem minus divide and conquer".

Also I couldn't find a good Rust NNS crate but someone else could answer this question. Good luck.

I see, thank you very much for your help. Will have to dive deeper into it.

Kind regards

It turns out the problem (offline, fixed radius neighbor search) can exactly be described as unit disk graph construction. According to #Computational_complexity, it is possible to solve the problem in linear time (as a function of the output size). Could the algorithm described here (grid hashing based) faster in practice? It needs an investigation.

1 Like

I like the idea, but I think this algorithm will turn out to be quite slow in the end, due to having calculating the intersections between circles when finding neighbours. What I have trying to implement in the past is based on this small guide:

http://cacs.usc.edu/education/cs596/01-1LinkedListCell.pdf

The basic idea is as follows:

Split the domain into different cells and figure out which cell each particle lives in. The construct a header, which holds the "last particle". So for example going into header using index 3, i .e. head[3] = 7, then we know we have to go into the next list, lscl at index[7] = 3, then we continue and go to index[3] = 1, and finally we stop since index[1] = E.

I have had a few problems implementing this in the past:

  1. Doesn't seem to work for arbitrary coordinate systems, i.e. it has to be in the positive quadrant, requires a rescaling of coordinates
  2. If a particle is directly at the edge it seems to break the algorithm (user experience)

Then the basic idea is to look at 9 cells at time like this:


https://www.computer.org/csdl/journal/td/2014/01/ttd2014010043/13rRUxYrbLS

Kind regards

What I have trying to implement in the past is based on this small guide:

http://cacs.usc.edu/education/cs596/01-1LinkedListCell.pdf

This is basically the same algorithm as the algorithm I linked to Wikipedia (the idea is grid bucketing) with some implementation techniques.

but I think this algorithm will turn out to be quite slow in the end, due to having calculating the intersections between circles when finding neighbours

This is not true as the algorithm is the same. Checking if there is an intersection between two disks of radius r is the same as checking two center points is closer than distance r. "Disk intersection" is just another way to say the "close neighbor".

Doesn't seem to work for arbitrary coordinate systems, i.e. it has to be in the positive quadrant, requires a rescaling of coordinates

Yes, you have to translate the coordinate (something like ((y - y_min) * (x_max + 1 - x_min) + (x - x_min))).

If a particle is directly at the edge it seems to break the algorithm (user experience)

If you have the grid size the same as your radius and you need an inclusive range then yes that is an edge case. A solution is translating the points a small amount to make no point have exactly on the edge. Or use a slightly larger grid size.

Since I needed it for a similar problem, I implemented the Linked List Cell neighbor search algorithm in Rust. I just finished it, and I posted my implementation in the code review section of the forum. Feel free to check it out :slight_smile:

1 Like

I see, thank you very much for the explanation regarding the algorithms. And good to get confirmation that it has in fact to be rescaled.

Thanks.

Kind regards

1 Like