Optimization of Algorithm to Fill Empty Space

EDIT: I was being a dummy; I got it to work with the following code:

fn packstep_single(data: &mut point::Particles<'_>, iter: usize) {
    let p_i = data.pvec[iter];

    // Sum the kernel-gradient contributions from all other particles in parallel.
    let (wgx, wgy, wgz) = data
        .pvec
        .par_iter()
        .enumerate()
        .map(|(j, p_j)| {
            let rij = p_i - *p_j;
            let mut rij2 = (rij * rij).sum();
            if j != iter && rij2 <= AC::H2 {
                rij2 = rij2.sqrt();
                let rij1 = 1.0 / rij2;
                let q = rij2 * AC::H1;
                let q1 = q - 2.0;
                let qq3 = q * q1 * q1 * q1;
                let wq = AC::AD * AC::FAC * qq3;
                return (
                    wq * (rij.0 * rij1) * AC::H1,
                    wq * (rij.1 * rij1) * AC::H1,
                    wq * (rij.2 * rij1) * AC::H1,
                );
            }
            // If we don't do the calculation, just contribute '0' to the sum.
            (0.0, 0.0, 0.0)
        })
        .reduce(|| (0.0, 0.0, 0.0), |a, b| (a.0 + b.0, a.1 + b.1, a.2 + b.2));

    let u_i = data.uvec[iter];

    // Velocity change for this particle over one time step.
    let dux = (-AC::BETA * wgx * AC::V - AC::ZETA * u_i.0) * AC::DT;
    let duy = (-AC::BETA * wgy * AC::V - AC::ZETA * u_i.1) * AC::DT;
    let duz = (-AC::BETA * wgz * AC::V - AC::ZETA * u_i.2) * AC::DT;

    data.utmp[iter] = u_i + point::Point(dux, duy, duz);
    data.ptmp[iter] = p_i + point::Point(dux * AC::DT, duy * AC::DT, duz * AC::DT);
}

Unfortunately the performance is not very good currently since it takes 6000 ms.

--OLD MESSAGE--
Hello @daboross!

Thank you very much for your interest and example code. I have tried to implement it:

let p_i = data.pvec[iter];

    let (wgx, wgy, wgz) = data
    .pvec
    .par_iter()
    .enumerate()
    .map(|(j, p_j)| {
        if j != iter {
            let rij = p_i - *p_j;
            let mut rij2 = (rij * rij).sum();
            if rij2 <= AC::H2 {
                rij2 = rij2.sqrt();
                let rij1 = 1.0 / (rij2);
                let q = rij2 * AC::H1;
                let q1 = q - 2.0;
                let qq3 = q * q1 * q1 * q1;
                let wq = AC::AD * AC::FAC * qq3;

                wgx = 2.0;
                wgy = 2.0;
                wgz = 2.0;
            }
            return
            (wgx,wgy,wgz)
            ;
        }
        // if we don't do the calculation, just contribute '0' to the sum
        (0.0, 0.0, 0.0)
    })
    .reduce(|| (0.0, 0.0, 0.0), |a, b| (a.0 + b.0, a.1 + b.1, a.2 + b.2));

But I get the following error:

error[E0425]: cannot find value `wgz` in this scope
  --> src/algorithm.rs:55:21
   |
55 |                     wgz += wq * (rij.2 * rij1) * AC::H1;
   |                     ^^^ not found in this scope

error: aborting due to 3 previous errors

For more information about this error, try `rustc --explain E0425`.
error: could not compile `a`.

And the same for all wgx/y/z. I am a bit new to Rayon, so I struggle a bit with what to change to fix it. It is a scoping problem.

Kind regards

Not a performance-related point, but I think your Particles<'a> struct is unidiomatic, as you are introducing double indirection (&mut Particles<'a>) to use it. I would make it an owned struct and rewrite main as follows:

// point.rs
pub struct Particles {
    pub pvec: Vec<Point>,
    pub uvec: Vec<Point>,
    pub ptmp: Vec<Point>,
    pub utmp: Vec<Point>
}

// main.rs
    let points = dataio::readpoints(FILENAME);
    let velocities = genvelocity(&points);

    let mut data = point::Particles { pvec: points.clone(),
                                      uvec: velocities.clone(),
                                      ptmp: points,
                                      utmp: velocities
                                    };
...

    plotting::plot_points(&data.pvec, NAME);

I'll investigate more.

You can experiment with different values for the RAYON_NUM_THREADS environment variable to see if that affects performance. I'd start by comparing the single-threaded implementation against the threaded implementation with 1 and 2 threads.
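
For example, something like the following could be used to compare thread counts programmatically instead of via the environment variable (just a sketch; it assumes nothing else has configured Rayon's global pool yet):

// Hypothetical: pin Rayon's global pool to a fixed thread count before
// any parallel work runs, so 1-, 2- and N-thread runs can be compared.
fn main() {
    rayon::ThreadPoolBuilder::new()
        .num_threads(1) // try 1, 2, 4, ... and compare timings
        .build_global()
        .expect("the global Rayon pool was already initialized");

    // ... rest of the program ...
}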

I've been thinking about the algorithm more, and have a possible suggestion.

It looks like in every iteration inside packstep_single, the computation you're doing is 100% determined by the value in data.pvec. But this never changes iteration-to-iteration.

What if you did all that math once, stored it in something like a Vec<(f32, f32, f32)>, then did those last computations n times rather than duplicating the whole calculation only to exclude a single item?

Going even further, I think you could just sum them up with all the items once, and then for each packstep_single, simply subtract the calculated wgx,y,z for data.pvec[j] and do those final computations.

Thoughts?

Hello @pcpthm!

Thank you for your suggestion, I really like it. I ended up doing it like this:

let pini = dataio::readpoints(FILENAME);
let uini = genvelocity(&pini);
let ptmp = pini.clone();
let utmp = uini.clone();

let mut data = point::Particles { pvec: pini, 
                                  uvec: uini,  
                                  ptmp: ptmp,
                                  utmp: utmp
                            };

I transformed the struct as you suggested. It made the code a lot more readable, and now I cannot use "pini" anymore when doing the plot, which I like, since it is now properly owned by the "data" struct.

Kind regards

Hello!

I used this command in main:

rayon::ThreadPoolBuilder::new().num_threads(8).build_global().unwrap();

And unfortunately it didn't affect performance in any good way; using fewer threads gave better results. My assumption is that this indicates the calculations are "too fast" to benefit from threading right now. I tested on a data set with 50000 points, compared to 3500 in the original post, and it still did not make a difference.

Kind regards

Hmm, I have tried to steer away from arrays and use for loops, since I have found that code easier to maintain, but I guess it is worth a try.

Basically you are right: for every call to packstep_single(data, iter), I look at particle p[iter] and compute how it behaves compared with all other particles in pvec.

I guess one could then, as you suggest, instead of using a loop, use principles of linear algebra: compute rij = p[iter] - pvec (as a tuple), then rij2 = (rij * rij).sum() (r^2, an f32), set every value which does not satisfy rij2 <= H2 to zero, then take the square root of every element, multiply by AC::H1 to get q, and then do the rest.

The reason I have avoided arrays/vectors like the "plague" is my bad experience with them in Julia and Matlab. I would love to see an example of how to use arrays/vectors/slices in Rust for this case, but I think the performance would be worse and I would have to apply "map" in some smart way to get the same performance. Would love to be proven wrong though.
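
As an illustration only, here is a rough sketch of what that array-style formulation could look like with plain slices and iterators; it assumes the Point type (with Sub, Mul, Copy and sum()) and the AC constants used elsewhere in this thread:

// Hypothetical array-style formulation: compute rij and rij^2 for every
// particle at once, mask out the entries outside the kernel support,
// and only then finish the per-element math and sum the contributions.
fn kernel_gradient_arraywise(p_i: Point, pvec: &[Point], iter: usize) -> (f32, f32, f32) {
    // Stage 1: rij and rij^2 for all j.
    let rij: Vec<Point> = pvec.iter().map(|p_j| p_i - *p_j).collect();
    let rij2: Vec<f32> = rij.iter().map(|r| (*r * *r).sum()).collect();

    // Stage 2: skip self-interaction and everything outside H2,
    // apply the kernel to the survivors, and sum.
    rij.iter()
        .zip(&rij2)
        .enumerate()
        .map(|(j, (r, &r2))| {
            if j == iter || r2 > AC::H2 {
                return (0.0, 0.0, 0.0);
            }
            let d = r2.sqrt();
            let q = d * AC::H1;
            let q1 = q - 2.0;
            let wq = AC::AD * AC::FAC * q * q1 * q1 * q1;
            let s = wq * AC::H1 / d;
            (s * r.0, s * r.1, s * r.2)
        })
        .fold((0.0, 0.0, 0.0), |a, b| (a.0 + b.0, a.1 + b.1, a.2 + b.2))
}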

Kind regards

I think this is probably a good idea.

I tried to implement my suggestion, and found that I had misread the code, and it was not as simple as I thought! Sorry for making a bad recommendation.

Thanks for the well thought out response, though!

I now understand your code.

The performance critical loop is

    for (j, p_j) in data.pvec.iter().enumerate() {
        if j != iter { // (1)
            let rij = p_i - *p_j;
            let mut rij2 = (rij * rij).sum();
            if rij2 <= AC::H2 { ...  } // (2)
        }
    }

I counted what proportion of the time the if (2) is taken, and the result is that it is not taken an order of magnitude more often than it is taken. Thus, optimizing inside the if block is largely meaningless; rather, we want to reduce the number of discarded loop iterations as much as possible.

I recognize this problem as nearest neighbor search (NNS) (in the 3D Euclidean metric). In particular, your points are static within a packstep_s call, so we can build a static data structure for a fast "find all points within distance r of the query point" query. Furthermore, the problem is more restricted, since all radii r are the same (AC::H2) and all your query points are given in advance (this is called an offline problem), though I don't know whether offline NNS can be solved faster than online NNS.

The NNS problem can be solved by various algorithms. Space partitioning methods such as the k-d tree are among the most famous ones. I'm pretty sure there are existing implementation crates in Rust you can use; I found kdtree and fux_kdtree. You can research more (edit: the kdtree crate is not fast, giving only a ~1.3x speedup, and the fux_kdtree crate doesn't provide a "get all close points" API).

I implemented a simple pruning method and I can already see a 5x speedup if the exact floating-point summation order is not preserved, or a 3x speedup while producing exactly the same result: https://gist.github.com/pcpthm/fa72d57d33e3691c8a81aba9542eb400.

No problem! Thank you for spending time and trying to come up with suggestions.

Kind regards

I see, so basically there is not much more to optimize other than using neighbor search etc.

Could you point me to where I can read a bit more about your simple pruning method? In the end I will probably go with one of the packages; I would just like to learn. Thanks for helping me out with a functional example.

I saw a 70% performance increase using your algorithm.rs.

Kind regards

Sorry, the pruning method (sort the points by a coordinate (or by a random direction) and upper-bound the distance by the 1D distance) should be well known to computational geometry people, but I cannot find a reference for it. Maybe it can be called just "the divide-and-conquer algorithm for the closest pair problem, minus the divide and conquer".
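
For anyone curious, here is a rough illustration of that kind of pruning (not the gist above, just a sketch using the same Point type as the rest of the thread): sort the points by their x-coordinate once, then for each point scan outwards and stop as soon as the 1D distance alone exceeds the radius, since the full 3D distance can only be larger.

// Illustrative 1D pruning. `points` must already be sorted by the
// x-coordinate (field .0); returns the indices of all points within
// distance h of points[i]. Assumes Point supports Copy, Sub, Mul and sum().
fn neighbors_within(points: &[Point], i: usize, h: f32) -> Vec<usize> {
    let p_i = points[i];
    let h2 = h * h;
    let mut result = Vec::new();

    let mut check = |j: usize, p_j: Point| {
        let d = p_i - p_j;
        if (d * d).sum() <= h2 {
            result.push(j);
        }
    };

    // Scan to the left of i; stop once the x-distance alone exceeds h.
    for j in (0..i).rev() {
        if (p_i.0 - points[j].0).abs() > h {
            break;
        }
        check(j, points[j]);
    }
    // Scan to the right of i.
    for j in (i + 1)..points.len() {
        if (points[j].0 - p_i.0).abs() > h {
            break;
        }
        check(j, points[j]);
    }
    result
}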

Also, I couldn't find a good Rust NNS crate, but someone else might be able to answer that question. Good luck.

I see, thank you very much for your help. Will have to dive deeper into it.

Kind regards

It turns out the problem (offline, fixed-radius neighbor search) can be described exactly as unit disk graph construction. According to #Computational_complexity, it is possible to solve the problem in linear time (as a function of the output size). Could the algorithm described here (grid-hashing based) be faster in practice? It needs investigation.

I like the idea, but I think this algorithm will turn out to be quite slow in the end, due to having to calculate the intersections between circles when finding neighbours. What I have tried to implement in the past is based on this small guide:

http://cacs.usc.edu/education/cs596/01-1LinkedListCell.pdf

The basic idea is as follows:

Split the domain into different cells and figure out which cell each particle lives in. Then construct a header array, head, which holds the "last particle" of each cell. So, for example, going into the header at index 3, i.e. head[3] = 7, we know we have to go into the next list, lscl, at index 7, where lscl[7] = 3; then we continue and go to lscl[3] = 1, and finally we stop since lscl[1] = E (empty).
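
To make the head/lscl idea a bit more concrete, here is a minimal sketch (2D, made-up names, not the guide's actual code) of how the two arrays could be built and traversed, assuming the points have already been shifted into the positive quadrant:

// Hypothetical linked-list cell construction. EMPTY marks the end of a
// cell's chain; head[c] is the last particle inserted into cell c and
// lscl[i] is the next particle in the same cell as particle i.
const EMPTY: usize = usize::MAX;

fn build_cell_lists(points: &[(f32, f32)], cell: f32, nx: usize, ny: usize)
    -> (Vec<usize>, Vec<usize>)
{
    let mut head = vec![EMPTY; nx * ny];
    let mut lscl = vec![EMPTY; points.len()];

    for (i, &(x, y)) in points.iter().enumerate() {
        let cx = ((x / cell) as usize).min(nx - 1);
        let cy = ((y / cell) as usize).min(ny - 1);
        let c = cy * nx + cx;
        lscl[i] = head[c]; // chain to the previous head of this cell
        head[c] = i;       // this particle becomes the new head
    }
    (head, lscl)
}

// Traverse one cell: start at head[c] and follow lscl until EMPTY.
fn for_each_in_cell(head: &[usize], lscl: &[usize], c: usize, mut f: impl FnMut(usize)) {
    let mut i = head[c];
    while i != EMPTY {
        f(i);
        i = lscl[i];
    }
}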

I have had a few problems implementing this in the past:

  1. Doesn't seem to work for arbitrary coordinate systems, i.e. it has to be in the positive quadrant, requires a rescaling of coordinates
  2. If a particle is directly at the edge it seems to break the algorithm (user experience)

Then the basic idea is to look at 9 cells at a time, like this:


https://www.computer.org/csdl/journal/td/2014/01/ttd2014010043/13rRUxYrbLS

Kind regards

What I have tried to implement in the past is based on this small guide:

http://cacs.usc.edu/education/cs596/01-1LinkedListCell.pdf

This is basically the same algorithm as the one I linked to on Wikipedia (the idea is grid bucketing), with some implementation techniques added.

but I think this algorithm will turn out to be quite slow in the end, due to having to calculate the intersections between circles when finding neighbours

This is not true, as the algorithm is the same. Checking whether there is an intersection between two disks of radius r is the same as checking whether the two center points are closer than distance 2r; "disk intersection" is just another way of saying "close neighbor".

Doesn't seem to work for arbitrary coordinate systems, i.e. it has to be in the positive quadrant, requires a rescaling of coordinates

Yes, you have to translate the coordinates (something like ((y - y_min) * (x_max + 1 - x_min) + (x - x_min))).
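
In code, that translation could look something like the following sketch (made-up names, integer grid coordinates that may be negative):

// Hypothetical: map a (possibly negative) grid coordinate to a linear
// cell index by shifting it into the positive quadrant first.
fn cell_index(x: i64, y: i64, x_min: i64, x_max: i64, y_min: i64) -> usize {
    let width = (x_max + 1 - x_min) as usize; // number of cells per row
    ((y - y_min) as usize) * width + ((x - x_min) as usize)
}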

If a particle is directly at the edge it seems to break the algorithm (user experience)

If you have the grid size the same as your radius and you need an inclusive range, then yes, that is an edge case. A solution is to translate the points by a small amount so that no point lies exactly on an edge, or to use a slightly larger grid size.

Since I needed it for a similar problem, I implemented the Linked List Cell neighbor search algorithm in Rust. I just finished it, and I posted my implementation in the code review section of the forum. Feel free to check it out :slight_smile:

I see, thank you very much for the explanation regarding the algorithms. And it is good to get confirmation that it does in fact have to be rescaled.

Thanks.

Kind regards

I spent some more time on it and found the mistake. Basically, when using Rayon initially, we were applying the threads to the computation of the elements inside an array in each time step. Due to communication between threads, this would end up taking a lot of time.

Instead, if one chooses to let each thread calculate everything with respect to one particle, one gets a good performance boost. So, in "pseudo-code" (since I have changed the algorithm a bit):

pub fn packstep_s(data: &mut point::Particles,
                  ncut: usize) {
    let n = data.pvec.len();

    let mut pi_vec = data.pvec[0..(n-ncut)].to_vec();
    let mut ui_vec = data.uvec[0..(n-ncut)].to_vec();

    pi_vec
    .par_iter_mut()
    .zip(&mut ui_vec)
    .enumerate()
    .for_each(|(i,(p_ptr,u_ptr))|
        {
            let up_points = packstep_single(&data.pvec,&data.uvec,i);
            *p_ptr = up_points.ptmp;
            *u_ptr = up_points.utmp;
        }
    );

    // Only alters relevant indices
    data.pvec[..(n - ncut)].clone_from_slice(&pi_vec);
    data.uvec[..(n - ncut)].clone_from_slice(&ui_vec);
}

And running this gives a performance of 231 ms, which is roughly 4 times better than the 1000 ms obtained initially. This is without any of the neighbour search algorithms applied.

Kind regards
