Hello!

**GOAL**: Code review of a single-core implementation of an algorithm that currently runs 100 time steps in 1 s. The goal is, if possible, to reach 100 time steps in 0.2 s.

I am a pretty new user (started about three days ago) and would very much like a code review of the first thing I have made in Rust. It is an algorithm based on a paper found here:

https://www.researchgate.net/publication/256688486_Particle_packing_algorithm_for_SPH_schemes

The specific algorithm itself is not that interesting; I just wanted to leave the reference for those with a deeper interest. The output of the algorithm is a filled square, as such:

Initially the particles are distributed as a uniform circular distribution inside the square (I was only allowed one picture in the post, so please imagine that).

So basically the algorithm fills out the empty space.

I have uploaded my code: https://github.com/AhmedSalih3d/particlepacking-rust/tree/initial

Please clone from the initial branch, since it holds an unchanged version. The code consists of five files:

- algorithm.rs - The most important one in this case, it implements the algorithm I want to increase performance of
- dataio.rs - Simple way to read the data points in "RustCircleUniformGridSQUARE.txt"
- main.rs - Runs the algorithm
- plotting.rs - Simple way to plot the result as svg, which is saved as "Points.svg". Can be visualized using Chrome etc.
- point.rs - Implementation of a simple "Point" structure consisting of three f32 values. Operator overloading for easier math and println! support is implemented there.

Now I will show the most relevant parts of the project i.e. the parts I want specifically to be optimized.

**main.rs**

In here I initialize four vectors of my custom struct "Point". The data is read from the text file named by `FILENAME`. The reason I need two of each (hence the cloning) is shown in the next code snippet. For readability I pass all four vectors by reference into a struct. `NCUT` makes sure that the particles forming the square itself are not computed (see the first picture). I then use a timing functionality to time the performance of my loop. THIS is the main part I want to optimize.

```
fn main() {
    const FILENAME: &str = "RustCircleUniformGridSQUARE.txt";
    let mut pini = dataio::readpoints(FILENAME);
    let mut uini = genvelocity(&pini);
    let mut ptmp = pini.clone();
    let mut utmp = uini.clone();
    let mut data = point::Particles {
        pvec: &mut pini,
        uvec: &mut uini,
        ptmp: &mut ptmp,
        utmp: &mut utmp,
    };
    const NCUT: usize = 800;
    let now = Instant::now();
    for _ in 0..100 {
        algorithm::packstep_s(&mut data, NCUT);
    }
    println!("{}", now.elapsed().as_millis());
    const NAME: &str = "Points.svg";
    // How to transfer ownership correctly..
    plotting::plot_points(&pini, NAME);
}
```

**algorithm::packstep_s(&mut data, NCUT);**

This function takes the "data" struct, which holds references to all the vectors, and `ncut`, and runs the algorithm on every particle from index 0 to n - ncut. When the calculation for all particles is done, it clones the tmp results back into the main vectors. This ensures that all calculations are performed on data from the same time step. The function `packstep_single(data, i)` is shown in the next snippet.

```
pub fn packstep_s(data: &mut point::Particles<'_>, ncut: usize) {
    let n = data.pvec.len();
    for i in 0..(n - ncut) {
        packstep_single(data, i);
    }
    data.pvec[..(n - ncut)].clone_from_slice(&data.ptmp[..(n - ncut)]);
    data.uvec[..(n - ncut)].clone_from_slice(&data.utmp[..(n - ncut)]);
}
```

**packstep_single(data: &mut point::Particles<'_>, iter: usize)**

This is where the heavy calculation happens, and it is my primary focus for optimization. I have defined a constant struct "AC", which holds constants to ease the implementation; the algorithm is then carried out as shown below.

```
fn packstep_single(data: &mut point::Particles<'_>, iter: usize) {
    let mut wgx = 0.0;
    let mut wgy = 0.0;
    let mut wgz = 0.0;
    let p_i = data.pvec[iter];
    for (j, p_j) in data.pvec.iter().enumerate() {
        if j != iter {
            let rij = p_i - *p_j;
            let mut rij2 = (rij * rij).sum();
            if rij2 <= AC::H2 {
                rij2 = rij2.sqrt();
                let rij1 = 1.0 / rij2;
                let q = rij2 * AC::H1;
                let q1 = q - 2.0;
                let qq3 = q * q1 * q1 * q1;
                let wq = AC::AD * AC::FAC * qq3;
                wgx += wq * (rij.0 * rij1) * AC::H1;
                wgy += wq * (rij.1 * rij1) * AC::H1;
                wgz += wq * (rij.2 * rij1) * AC::H1;
            }
        }
    }
    let u_i = data.uvec[iter];
    let dux = (-AC::BETA * wgx * AC::V - AC::ZETA * u_i.0) * AC::DT;
    let duy = (-AC::BETA * wgy * AC::V - AC::ZETA * u_i.1) * AC::DT;
    let duz = (-AC::BETA * wgz * AC::V - AC::ZETA * u_i.2) * AC::DT;
    data.utmp[iter] = u_i + point::Point(dux, duy, duz);
    data.ptmp[iter] = p_i + point::Point(dux * AC::DT, duy * AC::DT, duz * AC::DT);
}
```

So this was a quick overview of the most important functionality; my hope is that I have intrigued some of you enough to give it a look. I compile it using:

```
RUSTFLAGS="-Ctarget-cpu=native" cargo run --release
```

When the program is run, it prints ~1000 ms to the terminal, i.e. about 1 second. Since I run 100 time steps, this gives me an "FPS" of 100. I would like this to be around 500, and I am sure some more advanced Rust users can help me achieve this.

**What I Have Done Already**

- I have run `cargo clippy` and implemented all of its suggestions.
- I have spent time implementing the algorithm to the best of my ability. I have already implemented it in Matlab, Python, C++ and Julia; Julia especially shows promising results, but Rust is already very good with my naive implementation.
- I have implemented a basic `Cargo.toml` based on the default suggestions for `release`.
**What can be done**

Other than improving single-core performance, it would be possible to implement, for example, threading. This would be done by calling `packstep_single` on different threads with different values of `i`, in other words the particle index. I was not successful in doing this, since I ran into a lot of problems with the shared references in the `data` struct. Another approach would be `async`, since the order of calculation does not matter in my setup.

Another idea would be to implement a neighbour-search algorithm so that I only have to look at a few particles at a time; if you know of any such libraries in Rust, please do share.

Finally, one could turn to the GPU, but I think that is a bit too advanced for now. Currently I am primarily interested in making sure I have pushed single-core performance to the max, and then in implementing threading; I would love some help with that.

Kind regards, and thanks for reading if you made it all the way through!