Fast sqrt for f64?

Hello everybody.

I have an n-body algorithm, and at one point I have to compute the sqrt of a large f64.

This operation takes more than 50% of the total program time. Are there alternatives to doing the sqrt other than number.sqrt()? I don't need a lot of precision, so approximations would be useful.

Here is the relevant part of the code (I'm using the fast_floats crate, which is why I need the .get()):

let dsquared = (dx * dx) + (dy * dy) + (dz * dz) + 1e-20;
let d32 = 1. / dsquared.get().sqrt().powi(3);

Thank you!

1 Like

For this specific case, you could use the fast inverse square root algorithm from Quake III Arena, implemented in the crate fast_inv_sqrt. Bonus points: the author kept the original comments in their port :grin:
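
For reference, the heart of the trick looks roughly like this for f32 (an illustrative, untested sketch; the crate has the real, tested port, including an f64 variant):

fn quake_rsqrt(x: f32) -> f32 {
    // Bit-level magic-constant guess, then one Newton-Raphson refinement step.
    let y = f32::from_bits(0x5f3759df - (x.to_bits() >> 1));
    y * (1.5 - 0.5 * x * y * y)
}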

EDIT:
Quoted from wikipedia:

With subsequent hardware advancements, especially the x86 SSE instruction rsqrtss, this method is not generally applicable to modern computing,[4] though it remains an interesting example both historically[5] and for more limited machines, such as low-cost embedded systems.

If you are using an x86_64 machine, this probably won't help you in terms of performance.

6 Likes

Thank you!

I think that inv_sqrt is for computing 1/sqrt(x), right? But I have to compute 1/(sqrt(x).powi(3)), so I think I will get different results, or not?

(1/x)^n == 1/(x^n)

:blush:
Obviously the numeric results will probably differ because of floating-point approximations, but I don't think that should be an issue in your case.

3 Likes

Oh yes, you are right!

I don't know why, but doing

dsquared.get().inv_sqrt32().powi(3); (or using inv_sqrt64)

takes much longer than before.

I have AVX512 instructions and I'm compiling with:

RUSTFLAGS='-Ctarget-feature=+avx512f'  cargo run --release --bin program

I don't know if the AVX512 instructions are affecting something.

You are using an x86_64 architecture, therefore the fast inverse square root algorithm is slower than what you get from the native sqrt instructions.

Take this with a grain of salt, but maybe you could see whether you can convince the compiler to auto-vectorize the loops over multiple points. In theory, you could get some speedup by performing multiple square roots using SIMD, but you need to benchmark to check whether you are really gaining something or not.

Take this simple example: I have no idea whether the version that performs 4 square roots is faster or slower than the one that performs 8.
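
Something along these lines (an untested sketch with hypothetical function names):

pub fn sqrt4(xs: [f32; 4]) -> [f32; 4] {
    // With optimizations on, the compiler may fuse these into one SIMD sqrt.
    [xs[0].sqrt(), xs[1].sqrt(), xs[2].sqrt(), xs[3].sqrt()]
}

pub fn sqrt8(xs: [f32; 8]) -> [f32; 8] {
    let mut out = [0.0_f32; 8];
    for (o, x) in out.iter_mut().zip(&xs) {
        *o = x.sqrt();
    }
    out
}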

Oh, by the way: if you don't need high precision, then you could use f32 instead of f64!

3 Likes

And I forgot one detail: if you really want to use auto-vectorization, be sure your floats are correctly aligned. My example is how you shouldn't write SIMD-friendly code.
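
For example, one hypothetical way to request a register-friendly alignment on a block of floats:

// 64-byte alignment matches a full AVX-512 register; align(32) would fit AVX2.
#[repr(C, align(64))]
struct AlignedBlock([f32; 16]);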

As always, SIMD can be a tricky beast, there are many things to take into account.

1 Like

If you can use f32 and don't need cross-architecture code, you could try core::arch::x86_64::_mm_rsqrt_ss to use the processor's fast inverse sqrt.
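
Roughly like this (untested sketch; SSE is baseline on x86_64, so the intrinsic is always available there):

#[cfg(target_arch = "x86_64")]
fn rsqrt_approx(x: f32) -> f32 {
    use std::arch::x86_64::{_mm_cvtss_f32, _mm_rsqrt_ss, _mm_set_ss};
    // rsqrtss: hardware approximation of 1/sqrt(x), roughly 12 bits of precision.
    unsafe { _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x))) }
}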

1 Like

Could you provide more context for how the resulting value is used and what the loop structure is like?

1/(x*sqrt(x)) should be slightly better than 1/(sqrt(x)³) (one multiplication fewer), although if the accuracy of the approximate methods is sufficient, I'd use 1/sqrt(x³), since 1/sqrt(y) can be a single instruction. Since you mentioned AVX512, see also core::arch::x86_64::_mm_rsqrt14_sd.
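
Concretely, the first form applied to the line from your post (same math, one multiply saved):

let x = dsquared.get();
let d32 = 1. / (x * x.sqrt()); // 1/(x*sqrt(x)) == 1/sqrt(x).powi(3)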

1 Like

Thanks to all !

The main objective is to use automatic vectorization. I suspect the sqrt itself can't be improved, but perhaps the loops can. This is my main function:

#[inline]
fn run(galaxy: &mut Galaxy) {
    let bodies = &mut (galaxy.bodies);
    let poses = &mut (galaxy.poses);
    let masses = &(galaxy.masses);
    (0..D).for_each(|_| {
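        // One time step: accumulate all pairwise forces (O(N²)), then integrate.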
        let forces = poses.par_iter().zip(&masses[..]).map(|(pj, massj)| {
            let force =
                zip(&poses[..], &masses[..]).fold(Force::new(), |force_acc, (pi, massi)| {
                    let dx = pi.x - pj.x;
                    let dy = pi.y - pj.y;
                    let dz = pi.z - pj.z;
                    let dsquared = (dx * dx) + (dy * dy) + (dz * dz) + SOFT;
                    let d32 = 1. / dsquared.get().sqrt().powi(3);
                    let f = G * *massj * *massi;
                    Force::new_with(
                        force_acc.x + f * dx * d32,
                        force_acc.y + f * dy * d32,
                        force_acc.z + f * dz * d32,
                    )
                });

            Force::new_with(
                (force.x / *massj) * 0.5 * DT,
                (force.y / *massj) * 0.5 * DT,
                (force.z / *massj) * 0.5 * DT,
            )
        });

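        // Apply each accumulated force: update velocities and per-step position deltas.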
        bodies.par_iter_mut().zip(forces).for_each(|(body, force)| {
            let (dvx, dvy, dvz) = body.add_force(&force);

            body.xvi = dvx;
            body.yvi = dvy;
            body.zvi = dvz;
            body.dpx = dvx * DT;
            body.dpy = dvy * DT;
            body.dpz = dvz * DT;
        });

        poses
            .par_iter_mut()
            .zip(&bodies[..])
            .for_each(|(pos, body)| pos.add_body(body));
    });
}

I'm using rayon for parallel loops and fast-floats for approximations.

If I use f64 with N = 65536, it takes ~16 seconds. If I comment out the .sqrt() and compute only

let d32 = 1. / dsquared.get().powi(3);

it takes ~8 seconds. But perhaps restructuring some of the loops would help? Or not, I don't know.

If you need the full code I could send it.

Again, thank you so much. I'm learning every day about Rust and I have doubts as to whether what I am doing is the best way.

Do you have something small that will compile? This looks really interesting!

Do be careful here; in numerical simulations, errors accumulate with every tick. A bad approximation can cause your results to drift considerably. You will want to check for this.

I'd wager that there are better approximations for this computation as a whole than substituting approximations for sqrt. I wouldn't worry too much about it, though.

Check the assembly (using something like godbolt.org) to see whether the compiler is actually generating simd instructions.

Also, just a nit: implement Add and Sub for Force so you can add and subtract Force values directly rather than having to disassemble them and create new ones. You can implement similar operations for Body too. This will simplify your code a lot.
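
A minimal sketch, assuming Force has public x/y/z fields and the Force::new_with constructor from your snippet:

use std::ops::{Add, Sub};

impl Add for Force {
    type Output = Force;
    fn add(self, rhs: Force) -> Force {
        Force::new_with(self.x + rhs.x, self.y + rhs.y, self.z + rhs.z)
    }
}

impl Sub for Force {
    type Output = Force;
    fn sub(self, rhs: Force) -> Force {
        Force::new_with(self.x - rhs.x, self.y - rhs.y, self.z - rhs.z)
    }
}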

1 Like

Thank you for the recommendations!

I will do that.

I am studying the _mm512 instructions. I have one problem: I can't reuse the result of the instructions. For example, this is the expression we were talking about:

        let dsquared = (dx * dx) + (dy * dy) + (dz * dz) + SOFT;
        let dsquared = _mm512_sqrt_pd(_mm512_set1_pd(dsquared.get()));
        let d32 = _mm512_div_pd(
            _mm512_set1_pd(1.),
            _mm512_mul_pd(_mm512_mul_pd(dsquared, dsquared), dsquared),
        );

Then, if I want to get the result, I do:

        let mut result: f64 = 0_f64;
        _mm512_storeu_pd(&mut result, d32);

This works perfectly, but the problem is that I can't reuse result; for example, I can't do:

let var = result + 1.;

I always get 0, even though result has a value when I print it just before.

What am I missing?

Thanks a lot!

I've actually been looking through Intel's documentation; they also have _mm512 intrinsics for calculating the hypotenuse of floats (i.e. sqrt(x^2+y^2)). Maybe these will help you more.

That's definitely UB since you are writing 64 bytes into an 8-byte variable.

The intrinsic to do what you want in that case (extract the first f64) seems to be missing from Rust's std::arch::x86_64. Intel's documentation for the C version: _mm512_cvtsd_f64
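
In the meantime, a sketch of a sound way to read the lanes back out: store all eight into a buffer that is actually 64 bytes.

let mut out = [0.0_f64; 8];
// _mm512_storeu_pd writes all 8 lanes, so give it the full 64 bytes.
unsafe { _mm512_storeu_pd(out.as_mut_ptr(), d32) };
let result = out[0]; // a plain f64 you can now reuse freely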

The point of SIMD (single instruction, multiple data) is that you operate in the same way on multiple values at once. For the program you're implementing, I think a reasonable goal would be to compute the force contributions of multiple (say, 8) bodies on one body at the same time. For example, you'd load one SIMD vector with the x-coordinates of 8 bodies, another with the y-coordinates, and so on, and combine these with SIMD vectors holding broadcast copies of the other body's corresponding values.

But let's step back a bit: the thing to note is that it would be most convenient to have all the x-coordinates of the bodies in one array, all the y-coordinates in another, and so on. This layout is known as a structure of arrays, and it also better facilitates auto-vectorization. You can play with the intrinsics a bit to get a feel for what SIMD does well or not-so-well, but I don't recommend implementing your program with them directly. It can get really messy quickly, which will make it very difficult to restructure your code.
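
A minimal sketch of that layout (hypothetical type, for illustration):

// Structure of arrays: each coordinate component is contiguous in memory,
// so 8 consecutive x's can be loaded into a single AVX-512 register.
struct Positions {
    xs: Vec<f64>,
    ys: Vec<f64>,
    zs: Vec<f64>,
}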

For one different idea on your solution: have you considered using Newton's third law to optimize your program? This could reduce the number of times you have to compute those slow square roots and divisions by 50%, though it will require reorganizing the computation.
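
A rough sketch of the idea, with masses, G, and your Force type omitted for brevity (plain tuples, a hypothetical accumulate_pairwise, and the softening inlined as 1e-20):

// F_ij == -F_ji, so each unordered pair is visited exactly once.
fn accumulate_pairwise(pos: &[(f64, f64, f64)], forces: &mut [(f64, f64, f64)]) {
    let n = pos.len();
    for i in 0..n {
        for j in (i + 1)..n {
            let dx = pos[j].0 - pos[i].0;
            let dy = pos[j].1 - pos[i].1;
            let dz = pos[j].2 - pos[i].2;
            let d2 = dx * dx + dy * dy + dz * dz + 1e-20;
            let d32 = 1. / (d2 * d2.sqrt()); // one sqrt per pair instead of two
            forces[i].0 += dx * d32;
            forces[i].1 += dy * d32;
            forces[i].2 += dz * d32;
            forces[j].0 -= dx * d32; // equal and opposite reaction
            forces[j].1 -= dy * d32;
            forces[j].2 -= dz * d32;
        }
    }
}

One caveat: each pair now writes to two accumulators, which makes the rayon-style parallelism of your current version trickier.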

Using a &mut [f32; N] parameter instead of returning an array results in much more reasonable code for the 4 case.

1 Like

Recently I've added vectorization and SIMD to an n-body simulation (it's not Newton's laws, but similar). The longest part was figuring out which intrinsics were best suited, and debugging (not all the intrinsics take their arguments in the same order)...

As expected, the vectorized f32 version is 4 times faster. (If you have AVX-512, you could probably get 8 times faster.)

You can also use the Barnes-Hut approximation to be O(n log n) instead of O(n^2).
