Understanding Rust's Auto-Vectorization and Methods for Increasing Speed

Hi there!

I recently started learning Rust because I need an extreme speed-up in a project that was previously written only in Python.

The first (and simplest) function I translated simply calculates the norms of many 2D vectors. The arguments are a vector of x components and one of y components, both of length n. The output is a vector of length n containing all the norms.

To my surprise, even this first, simple implementation was 2.5 times faster than the fastest Python solution:

pub fn calc_norm(xs: &Vec<f64>, ys: &Vec<f64>) -> Vec<f64> {
    let mut norm: Vec<f64> = vec![0.0; xs.len()];
    for i in 0..xs.len(){
        norm[i] = ((xs[i]).powi(2) + (ys[i]).powi(2)).sqrt();
    }
    return norm
}

This however is where my questions start to arise.

  • Firstly, what is the usual way of dealing with a problem like this in Rust? In Python my approach was to always look for a NumPy solution, as that was always faster. Does this apply to Rust too, i.e., should I just look for a crate that does this? And if so, how would one find the appropriate crate?

  • Secondly, how can I vectorize this function? I read through many blogs/questions/forum posts and tried their solutions, but it didn't work out at all; it actually only made my code take longer to run. I thought the assert statement in the code below was needed (to allow for auto-vectorization) so the compiler knows that we never go out of bounds of the array (is this true?). I have also read that iterators should be much faster, but they actually made my code slower, so what went wrong there?

pub fn calc_norm2(xs: &Vec<f64>, ys: &Vec<f64>) -> Vec<f64> {
    assert_eq!(xs.len(), ys.len(), "Can't calculate the 2d norm if the number of x and y components doesn't match");
    let mut norm: Vec<f64> = vec![0.0; xs.len()];
    for (i, rs) in xs.iter().zip(ys.iter()).enumerate() {
        norm[i] = ((rs.0).powi(2) + (rs.1).powi(2)).sqrt();
    }
    return norm
}
  • My final question on this topic is: what are general principles to follow in Rust to make my code as fast as possible?

Thank you in advance for your answer!

No, not necessarily. In Python you get substantial speedups from NumPy because Python is interpreted, whereas NumPy is written in C (and presumably compiled with an optimizing C compiler). Rust is already compiled, so a simple loop is in itself comparable to a loop in C. (Except that you are using indexing instead of iterators in your example above, so there may be bounds checking; that can be removed by using zip instead.)

However, if you want convenient and type-safe manipulation of general arrays, you should look into the ndarray crate: you probably shouldn't re-implement such functionality all the time, even if Rust lets you do it as fast as C or NumPy. That would risk introducing bugs related to edge cases that the ndarray crate has probably fixed long ago.
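
For illustration, a minimal sketch of what this could look like with ndarray (assuming the crate is in Cargo.toml; the function name is made up for this example):

use ndarray::Array1;

// Arithmetic on &Array1 is element-wise and allocates a new array
// for the result; mapv applies a function to every element.
pub fn calc_norm_ndarray(xs: &Array1<f64>, ys: &Array1<f64>) -> Array1<f64> {
    (xs * xs + ys * ys).mapv(f64::sqrt)
}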

What you posted should auto-vectorize as-is. Are you compiling and running it in optimized (i.e., release) mode? If so, it's possible that the aforementioned bounds checking hinders vectorization. Instead of indexing, you can just collect:

pub fn calc_norm2(xs: &[f64], ys: &[f64]) -> Vec<f64> {
    assert_eq!(xs.len(), ys.len(), "Can't calculate the 2d norm if the number of x and y components doesn't match");
    xs.iter().zip(ys).map(|(&x, &y)| f64::sqrt(x*x + y*y)).collect()
}

Generally, try to follow the idioms of the language. For example, I removed the explicit indexing above. I also changed the input from &Vec<f64> to &[f64] because the latter lets you use the function even when you don't already have a vector in the first place, potentially avoiding an unnecessary allocation. And don't jump to unsafe for optimization until you have measured that the safe code is in fact a real bottleneck.
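
A quick (hypothetical) usage sketch of that flexibility, reusing calc_norm2 from above:

fn main() {
    let xs = vec![3.0, 5.0];
    let ys = [4.0, 12.0];
    // &Vec<f64> coerces to &[f64], and arrays borrow as slices too,
    // so both kinds of callers work without any copying or conversion.
    let norms = calc_norm2(&xs, &ys);
    assert_eq!(norms, vec![5.0, 13.0]);
}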

For your particular case, you can also have a look at the nalgebra crate. It generally provides good performance.

Consider using hypot.
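
For reference, a sketch of what that would look like. Note that f64::hypot guards against overflow/underflow of the intermediate squares, but it is a library call, so it may be slower than the naive formula and is unlikely to auto-vectorize; benchmark both:

pub fn calc_norm_hypot(xs: &[f64], ys: &[f64]) -> Vec<f64> {
    // x.hypot(y) computes sqrt(x*x + y*y) without the intermediate
    // squares overflowing or underflowing.
    xs.iter().zip(ys).map(|(&x, &y)| x.hypot(y)).collect()
}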

  1. Don't use &Vec<T> parameters; use &[T] instead. It's just a pure win -- less indirection, more flexible for your callers, and doesn't actually keep you from doing anything.

  2. Use "reslicing" to make it as obvious as possible to LLVM that your array indexes are in-bounds, if you're going to use indexes into multiple slices.

For example, this emits two bounds checks every iteration, because it needs to write the correct prefix of z and panic in the correct place if x or y is actually shorter than z:

pub fn demo_slow(x: &[i32], y: &[i32], z: &mut [i32]) {
    for i in 0..z.len() {
        z[i] = x[i] * y[i];
    }
}

But if you reslice all your parameters to the length you need,

pub fn demo_fast(x: &[i32], y: &[i32], z: &mut [i32]) {
    let n = z.len();
    let (x, y, z) = (&x[..n], &y[..n], &mut z[..n]);
    for i in 0..z.len() {
        z[i] = x[i] * y[i];
    }
}

Then it'll panic up-front if they're too short, and LLVM knows the indexing will always be in-bounds, so it can unroll and vectorize the loop. On something with AVX, it looks like it does 4×8×i32 multiplications at once: https://rust.godbolt.org/z/a58vYKW5M.

  3. For tiny loop bodies, always use internal iteration instead of external iteration. (Usual link: Rust’s iterators are inefficient, and here’s what we can do about it. | by Veedrac | Medium)

This iterator implementation, for example:

pub fn demo_iter(x: &[i32], y: &[i32], z: &mut [i32]) {
    let products = std::iter::zip(x, y).map(|(&x, &y)| x * y);
    std::iter::zip(z, products).for_each(|(z, p)| *z = p);
}

This optimizes to the same inner loop as demo_fast above: https://rust.godbolt.org/z/5Ez5YnY1x.

(It uses the shortest length between x, y, and z, though, rather than ever panicking, so it doesn't do exactly the same thing. Whether that's what you wanted is a requirements question.)

  4. Remember that floating-point is non-associative.

This vectorizes great:

pub fn dot_int(x: &[i32], y: &[i32]) -> i32 {
    std::iter::zip(x, y).map(|(&x, &y)| x * y).sum()
}

But this only unrolls, without vectorizing:

pub fn dot_float(x: &[f32], y: &[f32]) -> f32 {
    std::iter::zip(x, y).map(|(&x, &y)| x * y).sum()
}

(https://rust.godbolt.org/z/538TcPvrT)

That's because (a + b) + c and a + (b + c) can be different for floating-point, so LLVM is rightly not changing your code to do something different. Thus it's more difficult to get LLVM to auto-vectorize things involving floating-point than for things involving integers. It's still possible -- I have an example in SIMD for a FIR filter: unstable result and partitioning in as_simd - #9 by scottmcm -- but it'll probably be easier to find a library you like that provides the basics already optimized.
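
(Not the linked example, but one common pattern is to opt into a different summation order yourself by keeping several independent accumulators, which hands LLVM the parallelism it isn't allowed to create on its own. A sketch:)

pub fn dot_float_chunked(x: &[f32], y: &[f32]) -> f32 {
    const LANES: usize = 8;
    let n = x.len().min(y.len());
    let (x, y) = (&x[..n], &y[..n]);
    // Independent partial sums: we explicitly choose a different
    // (still deterministic) summation order than the sequential one.
    let mut acc = [0.0f32; LANES];
    for (xc, yc) in x.chunks_exact(LANES).zip(y.chunks_exact(LANES)) {
        for l in 0..LANES {
            acc[l] += xc[l] * yc[l];
        }
    }
    let mut sum: f32 = acc.iter().sum();
    // Scalar tail for the leftover elements.
    for (a, b) in x.chunks_exact(LANES).remainder().iter()
        .zip(y.chunks_exact(LANES).remainder())
    {
        sum += a * b;
    }
    sum
}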

(And if you're willing to use nightly to try out new stuff, check out things like f64x4 in core::simd.)
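
For example, a nightly-only sketch with portable SIMD. (Hedging: the portable_simd API is unstable and the exact trait/import paths have moved between nightlies, so treat these as assumptions.)

#![feature(portable_simd)]
use std::simd::f32x8;
use std::simd::num::SimdFloat; // for reduce_sum; path varies by nightly

pub fn dot_simd(x: &[f32], y: &[f32]) -> f32 {
    let n = x.len().min(y.len());
    let (x, y) = (&x[..n], &y[..n]);
    let mut acc = f32x8::splat(0.0);
    let chunks = n / 8;
    for c in 0..chunks {
        // Explicit 8-lane multiply-accumulate; no reliance on the
        // auto-vectorizer.
        let xv = f32x8::from_slice(&x[c * 8..(c + 1) * 8]);
        let yv = f32x8::from_slice(&y[c * 8..(c + 1) * 8]);
        acc += xv * yv;
    }
    let mut sum = acc.reduce_sum();
    for i in chunks * 8..n {
        sum += x[i] * y[i]; // scalar tail
    }
    sum
}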

