Understanding Rust's Auto-Vectorization and Methods for Speeding It Up

  1. Don't use &Vec<T> parameters; use &[T] instead. It's just a pure win -- less indirection, more flexible for your callers, and doesn't actually keep you from doing anything.
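As a quick illustration of the flexibility: thanks to deref coercion, the &[T] version accepts a &Vec<T>, an array, or a sub-slice equally well (a minimal sketch; sum_all is a made-up name):

```rust
// Taking &[i32] instead of &Vec<i32> lets any kind of
// contiguous storage be passed in without conversion.
fn sum_all(values: &[i32]) -> i32 {
    values.iter().sum()
}

fn main() {
    let v = vec![1, 2, 3, 4];
    let a = [5, 6];
    assert_eq!(sum_all(&v), 10);      // &Vec<i32> coerces to &[i32]
    assert_eq!(sum_all(&a), 11);      // arrays coerce too
    assert_eq!(sum_all(&v[1..3]), 5); // and callers can pass sub-slices
}
```

The &Vec<i32> version would force the last two callers to allocate a fresh Vec first.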

  2. Use "reslicing" to make it as obvious as possible to LLVM that your array indexes are in-bounds, if you're going to use indexes into multiple slices.

For example, this emits two bounds checks every iteration, because it needs to write the correct prefix of z and panic in the correct place if x or y is actually shorter than z:

pub fn demo_slow(x: &[i32], y: &[i32], z: &mut [i32]) {
    for i in 0..z.len() {
        z[i] = x[i] * y[i];
    }
}

But if you reslice all your parameters to the length you need,

pub fn demo_fast(x: &[i32], y: &[i32], z: &mut [i32]) {
    let n = z.len();
    let (x, y, z) = (&x[..n], &y[..n], &mut z[..n]);
    for i in 0..z.len() {
        z[i] = x[i] * y[i];
    }
}

Then it'll panic up-front if they're too short, and LLVM knows the indexing will always be in-bounds, so it can unroll and vectorize the loop. On something with AVX, it looks like it does 4×8×i32 multiplications at once: https://rust.godbolt.org/z/a58vYKW5M.

  3. For tiny loop bodies, always use internal iteration instead of external iteration. (Usual link: "Rust's iterators are inefficient, and here's what we can do about it" by Veedrac on Medium.)

This iterator implementation, for example:

pub fn demo_iter(x: &[i32], y: &[i32], z: &mut [i32]) {
    let products = std::iter::zip(x, y).map(|(&x, &y)| x * y);
    std::iter::zip(z, products).for_each(|(z, p)| *z = p);
}

It optimizes to the same inner loop as demo_fast above (https://rust.godbolt.org/z/5Ez5YnY1x).

(It uses the shortest length between x, y, and z, though, rather than ever panicking, so it doesn't do exactly the same thing. Whether that's what you wanted is a requirements question.)
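For contrast, here's the external-iteration form of the same computation, where a for loop drives the iterator by hand (a sketch; demo_external is a made-up name, and whether any particular external loop optimizes worse is exactly the kind of thing the Veedrac post digs into):

```rust
pub fn demo_external(x: &[i32], y: &[i32], z: &mut [i32]) {
    // A for loop desugars into repeated Iterator::next calls
    // (external iteration); for_each instead hands the body to the
    // iterator (internal iteration), which suits tiny bodies better.
    for (z, (&x, &y)) in std::iter::zip(z, std::iter::zip(x, y)) {
        *z = x * y;
    }
}
```

Like demo_iter, this stops at the shortest of the three lengths rather than panicking.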

  4. Remember that floating-point arithmetic is non-associative.

This vectorizes great:

pub fn dot_int(x: &[i32], y: &[i32]) -> i32 {
    std::iter::zip(x, y).map(|(&x, &y)| x * y).sum()
}

But this only unrolls, without vectorizing:

pub fn dot_float(x: &[f32], y: &[f32]) -> f32 {
    std::iter::zip(x, y).map(|(&x, &y)| x * y).sum()
}

(https://rust.godbolt.org/z/538TcPvrT)

That's because (a + b) + c and a + (b + c) can give different results in floating-point, so LLVM is rightly not changing your code to do something different. Thus it's harder to get LLVM to auto-vectorize floating-point code than integer code. It's still possible -- I have an example in a SIMD FIR filter thread: unstable result and partitioning in as_simd - #9 by scottmcm -- but it'll probably be easier to find a library you like that provides the basics already optimized.
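If you'd rather stay on stable and avoid a dependency, one workaround is to write the reassociation into your source yourself, splitting the sum across several independent accumulators (a sketch, not the approach from the linked thread; LANES = 8 is an arbitrary choice, and the result can differ in the last bits from a strictly sequential sum):

```rust
pub fn dot_float_chunked(x: &[f32], y: &[f32]) -> f32 {
    const LANES: usize = 8;
    let n = x.len().min(y.len());
    let (x, y) = (&x[..n], &y[..n]);

    // Eight independent running sums: the reassociation LLVM wasn't
    // allowed to do on its own is now explicit in the source, so the
    // main loop is free to vectorize.
    let mut acc = [0.0_f32; LANES];
    let mut xs = x.chunks_exact(LANES);
    let mut ys = y.chunks_exact(LANES);
    for (xc, yc) in (&mut xs).zip(&mut ys) {
        for i in 0..LANES {
            acc[i] += xc[i] * yc[i];
        }
    }

    // Handle the leftover elements sequentially.
    let tail: f32 = std::iter::zip(xs.remainder(), ys.remainder())
        .map(|(&a, &b)| a * b)
        .sum();
    acc.iter().sum::<f32>() + tail
}
```

With the accumulators spelled out, each one can map onto a SIMD lane; godbolt is worth a check for your target, since the exact codegen depends on flags like -C target-cpu.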

(And if you're willing to use nightly to try out new stuff, check out things like f64x4 in core::simd.)
