- Don't use `&Vec<T>` parameters; use `&[T]` instead. It's just a pure win -- less indirection, more flexible for your callers, and it doesn't actually keep you from doing anything.
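As a quick illustration of that flexibility (a sketch; the `total` function is a made-up name, not from the post), a `&[i32]` parameter accepts a borrowed `Vec`, a fixed-size array, or a sub-slice, all without conversion:

```rust
// Hypothetical example: a &[T] parameter works with any contiguous source.
pub fn total(values: &[i32]) -> i32 {
    values.iter().sum()
}

fn main() {
    let v = vec![1, 2, 3, 4];
    println!("{}", total(&v));        // &Vec<i32> coerces to &[i32]
    println!("{}", total(&[4, 5]));   // arrays work too
    println!("{}", total(&v[1..]));   // so do sub-slices
}
```

A `&Vec<i32>` parameter would accept only the first of those three calls.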
- Use "reslicing" to make it as obvious as possible to LLVM that your array indexes are in-bounds, if you're going to use indexes into multiple slices.
For example, this emits two bounds checks every iteration, because it needs to write the correct prefix of `z` and panic in the correct place if `x` or `y` is actually shorter than `z`:
```rust
pub fn demo_slow(x: &[i32], y: &[i32], z: &mut [i32]) {
    for i in 0..z.len() {
        z[i] = x[i] * y[i];
    }
}
```
But if you reslice all your parameters to the length you need,

```rust
pub fn demo_fast(x: &[i32], y: &[i32], z: &mut [i32]) {
    let n = z.len();
    let (x, y, z) = (&x[..n], &y[..n], &mut z[..n]);
    for i in 0..z.len() {
        z[i] = x[i] * y[i];
    }
}
```
then it'll panic up-front if they're too short, and LLVM knows the indexing will always be in-bounds, so it can unroll and vectorize the loop. On something with AVX, it looks like it does 4×8 `i32` multiplications at once: https://rust.godbolt.org/z/a58vYKW5M.
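If reslicing feels awkward, a related trick (a sketch in the same spirit, not from the original post; `demo_assert` is a hypothetical name) is to assert the lengths up front, so the panic happens once before the loop and LLVM can prove every index is in-bounds:

```rust
// Hypothetical variant: one up-front check instead of reslicing.
// After the assert, i < z.len() implies i < x.len() and i < y.len(),
// so LLVM can elide the per-iteration bounds checks.
pub fn demo_assert(x: &[i32], y: &[i32], z: &mut [i32]) {
    assert!(x.len() >= z.len() && y.len() >= z.len());
    for i in 0..z.len() {
        z[i] = x[i] * y[i];
    }
}
```

Note the semantics differ slightly from `demo_slow`: like the reslicing version, this panics before writing anything rather than after writing a correct prefix.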
- For tiny loop bodies, always use internal iteration instead of external iteration. (Usual link: "Rust's iterators are inefficient, and here's what we can do about it" by Veedrac on Medium.)
This iterator implementation, for example:

```rust
pub fn demo_iter(x: &[i32], y: &[i32], z: &mut [i32]) {
    let products = std::iter::zip(x, y).map(|(&x, &y)| x * y);
    std::iter::zip(z, products).for_each(|(z, p)| *z = p);
}
```
optimizes to the same inner loop as `demo_fast` above: https://rust.godbolt.org/z/5Ez5YnY1x.
(It uses the shortest length among `x`, `y`, and `z`, though, rather than ever panicking, so it doesn't do exactly the same thing. Whether that's what you wanted is a requirements question.)
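If you do want the up-front panicking behavior of `demo_fast` together with internal iteration, the two tricks combine (a sketch, not from the original post; `demo_iter_strict` is a hypothetical name):

```rust
// Reslice x and y to z's length first (panicking up-front if either is
// too short), then let the iterators drive the loop.
pub fn demo_iter_strict(x: &[i32], y: &[i32], z: &mut [i32]) {
    let n = z.len();
    let (x, y) = (&x[..n], &y[..n]);
    std::iter::zip(z, std::iter::zip(x, y)).for_each(|(z, (x, y))| *z = x * y);
}
```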
- Remember that floating-point addition is non-associative.
This vectorizes great:
```rust
pub fn dot_int(x: &[i32], y: &[i32]) -> i32 {
    std::iter::zip(x, y).map(|(&x, &y)| x * y).sum()
}
```
But this only unrolls, without vectorizing:
```rust
pub fn dot_float(x: &[f32], y: &[f32]) -> f32 {
    std::iter::zip(x, y).map(|(&x, &y)| x * y).sum()
}
```
(https://rust.godbolt.org/z/538TcPvrT)
That's because `(a + b) + c` and `a + (b + c)` can give different results in floating-point, so LLVM is rightly not changing your code to do something different. Thus it's more difficult to get LLVM to auto-vectorize floating-point code than integer code. It's still possible -- I have an example in "SIMD for a FIR filter: unstable result and partitioning in as_simd - #9 by scottmcm" -- but it'll probably be easier to find a library you like that provides the basics already optimized.
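To make the non-associativity concrete (an illustrative example of my own, not from the linked threads): with `f32`, summing the same three values under the two groupings gives different answers, because `-1e20 + 1.0` rounds straight back to `-1e20`.

```rust
// (a + b) + c: the huge values cancel exactly first, so the 1.0 survives.
pub fn grouped_left(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

// a + (b + c): the 1.0 is absorbed into -1e20 by rounding, then lost.
pub fn grouped_right(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}
```

With `(a, b, c) = (1.0e20, -1.0e20, 1.0)`, the first returns `1.0` and the second returns `0.0` -- exactly the kind of observable difference LLVM refuses to introduce on your behalf.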
(And if you're willing to use nightly to try out new stuff, check out things like `f64x4` in `core::simd`.)