Rust auto-vectorisation difference?

If you're trying to get things to vectorize, you don't want to write things like this. It's possible that LLVM will do something smart, but to get the best results you want to give it more information.

Crucially, you haven't written anything in this function that says that dst.l and dst.r will be long enough. So LLVM has to carefully ensure that it never writes past the valid end, and that, if you catch the panic from the out-of-bounds indexing, exactly the elements written before the panic (and no others) have been updated in those slices.

The most direct change for this code is what I call "reslicing":

pub fn mix_mono_to_stereo_via_reslicing(
    dst: &mut StereoSample,
    src: &MonoSample,
    gain_l: f32,
    gain_r: f32,
) {
    let n = src.0.len();
    let (dst_l, dst_r, src_0) = (&mut dst.l[..n], &mut dst.r[..n], &src.0[..n]);
    for i in 0..n {
        dst_l[i] = src_0[i] * gain_l;
        dst_r[i] = src_0[i] * gain_r;
    }
}

By doing this, you've made it really clear to LLVM that dst_l, dst_r, and src_0 are all definitely exactly n items long. So either the function will panic before the loop (because the dst vectors aren't long enough) or the loop will always run to completion, taking exactly n iterations, with definitely no out-of-bounds accesses.

That makes it as easy as possible for it to actually vectorize, which it does: https://rust.godbolt.org/z/szv6njr8c
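You can observe that "panic up front or run to completion" behaviour directly. A small sketch (the struct definitions are assumptions inferred from the field names in this thread):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Assumed definitions, inferred from the field names in this thread.
pub struct MonoSample(pub Vec<f32>);
pub struct StereoSample {
    pub l: Vec<f32>,
    pub r: Vec<f32>,
}

pub fn mix_mono_to_stereo_via_reslicing(
    dst: &mut StereoSample,
    src: &MonoSample,
    gain_l: f32,
    gain_r: f32,
) {
    let n = src.0.len();
    let (dst_l, dst_r, src_0) = (&mut dst.l[..n], &mut dst.r[..n], &src.0[..n]);
    for i in 0..n {
        dst_l[i] = src_0[i] * gain_l;
        dst_r[i] = src_0[i] * gain_r;
    }
}

fn main() {
    let src = MonoSample(vec![1.0; 4]);
    // dst.r is too short, so &mut dst.r[..4] panics during the reslice,
    // before a single element has been written.
    let mut dst = StereoSample { l: vec![0.0; 4], r: vec![0.0; 2] };
    let result = catch_unwind(AssertUnwindSafe(|| {
        mix_mono_to_stereo_via_reslicing(&mut dst, &src, 1.0, 1.0);
    }));
    assert!(result.is_err());
    assert_eq!(dst.l, vec![0.0; 4]); // untouched: the panic preceded the loop
}
```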

As for checking whether it vectorized, I find it much easier to see in the LLVM IR. Conveniently, there's a section in the function labelled vector.body:

Full Block
vector.body:                                      ; preds = %vector.body, %vector.ph.new
  %index = phi i64 [ 0, %vector.ph.new ], [ %index.next.1, %vector.body ], !dbg !426
  %niter = phi i64 [ 0, %vector.ph.new ], [ %niter.next.1, %vector.body ]
  %17 = getelementptr inbounds [0 x float], [0 x float]* %_21.i.i1.i.i25, i64 0, i64 %index, !dbg !426
  %18 = bitcast float* %17 to <4 x float>*, !dbg !429
  %wide.load = load <4 x float>, <4 x float>* %18, align 4, !dbg !429, !alias.scope !431
  %19 = getelementptr inbounds float, float* %17, i64 4, !dbg !429
  %20 = bitcast float* %19 to <4 x float>*, !dbg !429
  %wide.load48 = load <4 x float>, <4 x float>* %20, align 4, !dbg !429, !alias.scope !431
  %21 = getelementptr inbounds [0 x float], [0 x float]* %_21.i.i1.i.i, i64 0, i64 %index, !dbg !426
  %22 = fmul <4 x float> %wide.load, %broadcast.splat, !dbg !434
  %23 = fmul <4 x float> %wide.load48, %broadcast.splat50, !dbg !434
  %24 = bitcast float* %21 to <4 x float>*, !dbg !434
  store <4 x float> %22, <4 x float>* %24, align 4, !dbg !434, !alias.scope !435, !noalias !437
  %25 = getelementptr inbounds float, float* %21, i64 4, !dbg !434
  %26 = bitcast float* %25 to <4 x float>*, !dbg !434
  store <4 x float> %23, <4 x float>* %26, align 4, !dbg !434, !alias.scope !435, !noalias !437
  %27 = bitcast float* %17 to <4 x float>*, !dbg !439
  %wide.load51 = load <4 x float>, <4 x float>* %27, align 4, !dbg !439, !alias.scope !431
  %28 = bitcast float* %19 to <4 x float>*, !dbg !439
  %wide.load52 = load <4 x float>, <4 x float>* %28, align 4, !dbg !439, !alias.scope !431
  %29 = getelementptr inbounds [0 x float], [0 x float]* %_21.i.i1.i.i22, i64 0, i64 %index, !dbg !426
  %30 = fmul <4 x float> %wide.load51, %broadcast.splat54, !dbg !440
  %31 = fmul <4 x float> %wide.load52, %broadcast.splat56, !dbg !440
  %32 = bitcast float* %29 to <4 x float>*, !dbg !440
  store <4 x float> %30, <4 x float>* %32, align 4, !dbg !440, !alias.scope !441, !noalias !431
  %33 = getelementptr inbounds float, float* %29, i64 4, !dbg !440
  %34 = bitcast float* %33 to <4 x float>*, !dbg !440
  store <4 x float> %31, <4 x float>* %34, align 4, !dbg !440, !alias.scope !441, !noalias !431
  %index.next = or i64 %index, 8, !dbg !426
  %35 = getelementptr inbounds [0 x float], [0 x float]* %_21.i.i1.i.i25, i64 0, i64 %index.next, !dbg !426
  %36 = bitcast float* %35 to <4 x float>*, !dbg !429
  %wide.load.1 = load <4 x float>, <4 x float>* %36, align 4, !dbg !429, !alias.scope !431
  %37 = getelementptr inbounds float, float* %35, i64 4, !dbg !429
  %38 = bitcast float* %37 to <4 x float>*, !dbg !429
  %wide.load48.1 = load <4 x float>, <4 x float>* %38, align 4, !dbg !429, !alias.scope !431
  %39 = getelementptr inbounds [0 x float], [0 x float]* %_21.i.i1.i.i, i64 0, i64 %index.next, !dbg !426
  %40 = fmul <4 x float> %wide.load.1, %broadcast.splat, !dbg !434
  %41 = fmul <4 x float> %wide.load48.1, %broadcast.splat50, !dbg !434
  %42 = bitcast float* %39 to <4 x float>*, !dbg !434
  store <4 x float> %40, <4 x float>* %42, align 4, !dbg !434, !alias.scope !435, !noalias !437
  %43 = getelementptr inbounds float, float* %39, i64 4, !dbg !434
  %44 = bitcast float* %43 to <4 x float>*, !dbg !434
  store <4 x float> %41, <4 x float>* %44, align 4, !dbg !434, !alias.scope !435, !noalias !437
  %45 = bitcast float* %35 to <4 x float>*, !dbg !439
  %wide.load51.1 = load <4 x float>, <4 x float>* %45, align 4, !dbg !439, !alias.scope !431
  %46 = bitcast float* %37 to <4 x float>*, !dbg !439
  %wide.load52.1 = load <4 x float>, <4 x float>* %46, align 4, !dbg !439, !alias.scope !431
  %47 = getelementptr inbounds [0 x float], [0 x float]* %_21.i.i1.i.i22, i64 0, i64 %index.next, !dbg !426
  %48 = fmul <4 x float> %wide.load51.1, %broadcast.splat54, !dbg !440
  %49 = fmul <4 x float> %wide.load52.1, %broadcast.splat56, !dbg !440
  %50 = bitcast float* %47 to <4 x float>*, !dbg !440
  store <4 x float> %48, <4 x float>* %50, align 4, !dbg !440, !alias.scope !441, !noalias !431
  %51 = getelementptr inbounds float, float* %47, i64 4, !dbg !440
  %52 = bitcast float* %51 to <4 x float>*, !dbg !440
  store <4 x float> %49, <4 x float>* %52, align 4, !dbg !440, !alias.scope !441, !noalias !431
  %index.next.1 = add nuw i64 %index, 16, !dbg !426
  %niter.next.1 = add i64 %niter, 2, !dbg !426
  %niter.ncmp.1 = icmp eq i64 %niter.next.1, %unroll_iter, !dbg !426
  br i1 %niter.ncmp.1, label %middle.block.unr-lcssa, label %vector.body, !dbg !426, !llvm.loop !442

In which you'll find

  • vector loads (load <4 x float>), to read multiple things at once
  • vector floating-point multiplications (fmul <4 x float>), to apply the gain to multiple things at once
  • and vector stores (store <4 x float>), to write the multiple results at once

In assembly those are somewhat less obvious, especially because on x64 it's common to use the SIMD registers even for scalar floating point -- mulps (packed) vs mulss (scalar) doesn't jump off the page the way the <4 x float> vectors do in the LLVM IR.

But the compiler's pretty smart about slices and iterators, so you might also try writing it like this:

pub fn mix_mono_to_stereo_via_zip(
    dst: &mut StereoSample,
    src: &MonoSample,
    gain_l: f32,
    gain_r: f32,
) {
    for ((dst_l, dst_r), src) in std::iter::zip(&mut dst.l, &mut dst.r).zip(&src.0) {
        *dst_l = src * gain_l;
        *dst_r = src * gain_r;
    }
}

That also (same godbolt link) seems to vectorize quite well. Semantically it's a bit different, though, since it stops at whichever of the vectors is shortest, rather than panicking if the destination isn't long enough.
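To make that stop-at-the-shortest behaviour concrete, here's a sketch (the struct definitions are assumptions inferred from the field names in this thread):

```rust
// Assumed definitions, inferred from the field names in this thread.
pub struct MonoSample(pub Vec<f32>);
pub struct StereoSample {
    pub l: Vec<f32>,
    pub r: Vec<f32>,
}

pub fn mix_mono_to_stereo_via_zip(
    dst: &mut StereoSample,
    src: &MonoSample,
    gain_l: f32,
    gain_r: f32,
) {
    for ((dst_l, dst_r), src) in std::iter::zip(&mut dst.l, &mut dst.r).zip(&src.0) {
        *dst_l = src * gain_l;
        *dst_r = src * gain_r;
    }
}

fn main() {
    let src = MonoSample(vec![1.0, 2.0, 3.0, 4.0]);
    // dst.r is the shortest of the three, so the zip stops after 2 items:
    // no panic, and the extra elements of dst.l are simply left untouched.
    let mut dst = StereoSample { l: vec![0.0; 4], r: vec![0.0; 2] };
    mix_mono_to_stereo_via_zip(&mut dst, &src, 1.0, 10.0);
    assert_eq!(dst.l, vec![1.0, 2.0, 0.0, 0.0]);
    assert_eq!(dst.r, vec![10.0, 20.0]);
}
```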
