SIMD for a FIR filter: unstable result and partitioning in as_simd

Hi,

I'm trying to use the portable_simd feature for a FIR filter. I have a test that sums up the resulting 4096 f32s, and the result is no longer exactly the same as in the serial case. These are really small values being summed, so maybe that is to be expected, since the order of summation changes.

The filter essentially computes the dot product between a slice of coefficients and a series of values. The slices are always the same length; the values are zero-padded when initialized.

I'm unfamiliar with SIMD and was wondering about some things:

  • is the partitioning into prefix, middle and suffix always the same for two slices of equal length? if so, that may be missing from the docs.
  • is the partitioning determined at compile-time or run-time?

Below is the code, with the original code commented out. Please also let me know if you spot anything else weird that I might have misunderstood. It's slightly complicated by the use of a heapless::Deque, since this is an embedded device accumulating samples:

    fn value(&self) -> f32 {
        // Convolve filter with samples.
        // self.samples
        //     .iter()
        //     .zip(&COEFFS)
        //     .fold(0.0, |a, (s, c)| a + (s * c))
        // let (_, coeffs, _) = COEFFS.as_simd();
        // debug_assert_eq!(coeffs.len() * 4, COEFFS.len());

        let (f, b) = self.samples.as_slices();
        let (cf, cb) = COEFFS.split_at(f.len());

        assert_eq!(f.len(), cf.len());
        assert_eq!(b.len(), cb.len());

        // First half of the deque
        let (p, m, s) = f.as_simd::<4>();
        let (cp, cm, cs) = cf.as_simd::<4>();

        let sp = p.iter().zip(cp).fold(0.0, |a, (s, c)| a + (s * c));
        let ss = s.iter().zip(cs).fold(0.0, |a, (s, c)| a + (s * c));

        let fsums = f32x4::from_array([sp, 0.0, 0.0, ss]);
        let fsums = m.iter().zip(cm).fold(fsums, |a, (s, c)| a + (s * c));

        // Second half of the deque
        let (p, m, s) = b.as_simd::<4>();
        let (cp, cm, cs) = cb.as_simd::<4>();

        let sp = p.iter().zip(cp).fold(0.0, |a, (s, c)| a + (s * c));
        let ss = s.iter().zip(cs).fold(0.0, |a, (s, c)| a + (s * c));

        let bsums = f32x4::from_array([sp, 0.0, 0.0, ss]);
        let bsums = m.iter().zip(cm).fold(bsums, |a, (s, c)| a + (s * c));

        (fsums + bsums).reduce_sum()
    }

No, definitely not. It depends on the alignment. Here's a quick counterexample: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=24ecfe6d97c86b534efec02cc92edaac
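For the record, the effect can also be shown on stable with align_to(), which is what as_simd() wraps. This is a sketch, not the linked playground verbatim; it assumes the std implementation of align_to returns the maximal middle (it does in practice, though the docs don't guarantee it):

```rust
fn main() {
    // Two equal-length f32 windows taken at different offsets into the
    // same buffer: the prefix/middle/suffix split depends on the slice's
    // starting address, not on its length.
    let buf = [0.0f32; 12];
    let a = &buf[0..8];
    let b = &buf[1..9];

    // align_to::<u128>() asks for a 16-byte-aligned middle, which is
    // what as_simd::<4>() needs for f32x4.
    let (ap, am, asx) = unsafe { a.align_to::<u128>() };
    let (bp, bm, bsx) = unsafe { b.align_to::<u128>() };

    println!("a: {} + {}x4 + {}", ap.len(), am.len(), asx.len());
    println!("b: {} + {}x4 + {}", bp.len(), bm.len(), bsx.len());

    // b starts one element (4 bytes) after a, so with the maximal-middle
    // implementation the prefix lengths always differ.
    assert_ne!(ap.len(), bp.len());
}
```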

If you want something you can zip together, consider as_chunks instead, which always puts the remainder at the end, but you'll need Simd::from_array to load the chunks, since they might not be aligned.


I would say no. As the documentation says, as_simd() is merely a safe wrapper around align_to(), which makes basically zero guarantees about the result.

This is kind of an ambiguous question: the method is const-generic, so there is inevitably a compile-time part (after all, for SIMD to work you need chunks of a specific, known size), but since the slice can be of varying length and alignment, there is also inevitably a run-time step.


By the way, did you check whether your original code vectorizes? My experience with convolutions is that the proper ordering of iterator combinators results in vectorization even without explicit SIMD APIs, and is of course much more readable.


Thanks, that clears it up! I guess that would require copying the chunks into the Simd for each iteration. Maybe that is not really an issue, and the chunks would have to be appropriately aligned for anything else to work.

Remember that floating-point addition isn't associative, so it's completely expected that you won't get exactly the same answer. That's also why your original code probably isn't auto-vectorizing: vectorizing changes the order of the summation and thus isn't a legal optimization.

Trivial example: https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=0e608727b20332f446a623ecec010d5e

[src/main.rs:3] (x + 1.0) + 1.0 = 8388609.0
[src/main.rs:4] x + (1.0 + 1.0) = 8388610.0
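For reference, the linked snippet is presumably along these lines; the exact value of x is an assumption, chosen to be consistent with the printed output (just below 2^23, where the spacing between adjacent f32 values is 0.5, so adding 1.0 twice rounds differently than adding 2.0 once):

```rust
fn main() {
    let x: f32 = 8388607.5; // just below 2^23; f32 spacing here is 0.5
    dbg!((x + 1.0) + 1.0); // 8388608.5 rounds to even (8388608.0), then + 1.0
    dbg!(x + (1.0 + 1.0)); // 8388609.5 rounds to even (8388610.0)
}
```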

I'm guessing this is dependent on alignment. Maybe I can align the original data.

No, but SIMD sped it up almost 4x. I guess @scottmcm's answer covers this. I always find it hard to know whether the optimizations happened, or whether some trivial thing I did missed the optimization.

That is really instructive. Maybe I'll try to align the data and coefficients on startup, e.g. store [[f32; 4]; NTAP] and cast to &[f32]. I probably won't be able to use the Deque then. Maybe the first thing to test is using chunks and Simd::from_array.
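A minimal sketch of the aligned-storage idea (the NTAP constant and the wrapper type are hypothetical names; it also assumes the std implementation of align_to returns the maximal middle, which holds in practice but isn't guaranteed by the docs):

```rust
// Hypothetical filter length: NTAP chunks of 4 f32s.
const NTAP: usize = 16;

// Force the backing storage to 16-byte alignment so that an
// align_to / as_simd style split leaves empty prefix and suffix.
#[repr(align(16))]
struct Aligned([f32; NTAP * 4]);

fn main() {
    let coeffs = Aligned([0.25; NTAP * 4]);
    // With 16-byte alignment and a multiple-of-4 length, the whole
    // slice lands in the middle part.
    let (prefix, middle, suffix) = unsafe { coeffs.0.align_to::<u128>() };
    assert!(prefix.is_empty() && suffix.is_empty());
    assert_eq!(middle.len(), NTAP);
}
```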

Yeah, if you know the size, you can save yourself a bunch of work by just dealing in an array of the correct size.

For example, it might just be

pub fn dot<const N: usize>(x: &[f32; N], y: &[f32; N]) -> f32 {
    let (x, x_tail) = x.as_chunks::<4>();
    let (y, y_tail) = y.as_chunks::<4>();

    assert!(x_tail.is_empty() && y_tail.is_empty(), "N must be a multiple of 4");

    let mut sums = f32x4::splat(0.0);
    for (x, y) in std::iter::zip(x, y) {
        sums += f32x4::from_array(*x) * f32x4::from_array(*y);
    }
    
    sums.reduce_sum()
}

Which gives exactly the instructions you'd probably expect: https://rust.godbolt.org/z/38jj14YPh

Remember that you have to do that anyway: you don't have the values in SIMD registers, so something has to put them there. Thus, as my example shows, the from_array calls don't cost anything.

(Well, if you just have a slice then it's movups instead of movaps, but on modern chips that doesn't matter. You could have your buffer be [f32x4; 1024] if you really wanted to have aligned loads -- I think stdsimd allows you to unsafely-but-soundly address that as &mut [f32; 4096].)


Oh, actually, to @H2CO3's previous point, it turns out you can write that without mentioning simd at all and LLVM will notice what you're doing and emit the same thing.

You just need to make it really obvious what order you want it to do things:

pub fn dot<const N: usize>(x: &[f32; N], y: &[f32; N]) -> f32 {
    let (x, x_tail) = x.as_chunks::<4>();
    let (y, y_tail) = y.as_chunks::<4>();

    assert!(x_tail.is_empty() && y_tail.is_empty(), "N must be a multiple of 4");

    let mut sums = [0.0; 4];
    for (x, y) in std::iter::zip(x, y) {
        let [x0, x1, x2, x3] = *x;
        let [y0, y1, y2, y3] = *y;
        let [p0, p1, p2, p3] = [x0 * y0, x1 * y1, x2 * y2, x3 * y3];
        sums[0] += p0;
        sums[1] += p1;
        sums[2] += p2;
        sums[3] += p3;
    }
    
    (sums[0] + sums[1]) + (sums[2] + sums[3])
}

And you get the same vectorized-and-unrolled inner loop: https://rust.godbolt.org/z/7aen7ofce

bb10.i:                                           ; preds = %bb10.i, %start
  %iter.sroa.8.013.i = phi i64 [ 0, %start ], [ %8, %bb10.i ]
  %0 = phi <4 x float> [ zeroinitializer, %start ], [ %14, %bb10.i ]
  %1 = or i64 %iter.sroa.8.013.i, 1, !dbg !36
  %2 = getelementptr inbounds [4 x float], ptr %x, i64 %iter.sroa.8.013.i, !dbg !38
  %3 = getelementptr inbounds [4 x float], ptr %y, i64 %iter.sroa.8.013.i, !dbg !58
  %4 = load <4 x float>, ptr %2, align 4, !dbg !62, !alias.scope !11, !noalias !15
  %5 = load <4 x float>, ptr %3, align 4, !dbg !64, !alias.scope !15, !noalias !11
  %6 = fmul <4 x float> %4, %5, !dbg !66
  %7 = fadd <4 x float> %0, %6, !dbg !68
  %8 = add nuw nsw i64 %iter.sroa.8.013.i, 2, !dbg !36
  %9 = getelementptr inbounds [4 x float], ptr %x, i64 %1, !dbg !38
  %10 = getelementptr inbounds [4 x float], ptr %y, i64 %1, !dbg !58
  %11 = load <4 x float>, ptr %9, align 4, !dbg !62, !alias.scope !11, !noalias !15
  %12 = load <4 x float>, ptr %10, align 4, !dbg !64, !alias.scope !15, !noalias !11
  %13 = fmul <4 x float> %11, %12, !dbg !66
  %14 = fadd <4 x float> %7, %13, !dbg !68
  %exitcond.not.i.1 = icmp eq i64 %8, 1024, !dbg !17
  br i1 %exitcond.not.i.1, label %example::dot.exit, label %bb10.i, !dbg !17

Thanks a lot. I'm not sure which is cleaner. Am I right that the reason this isn't done automatically, without the manual splitting, is the invalid-optimization/numerical-inaccuracy issue? Ultimately this is going onto a thumbv7em-none-eabihf device. A fair amount of time is spent on filtering, so this is likely going to affect power usage (more time spent in asm::wfi).

I split the slices of the deque using as_simd, then use array_chunks on COEFFS and copy each chunk with f32x4::from_array. This ensures that the split between prefix, middle and suffix matches. It performs pretty much as well as the faulty original SIMD code, and 4x as fast as the serial code. And I get to keep my Deque, which is pretty good since this is a rolling window of samples coming in from a sensor.

I also tried just iterating over array chunks, which was faster, but only by maybe 1.8x. Presumably because the iterator over the deque is not contiguous.

        debug_assert_eq!(self.samples.len() % 4, 0);
        debug_assert_eq!(COEFFS.len() % 4, 0);
        debug_assert_eq!(COEFFS.len(), self.samples.len());

        let (f, b) = self.samples.as_slices();
        let (cf, cb) = COEFFS.split_at(f.len());

        debug_assert_eq!(f.len(), cf.len());
        debug_assert_eq!(b.len(), cb.len());

        // First half of the deque
        let (p, m, s) = f.as_simd::<4>();
        let me = p.len() + m.len() * 4;
        let cp = &cf[..p.len()];
        let cm = cf[p.len()..me].array_chunks();
        let cs = &cf[me..];

        debug_assert_eq!(cp.len(), p.len());
        debug_assert_eq!(cm.len(), m.len());
        debug_assert_eq!(cs.len(), s.len());

        let sp = p.iter().zip(cp).fold(0.0, |a, (s, c)| a + (s * c));
        let ss = s.iter().zip(cs).fold(0.0, |a, (s, c)| a + (s * c));

        let fsums = f32x4::from_array([sp, 0.0, 0.0, ss]);
        let fsums = m
            .iter()
            .zip(cm)
            .fold(fsums, |a, (s, c)| a + (s * f32x4::from_array(*c)));

        // Second half of the deque
        let (p, m, s) = b.as_simd::<4>();
        let me = p.len() + m.len() * 4;
        let cp = &cb[..p.len()];
        let cm = cb[p.len()..me].array_chunks();
        let cs = &cb[me..];
        debug_assert_eq!(cp.len(), p.len());
        debug_assert_eq!(cm.len(), m.len());
        debug_assert_eq!(cs.len(), s.len());

        let sp = p.iter().zip(cp).fold(0.0, |a, (s, c)| a + (s * c));
        let ss = s.iter().zip(cs).fold(0.0, |a, (s, c)| a + (s * c));

        let bsums = f32x4::from_array([sp, 0.0, 0.0, ss]);
        let bsums = m
            .iter()
            .zip(cm)
            .fold(bsums, |a, (s, c)| a + (s * f32x4::from_array(*c)));

        (fsums + bsums).reduce_sum()