Improving performance of slice 2d multiplication (relative to array equivalent)

I have a performance-sensitive project where profiling has revealed the clear bottleneck to be a function similar to this (called on the order of billions of times):

fn multiply_assign_into_array<const N: usize, const M: usize>(
    a: &[[f64; N]; M],
    x: &[f32; M],
    y: &[f32; N],
    into: &mut [[f64; N]; M],
) {
    for j in 0..M {
        for i in 0..N {
            into[j][i] = a[j][i] * x[j] as f64 * y[i] as f64;
        }
    }
}

Unfortunately, I do not know N and M at compile time, and so I have to resort to something like the following:

fn multiply_assign_into_slice(a: &[f64], x: &[f32], y: &[f32], into: &mut [f64]) {
    let n = y.len();

    for (i, x) in x.iter().enumerate() {
        for (j, y) in y.iter().enumerate() {
            into[j + i * n] = a[j + i * n] * *x as f64 * *y as f64;
        }
    }
}

Playground.

My benchmarks show that this is on the order of 5-6 times slower (~15ns vs ~90ns). If my understanding is correct (?), the memory layout of a, x, y, and into should be the same in both cases, so I'm thinking the performance gap comes down to a combination of 1) the advantage of having the array on the stack rather than the heap, and/or 2) compiler optimisations like bound checks and vectorisation). This led me to hope that I could narrow the performance gap by helping the compiler realise what's going on, and I have tried various things (rewriting with iterators, tricks like let a = &a[0..x.len() * y.len()];, etc.), but nothing really seems to help. Unsafe unchecked indexing does help a little, but not much.

My question, therefore, is whether there is any way to achieve a performance on the slice version of the function closer to the array version, or is this is hoping for too much? I'd accept unsafe code here, if that's required, and test all the more thoroughly in that case.

I should also say that I have tried various attempts using the ndarray crate, which seems like an obvious candidate here, but all had worse performance than the slice version.

If unsafe indexing doesn't help, it was probably making optimizations based on the actual value of N and M, whereas with the slice, it has to emit an actual multiply instruction. You can often do better if you know what you are multiplying by at compile time.

In other words, if I understand you correctly, the compile time constants enable a class of compiler optimisations that are simply not possible to induce on the slice version?

Yes. For example, if N is 1024, multiplying by it can be done with a left-shift, which is a lot cheaper than a multiply instruction.

If you know that N often takes values from a small set, you could have a hard-coded version of the function for each such N, and call the appropriate one when it matches one of them.

Aha, I hadn't thought about that. Makes sense though.

Interesting suggestion to try and match on the values of N and M and forward to array functions. I probably have ~20 values of N and M that I expect to see (i.e. ~400 combinations), but I suppose that's nothing a declarative macro can't fix. I guess the branch predictor should be able to efficiently deal with the repeated match?

It seems like your code only multiplies with N, not M, so you only really need to match on one of the two. Furthermore, it seems to me that you should be able to use the new const-generics feature for it.

fn multiply_assign_into_slice2<const N: usize>(
    a: &[[f64; N]],
    x: &[f32],
    y: &[f32; N],
    into: &mut [[f64; N]]
) {
    for (i, x) in x.iter().enumerate() {
        for (j, y) in y.iter().enumerate() {
            into[j][i] = a[j][i] * *x as f64 * *y as f64;
        }
    }
}

You can call it with multiply_assign_into_slice2::<4>(...).

1 Like

Right, good point, matching on just a single const generic does make it more manageable. I quickly benchmarked the following version of that idea, which is a little bit slower than the full array version, but still ~4 times faster than the original (and probably with scope for more improvements):

fn multiply_assign_into_slice2<const N: usize>(
    a: &[[f64; N]],
    x: &[f32],
    y: &[f32; N],
    into: &mut [[f64; N]]
) {
    for (j, x) in x.iter().enumerate() {
        for i in 0..N {
            into[j][i] = a[j][i] * *x as f64 * y[i] as f64;
        }
    }
}

I guess it'll also require a match and some unsafe transmuting slices to slices of arrays, but it should still be significantly faster when all is said and done.

Thanks a lot for your input!

Yeah, I don't think the standard library lets you do that transmute directly, although there are probably crates that can do it. Anyway, this would be safe:

#[inline]
fn as_matrix<const N: usize>(m: &[f64]) -> &[[f64; N]] {
    assert!(m.len() % N == 0);
    let len = m.len() / N;
    let ptr = m.as_ptr().cast::<[f64; N]>();
    
    unsafe {
        std::slice::from_raw_parts(ptr, len)
    }
}
1 Like

There's the unstable as_chunks for that conversion.

3 Likes

How about something like:

pub fn multiply_assign_into_slice(a: &[f64], x: &[f32], y: &[f32], into: &mut [f64]) {
    let n = y.len();
    let m = x.len();
    
    assert_eq!(a.len(), n * m);
    assert_eq!(into.len(), n * m);
    
    into
        .chunks_mut(n)
        .zip(a.chunks(n))
        .zip(x)
        .for_each(|((into, a), &x)| {
            into
                .iter_mut()
                .zip(a)
                .zip(y)
                .for_each(|((into, &a), &y)| {
                    *into = a * x as f64 * y as f64;
                })
        })
}

Thank you! From a quick benchmark (with n = 9 and m = 10), this is possibly a hair faster than the original, but within the margin of variation I get on a criterion benchmark, and a bit slower than unchecked indexing in a loop. I would've also assumed that something like this would be faster, but I've had the same result with my forays into iterator-based solutions.

Generally, the advantage you get by iterators is that there are less bounds checks, so I would be surprised if it beat the unchecked indexing solution.

I was under the impression that the compiler is also sometimes better at vectorisation with iterators? At least, that was why I also had the instinct to try iterators.

I may be wrong, but I don't think that's the case. As I understand it, the main challenge against vectorization is that you can't vectorize something with bounds-checks, so anything without the bounds-checks should vectorize better.

1 Like

You can try to vectorize by hand and use SIMD, or just use faster. It may be faster by a factor of up to 4 (or 8 if your CPU has AVX-512 feature, check that in /proc/cpuinfo).