Improving performance of slice 2d multiplication (relative to array equivalent)

MSR · April 11, 2021, 1:49pm

I have a performance-sensitive project where profiling has revealed the clear bottleneck to be a function similar to this (called on the order of billions of times):

fn multiply_assign_into_array<const N: usize, const M: usize>(
    a: &[[f64; N]; M],
    x: &[f32; M],
    y: &[f32; N],
    into: &mut [[f64; N]; M],
) {
    for j in 0..M {
        for i in 0..N {
            into[j][i] = a[j][i] * x[j] as f64 * y[i] as f64;
        }
    }
}

Unfortunately, I do not know N and M at compile time, and so I have to resort to something like the following:

fn multiply_assign_into_slice(a: &[f64], x: &[f32], y: &[f32], into: &mut [f64]) {
    let n = y.len();

    for (i, x) in x.iter().enumerate() {
        for (j, y) in y.iter().enumerate() {
            into[j + i * n] = a[j + i * n] * *x as f64 * *y as f64;
        }
    }
}

Playground.

My benchmarks show that this is on the order of 5-6 times slower (~15ns vs ~90ns). If my understanding is correct (?), the memory layout of a, x, y, and into should be the same in both cases, so I'm thinking the performance gap comes down to a combination of 1) the advantage of having the array on the stack rather than the heap, and/or 2) compiler optimisations like bound checks and vectorisation). This led me to hope that I could narrow the performance gap by helping the compiler realise what's going on, and I have tried various things (rewriting with iterators, tricks like let a = &a[0..x.len() * y.len()];, etc.), but nothing really seems to help. Unsafe unchecked indexing does help a little, but not much.

My question, therefore, is whether there is any way to achieve a performance on the slice version of the function closer to the array version, or is this is hoping for too much? I'd accept unsafe code here, if that's required, and test all the more thoroughly in that case.

I should also say that I have tried various attempts using the ndarray crate, which seems like an obvious candidate here, but all had worse performance than the slice version.

alice · April 11, 2021, 1:57pm

If unsafe indexing doesn't help, it was probably making optimizations based on the actual value of N and M, whereas with the slice, it has to emit an actual multiply instruction. You can often do better if you know what you are multiplying by at compile time.

MSR · April 11, 2021, 2:05pm

In other words, if I understand you correctly, the compile time constants enable a class of compiler optimisations that are simply not possible to induce on the slice version?

alice · April 11, 2021, 2:06pm

Yes. For example, if N is 1024, multiplying by it can be done with a left-shift, which is a lot cheaper than a multiply instruction.

If you know that N often takes values from a small set, you could have a hard-coded version of the function for each such N, and call the appropriate one when it matches one of them.

MSR · April 11, 2021, 2:18pm

Aha, I hadn't thought about that. Makes sense though.

Interesting suggestion to try and match on the values of N and M and forward to array functions. I probably have ~20 values of N and M that I expect to see (i.e. ~400 combinations), but I suppose that's nothing a declarative macro can't fix. I guess the branch predictor should be able to efficiently deal with the repeated match?

alice · April 11, 2021, 2:24pm

It seems like your code only multiplies with N, not M, so you only really need to match on one of the two. Furthermore, it seems to me that you should be able to use the new const-generics feature for it.

fn multiply_assign_into_slice2<const N: usize>(
    a: &[[f64; N]],
    x: &[f32],
    y: &[f32; N],
    into: &mut [[f64; N]]
) {
    for (i, x) in x.iter().enumerate() {
        for (j, y) in y.iter().enumerate() {
            into[j][i] = a[j][i] * *x as f64 * *y as f64;
        }
    }
}

You can call it with multiply_assign_into_slice2::<4>(...).

MSR · April 11, 2021, 2:43pm

Right, good point, matching on just a single const generic does make it more manageable. I quickly benchmarked the following version of that idea, which is a little bit slower than the full array version, but still ~4 times faster than the original (and probably with scope for more improvements):

fn multiply_assign_into_slice2<const N: usize>(
    a: &[[f64; N]],
    x: &[f32],
    y: &[f32; N],
    into: &mut [[f64; N]]
) {
    for (j, x) in x.iter().enumerate() {
        for i in 0..N {
            into[j][i] = a[j][i] * *x as f64 * y[i] as f64;
        }
    }
}

I guess it'll also require a match and some unsafe transmuting slices to slices of arrays, but it should still be significantly faster when all is said and done.

Thanks a lot for your input!

alice · April 11, 2021, 3:06pm

Yeah, I don't think the standard library lets you do that transmute directly, although there are probably crates that can do it. Anyway, this would be safe:

#[inline]
fn as_matrix<const N: usize>(m: &[f64]) -> &[[f64; N]] {
    assert!(m.len() % N == 0);
    let len = m.len() / N;
    let ptr = m.as_ptr().cast::<[f64; N]>();
    
    unsafe {
        std::slice::from_raw_parts(ptr, len)
    }
}

cuviper · April 11, 2021, 3:15pm

There's the unstable as_chunks for that conversion.

SkiFire13 · April 11, 2021, 3:23pm

How about something like:

pub fn multiply_assign_into_slice(a: &[f64], x: &[f32], y: &[f32], into: &mut [f64]) {
    let n = y.len();
    let m = x.len();
    
    assert_eq!(a.len(), n * m);
    assert_eq!(into.len(), n * m);
    
    into
        .chunks_mut(n)
        .zip(a.chunks(n))
        .zip(x)
        .for_each(|((into, a), &x)| {
            into
                .iter_mut()
                .zip(a)
                .zip(y)
                .for_each(|((into, &a), &y)| {
                    *into = a * x as f64 * y as f64;
                })
        })
}

MSR · April 11, 2021, 3:36pm

Thank you! From a quick benchmark (with n = 9 and m = 10), this is possibly a hair faster than the original, but within the margin of variation I get on a criterion benchmark, and a bit slower than unchecked indexing in a loop. I would've also assumed that something like this would be faster, but I've had the same result with my forays into iterator-based solutions.

alice · April 11, 2021, 3:37pm

Generally, the advantage you get by iterators is that there are less bounds checks, so I would be surprised if it beat the unchecked indexing solution.

MSR · April 11, 2021, 3:39pm

I was under the impression that the compiler is also sometimes better at vectorisation with iterators? At least, that was why I also had the instinct to try iterators.

alice · April 11, 2021, 3:40pm

I may be wrong, but I don't think that's the case. As I understand it, the main challenge against vectorization is that you can't vectorize something with bounds-checks, so anything without the bounds-checks should vectorize better.

tuxmain · April 11, 2021, 8:02pm

You can try to vectorize by hand and use SIMD, or just use faster. It may be faster by a factor of up to 4 (or 8 if your CPU has AVX-512 feature, check that in /proc/cpuinfo).

system · July 10, 2021, 8:02pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Working with arrays/vecs in functions help	8	1849	December 3, 2019
How can i speed up the div?	24	749	April 6, 2023
Best way to create Matrix of Matrices help	34	1817	November 30, 2020
Optimize array sums	27	1742	January 10, 2021
Ndarray arithmetic between rows help	10	411	September 25, 2024

Improving performance of slice 2d multiplication (relative to array equivalent)

Related topics