I have a performance-sensitive project where profiling has revealed the clear bottleneck to be a function similar to this (called on the order of billions of times):
```rust
fn multiply_assign_into_array<const N: usize, const M: usize>(
    a: &[[f64; N]; M],
    x: &[f32; M],
    y: &[f32; N],
    into: &mut [[f64; N]; M],
) {
    for j in 0..M {
        for i in 0..N {
            into[j][i] = a[j][i] * x[j] as f64 * y[i] as f64;
        }
    }
}
```
Unfortunately, I do not know `N` and `M` at compile time, and so I have to resort to something like the following:
```rust
fn multiply_assign_into_slice(a: &[f64], x: &[f32], y: &[f32], into: &mut [f64]) {
    let n = y.len();
    for (i, x) in x.iter().enumerate() {
        for (j, y) in y.iter().enumerate() {
            into[j + i * n] = a[j + i * n] * *x as f64 * *y as f64;
        }
    }
}
```
My benchmarks show that this is on the order of 5-6 times slower (~15 ns vs ~90 ns). If my understanding is correct, the memory layout of `a`, `x`, `y`, and `into` should be the same in both cases, so I'm thinking the performance gap comes down to some combination of 1) the advantage of having the arrays on the stack rather than the heap, and/or 2) compiler optimisations such as bounds-check elimination and vectorisation. This led me to hope that I could narrow the gap by helping the compiler realise what's going on. I have tried various things (rewriting with iterators, tricks like `let a = &a[0..x.len() * y.len()];`, etc.), but nothing really seems to help. Unsafe unchecked indexing does help a little, but not much.
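For concreteness, the kind of iterator rewrite I mean uses `chunks_exact`/`chunks_exact_mut`, so that the compiler can see every row has exactly `n` elements and could, in principle, drop the inner bounds checks (the function name is just for illustration):

```rust
// Iterator-based rewrite: iterate over rows of `a` and `into` in lockstep
// with the elements of `x`, then over row elements in lockstep with `y`.
// `chunks_exact` guarantees each chunk has length `n`, which is the hint
// I hoped would let the compiler elide bounds checks and vectorise.
fn multiply_assign_chunked(a: &[f64], x: &[f32], y: &[f32], into: &mut [f64]) {
    let n = y.len();
    let rows = a.chunks_exact(n).zip(into.chunks_exact_mut(n)).zip(x);
    for ((a_row, into_row), &xi) in rows {
        let xi = xi as f64;
        for ((o, &aij), &yj) in into_row.iter_mut().zip(a_row).zip(y) {
            *o = aij * xi * yj as f64;
        }
    }
}
```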
My question, therefore, is whether there is any way to bring the performance of the slice version closer to the array version, or is this hoping for too much? I'd accept `unsafe` code here if that's required, and would test all the more thoroughly in that case.
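For reference, the unchecked-indexing variant I mean is along these lines (again, the name is a placeholder; the asserts up front document the layout invariant the unsafe blocks rely on):

```rust
// Same computation, but with `get_unchecked` in the hot loop. The asserts
// establish the invariant a.len() == into.len() == x.len() * y.len(),
// which makes the unchecked accesses below sound.
fn multiply_assign_unchecked(a: &[f64], x: &[f32], y: &[f32], into: &mut [f64]) {
    let n = y.len();
    assert_eq!(a.len(), x.len() * n);
    assert_eq!(into.len(), x.len() * n);
    for i in 0..x.len() {
        // SAFETY: i < x.len(), and j + i * n < x.len() * n by the asserts.
        let xi = unsafe { *x.get_unchecked(i) } as f64;
        for j in 0..n {
            unsafe {
                *into.get_unchecked_mut(j + i * n) =
                    *a.get_unchecked(j + i * n) * xi * *y.get_unchecked(j) as f64;
            }
        }
    }
}
```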
I should also say that I have made various attempts using the `ndarray` crate, which seems like an obvious candidate here, but all of them had worse performance than the slice version.