I went on a similar exploration trying to improve dot product some time ago. I found that rustc isn't great at unrolling loops, so manual unrolling gives a significant speed boost:
use std::cmp;

// I'd love it if this were the fastest version :(
fn naive_dot_product(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y.iter())
        .fold(0.0, |sum, (&ex, &ey)| sum + (ex * ey))
}

// The method you describe.
fn index_dot_product(x: &[f64], y: &[f64]) -> f64 {
    let n = cmp::min(x.len(), y.len());
    let (x, y) = (&x[..n], &y[..n]);
    let mut sum = 0.0;
    for i in 0..n {
        sum += x[i] * y[i];
    }
    sum
}

// Shift the slices in place and handle 8 elements at a time.
fn unrolled_dot_product(x: &[f64], y: &[f64]) -> f64 {
    let n = cmp::min(x.len(), y.len());
    let (mut x, mut y) = (&x[..n], &y[..n]);
    let mut sum = 0.0;
    while x.len() >= 8 {
        sum += x[0] * y[0] + x[1] * y[1] + x[2] * y[2] + x[3] * y[3]
            + x[4] * y[4] + x[5] * y[5] + x[6] * y[6] + x[7] * y[7];
        x = &x[8..];
        y = &y[8..];
    }
    // Take care of any leftover elements (if len is not divisible by 8).
    x.iter().zip(y.iter()).fold(sum, |sum, (&ex, &ey)| sum + (ex * ey))
}
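As an aside, a similar effect can often be had without manual index arithmetic by using `chunks_exact`, which hands the compiler fixed-size blocks it can vectorize. This is just a sketch (the function name is mine, and I haven't benchmarked it against the versions above):

```rust
use std::cmp;

// Sketch: process the slices in fixed chunks of 8, then sum the tail.
// chunks_exact(8) yields only full 8-element chunks; remainder() is the
// leftover tail (fewer than 8 elements), available up front.
fn chunked_dot_product(x: &[f64], y: &[f64]) -> f64 {
    let n = cmp::min(x.len(), y.len());
    let (x, y) = (&x[..n], &y[..n]);
    let (xc, yc) = (x.chunks_exact(8), y.chunks_exact(8));
    let (xr, yr) = (xc.remainder(), yc.remainder());
    let mut sum = 0.0;
    for (a, b) in xc.zip(yc) {
        // Fixed trip count of 8: the compiler can fully unroll this.
        for i in 0..8 {
            sum += a[i] * b[i];
        }
    }
    // Take care of any leftover elements (if len is not divisible by 8).
    for (a, b) in xr.iter().zip(yr.iter()) {
        sum += a * b;
    }
    sum
}
```

Whether this matches the hand-unrolled version likely depends on the rustc version and target features, so it's worth benchmarking on your setup.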
The results I get are (for arrays with 65339 and 65527 elements):
test bench_naive ... bench: 2,910,196 ns/iter (+/- 82,582)
test bench_index ... bench: 2,562,601 ns/iter (+/- 87,100)
test bench_unrolled ... bench: 987,744 ns/iter (+/- 39,174)
Even for smaller vecs (259 and 253), the speed up is significant:
test bench_naive ... bench: 13,449 ns/iter (+/- 151)
test bench_index ... bench: 11,869 ns/iter (+/- 97)
test bench_unrolled ... bench: 4,994 ns/iter (+/- 64)
Code available at: Shared via Rust Playground · GitHub