How to “zip” two slices efficiently

I went on a similar exploration trying to improve a dot product some time ago. I found that rustc isn't great at unrolling loops, so manually unrolling gives a significant speed boost:

use std::cmp;

// I'd love it if this were the fastest version :(
fn naive_dot_product(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y.iter())
        .fold(0.0, |sum, (&ex, &ey)| sum + (ex * ey))
}

// The method you describe.
fn index_dot_product(x: &[f64], y: &[f64]) -> f64 {
    let n = cmp::min(x.len(), y.len());
    let (x, y) = (&x[..n], &y[..n]);
    let mut sum = 0.0;
    for i in 0..n {
        sum += x[i] * y[i];
    }
    sum
}

// Advance the slices in place, accumulating 8 products at a time.
fn unrolled_dot_product(x: &[f64], y: &[f64]) -> f64 {
    let n = cmp::min(x.len(), y.len());
    let (mut x, mut y) = (&x[..n], &y[..n]);

    let mut sum = 0.0;
    while x.len() >= 8 {
        sum += x[0] * y[0] + x[1] * y[1] + x[2] * y[2] + x[3] * y[3]
             + x[4] * y[4] + x[5] * y[5] + x[6] * y[6] + x[7] * y[7];
        x = &x[8..];
        y = &y[8..];
    }

    // Take care of any leftover elements (when len is not divisible by 8).
    x.iter().zip(y.iter()).fold(sum, |sum, (&ex, &ey)| sum + (ex * ey))
}
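
As an aside, on newer compilers the same 8-wide blocking can be written with chunks_exact, which stays in iterator land and handles the tail via remainder(). This is just a sketch of that variant under the same assumption (both slices trimmed to equal length); I haven't benchmarked it:

// Same 8-wide unrolling, expressed with chunks_exact (untested sketch).
fn chunked_dot_product(x: &[f64], y: &[f64]) -> f64 {
    let n = cmp::min(x.len(), y.len());
    let (x, y) = (&x[..n], &y[..n]);

    let mut xc = x.chunks_exact(8);
    let mut yc = y.chunks_exact(8);

    let mut sum = 0.0;
    for (cx, cy) in (&mut xc).zip(&mut yc) {
        sum += cx[0] * cy[0] + cx[1] * cy[1] + cx[2] * cy[2] + cx[3] * cy[3]
             + cx[4] * cy[4] + cx[5] * cy[5] + cx[6] * cy[6] + cx[7] * cy[7];
    }

    // Both remainders are the equal-length tails left over after the 8-wide chunks.
    xc.remainder().iter().zip(yc.remainder())
        .fold(sum, |sum, (&ex, &ey)| sum + (ex * ey))
}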

These are the results I get for arrays with 633965339 and 65527 elements:

test bench_naive    ... bench:   2,910,196 ns/iter (+/- 82,582)
test bench_index    ... bench:   2,562,601 ns/iter (+/- 87,100)
test bench_unrolled ... bench:     987,744 ns/iter (+/- 39,174)

Even for smaller vecs (259 and 253 elements), the speed-up is significant:

test bench_naive    ... bench:      13,449 ns/iter (+/- 151)
test bench_index    ... bench:      11,869 ns/iter (+/- 97)
test bench_unrolled ... bench:       4,994 ns/iter (+/- 64)
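
For reference, output like the above comes from #[bench] functions run with the nightly test crate. A minimal sketch of one such benchmark could look like this (the input sizes and values are illustrative guesses, not the exact setup from the playground link):

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

#[bench]
fn bench_unrolled(b: &mut Bencher) {
    // Illustrative inputs only; the original benchmark data may differ.
    let x: Vec<f64> = (0..65527).map(|i| i as f64).collect();
    let y: Vec<f64> = (0..65527).map(|i| (i as f64) * 0.5).collect();
    b.iter(|| black_box(unrolled_dot_product(&x, &y)));
}

// bench_naive and bench_index look the same, just calling the other functions.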

Code available at: Shared via Rust Playground · GitHub
