How to “zip” two slices efficiently

Update in March 2017

Rust now implements .zip() specially for a couple of core iterators, including the slice iterators (iter and iter_mut) as well as a few adaptors of those (map, cloned, zip, maybe more). For these iterators, zip now provides satisfactory loop optimization, for example allowing auto-vectorization of certain loops.

The kind of exploration done below the line is still relevant for special cases but would need to be updated for the current state of Rust. Happy hacking -bluss
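
For illustration (a hypothetical example, not part of the original post), a lock-step loop like the following is now a good candidate for auto-vectorization when written with .zip():

// Hypothetical illustration: with the specialized Zip, this in-place
// element-wise addition over two slices can auto-vectorize on current Rust.
fn add_assign_slices(xs: &mut [f32], ys: &[f32]) {
    for (x, y) in xs.iter_mut().zip(ys) {
        *x += *y;
    }
}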


@dotdash and I have been looking into the performance of different ways to “zip” iterators. As you probably know, the preferred way to iterate over two slices or vectors in lock step is like this:

fn iterate_slices_with_zip<T>(xs: &[T], ys: &[T]) {
    for (x, y) in xs.iter().zip(ys) {
        /* loop body here */
    }
}

.zip() ignores elements past the end of the shortest of the two inputs.
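
A minimal illustration (not from the original post):

// zip stops at the end of the shorter input: only two pairs come out here.
fn main() {
    let xs = [1, 2, 3];
    let ys = [10, 20];
    let pairs: Vec<_> = xs.iter().zip(ys.iter()).collect();
    assert_eq!(pairs, [(&1, &10), (&2, &20)]);
}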

However, the Zip iterator adaptor does not optimize as well as we’d like. It needs to check whether the first iterator has reached its end, then whether the second iterator has reached its end, and only then can it yield the next element. What we need is for the optimizer to realize when these double checks are redundant.
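
Roughly speaking, a generic zip adaptor has to do something like this on every iteration (a simplified sketch for illustration, not the actual std source):

// Simplified sketch: a generic zip adaptor must check *both* inner iterators
// for exhaustion before it can yield the next pair.
struct MyZip<A, B> {
    a: A,
    b: B,
}

impl<A: Iterator, B: Iterator> Iterator for MyZip<A, B> {
    type Item = (A::Item, B::Item);

    fn next(&mut self) -> Option<Self::Item> {
        let x = match self.a.next() {
            Some(x) => x, // check 1: has the first iterator ended?
            None => return None,
        };
        let y = match self.b.next() {
            Some(y) => y, // check 2: has the second iterator ended?
            None => return None,
        };
        Some((x, y))
    }
}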

I don’t know if we are going to be able to resolve this in the general implementation. With specialization, if it becomes available in the future, we will at least be able to address it for the common case of iterating two slices in parallel.

If we “brute force” the problem, it looks like this: use unsafe!

use std::cmp;

fn iterate_slices_counted_unchecked<T>(xs: &[T], ys: &[T]) {
    let len = cmp::min(xs.len(), ys.len());
    for i in 0..len {
        unsafe {
            let x = &*xs.get_unchecked(i);
            let y = &*ys.get_unchecked(i);
            /* loop body here */
        }
    }
}

The unchecked loop does indeed turn into what looks like a loop without redundant overhead.

What happens if we try not to use unsafe? Shouldn't the optimizer still be able to elide the bounds checks?

fn iterate_slices_counted<T>(xs: &[T], ys: &[T]) {
    let len = cmp::min(xs.len(), ys.len());
    for i in 0..len {
        let x = &xs[i];
        let y = &ys[i];
        /* loop body here */
    }
}

It turns out it doesn't! For some reason, the information that i < xs.len() && i < ys.len() does not reach the loop body.

We found a workaround! Be abundantly clear:

fn iterate_slices_counted_2<T>(xs: &[T], ys: &[T]) {
    let len = cmp::min(xs.len(), ys.len());
    let xs = &xs[..len];
    let ys = &ys[..len];
    for i in 0..len {
        let x = &xs[i];
        let y = &ys[i];
        /* loop body here */
    }
}

Now it works! iterate_slices_counted_2 elides the bounds checks and performs just as well as iterate_slices_counted_unchecked. All it took was saying it in just the right way.

Sigh, optimizing compilers are mercurial beasts...

  • A few developers are looking into our LLVM optimization passes, which may have a big impact: it could make LLVM understand Rust idioms a bit better and elide, for example, bounds checks in more places.

  • .zip() will hopefully improve, but the outlook for that is unclear.

  • I published a safe wrapper around iterate_slices_counted_unchecked; it is called ZipSlices in itertools. If you have critical lock-step loops, you might consider profiling and adopting ZipSlices, or one of the solutions mentioned here, if needed.

  • (Before the counted version of ZipSlices, I had a version with three pointers: start of slice 1, end of slice 1, start of slice 2. LLVM did not appreciate this as well as it did the counted loop!)

  • Benchmark suite for this is in the itertools repo.

  • @dotdash improved the slice equality operator dramatically.

Performance evaluations are as of rustc 1.3.0-nightly (20f421cd5 2015-07-06)

12 Likes

I went on a similar exploration trying to improve dot product some time ago. I found that rustc isn't great at unrolling loops, so manually unrolling gives a significant speed boost:

use std::cmp;

// I'd love it if this were the fastest version :(
fn naive_dot_product(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y.iter())
        .fold(0.0, |sum, (&ex, &ey)| sum + (ex * ey))
}

// The method you describe.
fn index_dot_product(x: &[f64], y: &[f64]) -> f64 {
    let n = cmp::min(x.len(), y.len());
    let (x, y) = (&x[..n], &y[..n]);
    let mut sum = 0.0;
    for i in 0..n {
        sum += x[i] * y[i];
    }
    sum
}

// Shift slices in place and add 8 elements at a time.
fn unrolled_dot_product(x: &[f64], y: &[f64]) -> f64 {
    let n = cmp::min(x.len(), y.len());
    let (mut x, mut y) = (&x[..n], &y[..n]);

    let mut sum = 0.0;
    while x.len() >= 8 {
        sum += x[0] * y[0] + x[1] * y[1] + x[2] * y[2] + x[3] * y[3]
             + x[4] * y[4] + x[5] * y[5] + x[6] * y[6] + x[7] * y[7];
        x = &x[8..];
        y = &y[8..];
    }

    // Take care of any left over elements (if len is not divisible by 8).
    x.iter().zip(y.iter()).fold(sum, |sum, (&ex, &ey)| sum + (ex * ey))
}
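
The full code is behind the playground link below; a minimal harness to try the three variants might look like this (a sketch, not the poster's benchmark code):

fn main() {
    let x: Vec<f64> = (0..1000).map(|i| i as f64).collect();
    let y: Vec<f64> = (0..1000).map(|i| (i * 2) as f64).collect();
    // All three variants should agree on the same inputs, up to floating
    // point rounding (the unrolled version sums in a different order).
    println!("naive    = {}", naive_dot_product(&x, &y));
    println!("index    = {}", index_dot_product(&x, &y));
    println!("unrolled = {}", unrolled_dot_product(&x, &y));
}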

The results I get are (for arrays with 633965339 and 65527 elements):

test bench_naive    ... bench:   2,910,196 ns/iter (+/- 82,582)
test bench_index    ... bench:   2,562,601 ns/iter (+/- 87,100)
test bench_unrolled ... bench:     987,744 ns/iter (+/- 39,174)

Even for smaller vecs (259 and 253), the speed up is significant:

test bench_naive    ... bench:      13,449 ns/iter (+/- 151)
test bench_index    ... bench:      11,869 ns/iter (+/- 97)
test bench_unrolled ... bench:       4,994 ns/iter (+/- 64)

Code available at: Shared via Rust Playground · GitHub

4 Likes

That's strange; I have this test case, which vectorizes in glorious fashion. The only obvious difference is integer vs. float.

#[bench]
fn zipdot_checked_counted_loop(b: &mut test::Bencher)
{
    let xs = vec![0; 1024];
    let ys = vec![0; 768];
    let xs = black_box(xs);
    let ys = black_box(ys);

    b.iter(|| {
        // Must slice to equal lengths, and then bounds checks are eliminated!
        let len = cmp::min(xs.len(), ys.len());
        let xs = &xs[..len];
        let ys = &ys[..len];

        let mut s = 0;

        for i in 0..len {
            let x = xs[i];
            let y = ys[i];
            s += x * y;
        }
        s
    })
}

For some reason, the float version just doesn't optimize like the integer version does.

Is this a "-ffast-math" type of problem? LLVM doesn't want to do the vectorization since it may change the floating point result slightly(?)
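
To spell out why reordering isn't allowed by default: floating point addition is not associative, so a vectorized sum can produce a slightly different result. A quick illustration (not from the original thread):

// f32 addition is not associative, which is why the compiler may not
// reorder the partial sums of a dot product without explicit permission.
fn main() {
    let (a, b, c) = (1.0e8_f32, -1.0e8_f32, 1.0_f32);
    assert_eq!((a + b) + c, 1.0);
    assert_eq!(a + (b + c), 0.0);
}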

Edit: It's a -ffast-math type of problem, documented here

test zipdot_f32_checked_counted_loop   ... bench:       1,347 ns/iter (+/- 664)
test zipdot_f32_default_zip            ... bench:       1,392 ns/iter (+/- 13)
test zipdot_f32_unchecked_counted_loop ... bench:       1,343 ns/iter (+/- 371)
test zipdot_f32_zipslices              ... bench:       1,342 ns/iter (+/- 466)
test zipdot_f32_ziptrusted             ... bench:       1,342 ns/iter (+/- 387)
test zipdot_i32_checked_counted_loop   ... bench:         380 ns/iter (+/- 113)
test zipdot_i32_default_zip            ... bench:       1,401 ns/iter (+/- 27)
test zipdot_i32_unchecked_counted_loop ... bench:         308 ns/iter (+/- 154)
test zipdot_i32_zipslices              ... bench:         380 ns/iter (+/- 134)
test zipdot_i32_ziptrusted             ... bench:         301 ns/iter (+/- 148)
2 Likes

Closing the link loop: This issue tracks adding some kind of fast-math semantics to Rust, possibly using a wrapper type.

Great find! It's sad that it doesn't at least get unrolled though (since that should keep the float operation order, right?)

That's a good point, it should be unrolled.

This looks a lot like it is being compiled without optimizations. Here are the times I get with a recent rustc (1.4.0-dev (84e50ca14 2015-08-16)):

Small vectors:

Non-Optimized:

test bench_index    ... bench:       8,154 ns/iter (+/- 52)
test bench_naive    ... bench:       9,138 ns/iter (+/- 86)
test bench_unrolled ... bench:       3,388 ns/iter (+/- 28)

Optimized:

test bench_index    ... bench:         161 ns/iter (+/- 8)
test bench_naive    ... bench:         170 ns/iter (+/- 3)
test bench_unrolled ... bench:          75 ns/iter (+/- 2)

Large vectors:

Non-Optimized:

test bench_index    ... bench:     201,166 ns/iter (+/- 791)
test bench_naive    ... bench:     227,773 ns/iter (+/- 10,348)
test bench_unrolled ... bench:      80,027 ns/iter (+/- 2,402)

Optimized:

test bench_index    ... bench:       4,237 ns/iter (+/- 186)
test bench_naive    ... bench:       4,248 ns/iter (+/- 93)
test bench_unrolled ... bench:       2,655 ns/iter (+/- 121)

Interestingly, for the large vectors my numbers are still 10 times better than yours, even without optimizations. Makes me wonder what hardware you're testing this on. I'm using an i7 3770K @4.5GHz.

Arggh, serves me right for not using cargo... I'm used to cargo bench magically putting in optimization flags. Here are my optimized numbers (rustc -O --test dotprod.rs && ./dotprod --bench). Less impressive speedup, but still very significant:

test bench_index    ... bench:         231 ns/iter (+/- 3)
test bench_naive    ... bench:         239 ns/iter (+/- 85)
test bench_unrolled ... bench:          87 ns/iter (+/- 17)

and

test bench_index    ... bench:      60,053 ns/iter (+/- 527)
test bench_naive    ... bench:      51,637 ns/iter (+/- 2,902)
test bench_unrolled ... bench:      35,817 ns/iter (+/- 2,022)

for 65339 and 65527 elements (I think your large vectors are smaller than my large vectors because I had a typo in my previous comment). I remembered that there was a speedup from when I was playing with this before, so when I saw numbers confirming that, I didn't think any further. Mea culpa.

You found a way to write an unrolled loop where bounds checks are elided anyway, that's really nice.

Does LLVM's Inductive Range Check Elimination pass pick this pattern up?

@cristicbz Try this version; it vectorizes properly. Just keep the floating point rules in mind and write out everything in an order compatible with typical 4- or 8-lane f32 vector operations!

Runtime was cut almost in half here for this dot product, just by breaking up the summation into 8 parts.

#[bench]
fn zipdot_f32_checked_counted_unrolled_loop(b: &mut test::Bencher)
{
    let xs = vec![2f32; 1024];
    let ys = vec![2f32; 768];
    let xs = black_box(xs);
    let ys = black_box(ys);

    b.iter(|| {
        // Must slice to equal lengths, and then bounds checks are eliminated!
        let len = cmp::min(xs.len(), ys.len());
        let mut xs = &xs[..len];
        let mut ys = &ys[..len];

        let mut s = 0.;
        let (mut p0, mut p1, mut p2, mut p3, mut p4, mut p5, mut p6, mut p7) =
            (0., 0., 0., 0., 0., 0., 0., 0.);

        while xs.len() >= 8 {
            p0 += xs[0] * ys[0];
            p1 += xs[1] * ys[1];
            p2 += xs[2] * ys[2];
            p3 += xs[3] * ys[3];
            p4 += xs[4] * ys[4];
            p5 += xs[5] * ys[5];
            p6 += xs[6] * ys[6];
            p7 += xs[7] * ys[7];

            xs = &xs[8..];
            ys = &ys[8..];
        }
        s += p0 + p4;
        s += p1 + p5;
        s += p2 + p6;
        s += p3 + p7;

        for i in 0..xs.len() {
            s += xs[i] * ys[i];
        }
        s
    })
}
1 Like

Ooooh, awesome!

Update: In current Rust, itertools::ZipSlices does not codegen well enough anymore, probably due to the removal of a null-check-elimination LLVM pass. (It's too late to edit the original post.)

The indexed loop is still the way to go, making sure to slice to equal length, so that the loop bound and slice bounds check coincide (and are then elided).

fn iterate_slices_counted_2<T>(xs: &[T], ys: &[T]) {
    let len = cmp::min(xs.len(), ys.len());
    let xs = &xs[..len];
    let ys = &ys[..len];
    for i in 0..len {
        let x = &xs[i];
        let y = &ys[i];
        /* do something with `x` and `y` */
    }
}
2 Likes