Performance difference between iterator zip and skip order


#1

I recently wrote some naive code that used Vec indices to access data. I assumed using iterators would speed up the generated code, but the opposite happened. I think I found the cause and made a test case.

My code did a binary search to find a start position and then generally only a few iterations to find some matches. I process the results from two equal-length vectors. I can represent it using the following tests:

  • zip_skip zips the two iterators, skips a large portion, and then conditionally takes 100 elements
  • skip_zip applies skip to the individual iterators before zipping and then does the same
  • slice_zip slices the vectors before creating the iterators

    #[bench]
    fn zip_skip(b: &mut Bencher) {
        let v: Vec<_> = (0..100_000).collect();
        let t: Vec<_> = (0..100_000).collect();

        b.iter(|| {
            let s = v.iter().zip(t.iter()).skip(10000)
                .take_while(|t| *t.0 < 10100)
                .map(|(a, b)| *a + *b)
                .sum::<u64>();
            assert_eq!(s, 2009900);
        });
    }

    #[bench]
    fn skip_zip(b: &mut Bencher) {
        let v: Vec<_> = (0..100_000).collect();
        let t: Vec<_> = (0..100_000).collect();

        b.iter(|| {
            let s = v.iter().skip(10000).zip(t.iter().skip(10000))
                .take_while(|t| *t.0 < 10100)
                .map(|(a, b)| *a + *b)
                .sum::<u64>();
            assert_eq!(s, 2009900);
        });
    }


    #[bench]
    fn slice_zip(b: &mut Bencher) {
        let v: Vec<_> = (0..100_000).collect();
        let t: Vec<_> = (0..100_000).collect();

        b.iter(|| {
            let s = v[10_000..].iter().zip(t[10_000..].iter())
                .take_while(|t| *t.0 < 10100)
                .map(|(a, b)| *a + *b)
                .sum::<u64>();
            assert_eq!(s, 2009900);
        });
    }

Running this benchmark gives the following results:

running 3 tests
test tests::skip_zip  ... bench:         127 ns/iter (+/- 15)
test tests::slice_zip ... bench:          59 ns/iter (+/- 2)
test tests::zip_skip  ... bench:       5,059 ns/iter (+/- 330)

Since the summing in the map probably doesn't contribute much, I assume we are mainly benchmarking the cost of the iterators? There is a huge difference for the zip_skip approach, which to me is the one you would naturally write.

Does the skip on the zipped iterators indeed call next 10K times, while iter().skip performs the skip in a single step, like the slice does? Should I have known this?
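For reference, the pattern I described (binary search for a start position, then a bounded scan over two equal-length vectors) could be sketched like this; `sum_matches`, the key values, and the use of `partition_point` are my own illustration, not the original code:

```rust
// Hypothetical sketch: find the start via binary search, slice both
// equal-length vectors there, then scan until the key leaves the range.
fn sum_matches(v: &[u64], t: &[u64], start_key: u64, end_key: u64) -> u64 {
    // First index with v[i] >= start_key (v is assumed sorted).
    let start = v.partition_point(|&x| x < start_key);
    v[start..]
        .iter()
        .zip(t[start..].iter())
        .take_while(|&(&a, _)| a < end_key)
        .map(|(&a, &b)| a + b)
        .sum()
}

fn main() {
    let v: Vec<u64> = (0..100_000).collect();
    let t: Vec<u64> = (0..100_000).collect();
    assert_eq!(sum_matches(&v, &t, 10_000, 10_100), 2_009_900);
}
```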


#2

Yes, and yes-ish. In the case of zip_skip, you ask Rust to zip each pair of elements so that it can then inspect the tuple to decide whether to drop it. That order is fixed; Rust doesn't reorder the operations for you. So you are doing work before throwing the result away.
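You can make the per-element cost visible with an instrumented iterator (my own illustration, not from the standard library): a custom iterator doesn't opt into the slice specialization, so skipping over a zip of two of them advances both sources one element at a time.

```rust
use std::cell::Cell;
use std::rc::Rc;

// Hypothetical counting iterator: yields 0..len and records every
// `next` call in a shared counter.
struct Counting {
    cur: u64,
    len: u64,
    calls: Rc<Cell<u64>>,
}

impl Iterator for Counting {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        self.calls.set(self.calls.get() + 1);
        if self.cur < self.len {
            self.cur += 1;
            Some(self.cur - 1)
        } else {
            None
        }
    }
    // No `nth` override: skipping over this iterator (or a Zip of two
    // of them) degrades to repeated `next` calls.
}

fn main() {
    let calls_a = Rc::new(Cell::new(0));
    let calls_b = Rc::new(Cell::new(0));
    let a = Counting { cur: 0, len: 100, calls: Rc::clone(&calls_a) };
    let b = Counting { cur: 0, len: 100, calls: Rc::clone(&calls_b) };

    // Zip first, then skip: the zipped iterator cannot jump ahead, so
    // skipping 10 pairs pulls 10 full pairs through `next`.
    let first = a.zip(b).skip(10).next();
    assert_eq!(first, Some((10, 10)));

    // Each source was advanced element by element: 10 skipped + 1 yielded.
    assert_eq!(calls_a.get(), 11);
    assert_eq!(calls_b.get(), 11);
}
```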


#3

I suppose this makes sense. The skip still comes before the take_while, which is what actually wants to examine the tuple. I guess I was just hoping for too much of this zero-cost-abstraction goodness :smile:

Is there anything relevant in the difference between skip_zip and slice_zip? We are talking nanoseconds here, but pre-slicing appears to be twice as fast?


#4

Well, pre-slicing relies on the fact that the compiler knows it is dealing with a slice, while the iterator case is a little more general. I don't know whether that case could be specialised.


#5

zip is a bit awkward to get good perf from, since it can’t leverage internal iteration [1] [2] as well as many of the other adaptors can.
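A minimal illustration of the external vs. internal distinction (my own example, not from the linked posts):

```rust
fn main() {
    let v = [1u64, 2, 3, 4];

    // External iteration: the consumer pulls one item at a time via
    // `next`. A Zip has to work this way, since it must alternate
    // between its two sources.
    let mut it = v.iter();
    let mut external_sum = 0;
    while let Some(x) = it.next() {
        external_sum += x;
    }

    // Internal iteration: the iterator drives the loop itself via
    // `fold`, which adaptors like `map` can forward to the source,
    // giving the optimizer one tight loop to work with.
    let internal_sum: u64 = v.iter().fold(0, |acc, &x| acc + x);

    assert_eq!(external_sum, internal_sum);
}
```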

Basically, it’s often sad unless you hit the TrustedRandomAccess specialization: https://github.com/rust-lang/rust/blob/master/src/libcore/iter/mod.rs#L1139

Notably, that doesn’t seem to override nth, which is the critical piece for Skip to be happy: https://github.com/rust-lang/rust/blob/master/src/libcore/iter/mod.rs#L2219

So there are probably some opportunities there around overriding/implementing more things.
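To make the role of `nth` concrete, this is roughly what the default fallback does (a simplified sketch, not the actual libcore source): when an adaptor like `Zip` doesn't override `nth`, `Skip` ends up paying one `next` call per skipped element.

```rust
// Simplified sketch of the default `Iterator::nth` behaviour (not the
// real libcore implementation): advance element by element until the
// requested index is reached.
fn nth<I: Iterator>(iter: &mut I, mut n: usize) -> Option<I::Item> {
    while let Some(x) = iter.next() {
        if n == 0 {
            return Some(x);
        }
        n -= 1;
    }
    None
}

fn main() {
    let v: Vec<u32> = (0..5).collect();
    let mut it = v.iter();
    // Skipping to index 3 costs 4 `next` calls with this fallback,
    // whereas the slice iterator's own `nth` just bumps a pointer.
    assert_eq!(nth(&mut it, 3), Some(&3));
    assert_eq!(it.next(), Some(&4));
}
```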


#6

@veldsla A fix that made the bench 58x faster on my machine has been merged:

It should be in 1.26.0-nightly (e026b59cf 2018-03-03); I’m curious whether it helps your original code.


#7

Very cool. Indeed, in the synthetic benchmark zip_skip now performs identically to slice_zip.
skip_zip remains about twice as slow.

On my original code the change is also quite noticeable, but the results vary based on the ratio between the skip size and the number of iterations. Direct indexing is still fastest, followed by slice_zip, then zip_skip and skip_zip.

An interesting result is that all my implementations seem to improve on the latest nightly, according to my criterion benchmark:

find impls/zip skip
                        time:   [192.85 ns 193.98 ns 195.20 ns]
                        change: [-79.604% -79.469% -79.345%] (p = 0.00 < 0.05)
find impls/slice zip
                        time:   [154.96 ns 155.57 ns 156.28 ns]
                        change: [-46.671% -46.155% -45.596%] (p = 0.00 < 0.05)
find impls/skip zip
                        time:   [207.79 ns 208.36 ns 208.97 ns]
                        change: [-19.062% -18.734% -18.406%] (p = 0.00 < 0.05)
find impls/indexed
                        time:   [129.19 ns 129.49 ns 129.81 ns]
                        change: [-30.392% -29.947% -29.520%] (p = 0.00 < 0.05)