Performance difference between iterator zip and skip order


#1

I recently wrote some naive code that used Vec indices to access data. I assumed using iterators would speed up the generated code, but the opposite happened. I think I found the cause and made a test case.

My code did a binary search to find a start position and then generally only a few iterations to find some matches. I process the results from two equal-length vectors. I can represent it using the following tests:

  • zip_skip zips the two iterators, skips a large portion, and then conditionally takes 100 elements
  • skip_zip applies skip to the individual iterators before zipping and then does the same
  • slice_zip slices the vectors before creating the iterators

    #[bench]
    fn zip_skip(b: &mut Bencher) {
        let v: Vec<_> = (0..100_000).collect();
        let t: Vec<_> = (0..100_000).collect();

        b.iter(|| {
            let s = v.iter().zip(t.iter()).skip(10000)
                .take_while(|t| *t.0 < 10100)
                .map(|(a, b)| *a + *b)
                .sum::<u64>();
            assert_eq!(s, 2009900);
        });
    }

    #[bench]
    fn skip_zip(b: &mut Bencher) {
        let v: Vec<_> = (0..100_000).collect();
        let t: Vec<_> = (0..100_000).collect();

        b.iter(|| {
            let s = v.iter().skip(10000).zip(t.iter().skip(10000))
                .take_while(|t| *t.0 < 10100)
                .map(|(a, b)| *a + *b)
                .sum::<u64>();
            assert_eq!(s, 2009900);
        });
    }


    #[bench]
    fn slice_zip(b: &mut Bencher) {
        let v: Vec<_> = (0..100_000).collect();
        let t: Vec<_> = (0..100_000).collect();

        b.iter(|| {
            let s = v[10_000..].iter().zip(t[10_000..].iter())
                .take_while(|t| *t.0 < 10100)
                .map(|(a, b)| *a + *b)
                .sum::<u64>();
            assert_eq!(s, 2009900);
        });
    }

Running this benchmark gives the following results:

running 3 tests
test tests::skip_zip  ... bench:         127 ns/iter (+/- 15)
test tests::slice_zip ... bench:          59 ns/iter (+/- 2)
test tests::zip_skip  ... bench:       5,059 ns/iter (+/- 330)

Since the summing in the map probably doesn't contribute much, I assume we are mainly benchmarking the cost of the iterators? There is a huge difference for the zip_skip approach, which to me is the one you would naturally write.

Does the skip on the zipped iterators indeed call next 10K times, while iter().skip performs the skip in a single step, like the slice does? Should I have known this?
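For reference, the pattern I described (binary search for a start position, then a bounded scan over two equal-length vectors) could be sketched like this; `sum_matches`, the key values, and the use of `partition_point` are my own illustration, not the original code:

```rust
// Hypothetical sketch: find the start via binary search, slice both
// equal-length vectors there, then scan until the key leaves the range.
fn sum_matches(v: &[u64], t: &[u64], start_key: u64, end_key: u64) -> u64 {
    // First index with v[i] >= start_key (v is assumed sorted).
    let start = v.partition_point(|&x| x < start_key);
    v[start..]
        .iter()
        .zip(t[start..].iter())
        .take_while(|&(&a, _)| a < end_key)
        .map(|(&a, &b)| a + b)
        .sum()
}

fn main() {
    let v: Vec<u64> = (0..100_000).collect();
    let t: Vec<u64> = (0..100_000).collect();
    assert_eq!(sum_matches(&v, &t, 10_000, 10_100), 2_009_900);
}
```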


#2

Yes, and yes-ish. In the case of zip_skip, you ask Rust to zip each pair of elements so that it can then inspect the tuple to decide whether to drop it. That order is fixed; Rust doesn't reorder the operations for you. So you are doing work before throwing the result away.
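You can make the per-element cost visible with an instrumented iterator (my own illustration, not from the standard library): a custom iterator doesn't opt into the slice specialization, so skipping over a zip of two of them advances both sources one element at a time.

```rust
use std::cell::Cell;
use std::rc::Rc;

// Hypothetical counting iterator: yields 0..len and records every
// `next` call in a shared counter.
struct Counting {
    cur: u64,
    len: u64,
    calls: Rc<Cell<u64>>,
}

impl Iterator for Counting {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        self.calls.set(self.calls.get() + 1);
        if self.cur < self.len {
            self.cur += 1;
            Some(self.cur - 1)
        } else {
            None
        }
    }
    // No `nth` override: skipping over this iterator (or a Zip of two
    // of them) degrades to repeated `next` calls.
}

fn main() {
    let calls_a = Rc::new(Cell::new(0));
    let calls_b = Rc::new(Cell::new(0));
    let a = Counting { cur: 0, len: 100, calls: Rc::clone(&calls_a) };
    let b = Counting { cur: 0, len: 100, calls: Rc::clone(&calls_b) };

    // Zip first, then skip: the zipped iterator cannot jump ahead, so
    // skipping 10 pairs pulls 10 full pairs through `next`.
    let first = a.zip(b).skip(10).next();
    assert_eq!(first, Some((10, 10)));

    // Each source was advanced element by element: 10 skipped + 1 yielded.
    assert_eq!(calls_a.get(), 11);
    assert_eq!(calls_b.get(), 11);
}
```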


#3

I suppose this makes sense. The skip still comes before the take_while, which is what actually wants to examine the tuple. I guess I was just hoping for too much of this zero-cost-abstraction goodness :smile:

Is there anything relevant in the difference between skip_zip and slice_zip? We are talking nanoseconds here, but pre-slicing appears to be twice as fast?


#4

Well, pre-slicing relies on the fact that the compiler knows it is dealing with a slice, while the iterator case is a little more general. I don't know whether that case could be specialised.


#5

zip is a bit awkward to get good perf from, since it can’t leverage internal iteration [1] [2] as well as many of the other adaptors can.
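A minimal illustration of the external vs. internal distinction (my own example, not from the linked posts):

```rust
fn main() {
    let v = [1u64, 2, 3, 4];

    // External iteration: the consumer pulls one item at a time via
    // `next`. A Zip has to work this way, since it must alternate
    // between its two sources.
    let mut it = v.iter();
    let mut external_sum = 0;
    while let Some(x) = it.next() {
        external_sum += x;
    }

    // Internal iteration: the iterator drives the loop itself via
    // `fold`, which adaptors like `map` can forward to the source,
    // giving the optimizer one tight loop to work with.
    let internal_sum: u64 = v.iter().fold(0, |acc, &x| acc + x);

    assert_eq!(external_sum, internal_sum);
}
```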

Basically, it’s often sad unless you hit the TrustedRandomAccess specialization: https://github.com/rust-lang/rust/blob/master/src/libcore/iter/mod.rs#L1139

Notably, that doesn’t seem to override nth, which is the critical piece for Skip to be happy: https://github.com/rust-lang/rust/blob/master/src/libcore/iter/mod.rs#L2219

So there are probably some opportunities there around overriding/implementing more things.
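To make the role of `nth` concrete, this is roughly what the default fallback does (a simplified sketch, not the actual libcore source): when an adaptor like `Zip` doesn't override `nth`, `Skip` ends up paying one `next` call per skipped element.

```rust
// Simplified sketch of the default `Iterator::nth` behaviour (not the
// real libcore implementation): advance element by element until the
// requested index is reached.
fn nth<I: Iterator>(iter: &mut I, mut n: usize) -> Option<I::Item> {
    while let Some(x) = iter.next() {
        if n == 0 {
            return Some(x);
        }
        n -= 1;
    }
    None
}

fn main() {
    let v: Vec<u32> = (0..5).collect();
    let mut it = v.iter();
    // Skipping to index 3 costs 4 `next` calls with this fallback,
    // whereas the slice iterator's own `nth` just bumps a pointer.
    assert_eq!(nth(&mut it, 3), Some(&3));
    assert_eq!(it.next(), Some(&4));
}
```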


#6

@veldsla A fix that made the bench 58x faster on my machine has been merged:

It should be in 1.26.0-nightly (e026b59cf 2018-03-03); I’m curious whether it helps your original code.


#7

Very cool. Indeed, in the synthetic benchmark zip_skip now performs identically to slice_zip.
skip_zip remains about twice as slow.

On my original code the change is also quite noticeable, but the results vary based on the ratio between the skip size and the number of iterations. Direct indexing is still fastest, followed by slice_zip, then zip_skip and skip_zip.

An interesting result is that all my implementations seem to improve on the latest nightly, according to my criterion benchmark:

find impls/zip skip
                        time:   [192.85 ns 193.98 ns 195.20 ns]
                        change: [-79.604% -79.469% -79.345%] (p = 0.00 < 0.05)
find impls/slice zip
                        time:   [154.96 ns 155.57 ns 156.28 ns]
                        change: [-46.671% -46.155% -45.596%] (p = 0.00 < 0.05)
find impls/skip zip
                        time:   [207.79 ns 208.36 ns 208.97 ns]
                        change: [-19.062% -18.734% -18.406%] (p = 0.00 < 0.05)
find impls/indexed
                        time:   [129.19 ns 129.49 ns 129.81 ns]
                        change: [-30.392% -29.947% -29.520%] (p = 0.00 < 0.05)