We all know `iter` is faster than `loop`, but why?

Iter is (slightly) faster than loop('for', 'while' and 'loop'), I learned that from many documentations, but none explains why.

A naive test code like this could tell iter is a little bit faster:

use std::time::Instant;

const VEC_SIZE: usize = 10000000;

fn use_iter(vec: &mut Vec<usize>) {
    vec.iter_mut().enumerate().for_each(|(i, v)| *v = i);
}

fn use_for(vec: &mut Vec<usize>) {
    for i in 0..vec.len() {
        vec[i] = i;
    }
}

fn use_while(vec: &mut Vec<usize>) {
    let mut i = 0;
    let len = vec.len();
    while i < len {
        vec[i] = i;
        i += 1;
    }
}

fn main() {
    let mut vec = vec![0; VEC_SIZE];

    // Warm up a little
    for _ in 0.. 10 {
        use_while(&mut vec);
    }

    let now = Instant::now();
    for _ in 0.. 10 {
        use_iter(&mut vec);
    }
    let elapsed = now.elapsed();
    println!("time elapsed using iter:  {:?}", elapsed);

    let now = Instant::now();
    for _ in 0.. 10 {
        use_for(&mut vec);
    }
    let elapsed = now.elapsed();
    println!("time elapsed using for:   {:?}", elapsed);

    let now = Instant::now();
    for _ in 0.. 10 {
        use_while(&mut vec);
    }
    let elapsed = now.elapsed();
    println!("time elapsed using while: {:?}", elapsed);
}

But what I am courious about is why iter is faster, they all have to visit same memory address and perform same operation. And loop is so direct, yet iter might even need to run more code (I don't know if this is true in CPU instruction level, so I say might.), because we need to make a iteration struct out of it and call next() to iterate, how could it be faster?

I noticed that the assembly code generated by rustc is different using iter and loop, but it's way to complicated for me to understand.

Here is some of my guesses:

  1. compiler optimization (but based on what ? Why loop cannot be optimized?)
  2. no boundry check ?
  3. something ralated mutability or memory access?

I hope someone had studied this and got an answer to share.

1 Like

No boundary checks is large one for slice (and thus Vec) iterators; they know the bounds ahead of time and the bounds aren't going to change during iteration due to Rust's borrowing rules. (If you're lucky, your index-based iteration might also avoid bounds checks, if the optimizer happens to notice your indices are always in bound.)

There are also various other traits that optimizations can take advantage of, depending on the situation. For example, if you are using .collect() on an iterator that implements ExactSizeIterator, allocation of the created collection can be accurately done ahead of time.

Some of these optimizations you could do normally yourself, e.g. if specific traits help your use case. Others are done with unsafe, which you could do to, with the proper amount of care (and risk). And more others are done with language features like specialization which have not yet made it to the stable compiler.

This is a nuanced test for silly reasons. The functions are taking &mut Vec<usize> but not changing the length of the Vec. If you change them to take &mut [usize] instead -- nothing else needs to change, as the caller continues to work -- then they end up all being equally fast (within error): Rust Playground

So the difference is that the compiler can't currently elide the bounds checks in the exact case, though it can in very similar ones.

Those bounds checks tend not to matter in code with a less-tiny body, because the branch predictor predicts them correctly anyway, but here the body is so small that micro-differences matter.

5 Likes

We all know iter is faster than loop

This is not true as-is. Iterators are not magic. A given piece of code may be slightly faster when written using an iterator (when the iterator allows for elision of bounds checks, as someone already mentioned here), or slightly slower (when the compiler has trouble optimizing away all the generic code they come with). In general, looping and iterators have approximately the same performance characteristics on average.

We all know iter is faster than loop...

I don't think we all know that. I suspect that many new to Rust would doubt it at first glance and then proceed to do some little tests to see for themselves. As you have. As I have:

Those repos contain various solutions to the respective problem, most suggested by folks here. Some, using iterators, match or exceed the performance of traditional loops. Some fall down badly.

Sorry those repos are a bit of a mess and the results not presented very clearly. But you could have some fun running them for yourself and trying to spot which will perform better by reading the code before running them.

How does iter get free of boundary checks? Doesn't it just move boundary checks to next method ? As I see here, is_empty!(self) is the boundary check.
But as the comment just below next method implies could be implemented with slices, but this avoids bounds checks, means there should be no bounds checks in this implementation, I am confused.

// Inlining is_empty and len makes a huge performance difference
macro_rules! is_empty {
    // The way we encode the length of a ZST iterator, this works both for ZST
    // and non-ZST.
    ($self: ident) => {
        $self.ptr.as_ptr() as *const T == $self.end
    };
}

#[inline]
fn next(&mut self) ->  Option<$elem> {
// could be implemented with slices, but this avoids bounds checks

// SAFETY: `assume` calls are safe since a slice's start pointer
// must be non-null, and slices over non-ZSTs must also have a
// non-null end pointer. The call to `next_unchecked!` is safe
// since we check if the iterator is empty first.
    unsafe {
        assume(!self.ptr.as_ptr().is_null());
        if mem::size_of::<T>() != 0 {
            assume(!self.end.is_null());
        }
        if is_empty!(self) {
            None
        } else {
            Some(next_unchecked!(self))
        }
    }
}

Consider the for loop with indexes, rewritten a bit to make the operations that happen more clear:

let mut i = 0;
let len = vec.len();

while i < len {
    vec[i] = i;
    
    i += 1;
}

In this example, both i < len and vec[i] contain a runtime-check that i is less than len. In this simple example, the compiler can probably remove the vec[i] bounds check, but it may not be able to in more complicated examples. With iterators, the second bounds check does not even appear in the unoptimized code due to an unsafe block inside the iterator.

4 Likes

Thanks, so vec[i] = i; uses index and introduces double bounds check, and iter only checks once.

2 Likes

In some sense, it's not that iterators don't have bounds checks, it's just that they only have one.

2 Likes

...but why?

My take on it, without having an actual clue of the actual details is:

When the programmer writes traditional C style loops with for, while, array indexing, index incrementing, etc they can be pretty sure that array bounds are not exceeded. Typically without coding any array bound checks themselves. After all, they wrote the code, they know it is right. Right? (Well, at least they will know it is right after testing and/or years in production)

The question then is, can a compiler do static analysis on that code at compile time and convince itself no array bounds violations will happen and hence no run-time checks need be inserted. As far as I can make out this is still something of an area of research and the checkers we have so far are not very good.

On the other hand if the programmer does the same task using iterators, then the compiler is actually writing all the complex mechanics of the loops and indexing. The compiler is fully in control of the code it generates, it will not generate code that causes array bounds violations, so it knows when array bound checks are not needed. Magic!

How well this all works in practice though is another matter. My experiments show a lot of variability of performance with different styles of writing the same thing. It's kind of hard to predict from the source code which will be quicker. But this is also my experience when writing in C or C++.

As a long time C user looking at the new world of .iter.. .zip.. .enumerate.. .collect. and lambda this and that, I'm amazed it works as well as it does. After all, on the face of it, the source is introducing all kind of method and function calls that don't exist in C style loops. Seems all that "overhead" one is looking at gets optimized away very well. Really magic!

2 Likes

Actually, I was recently wondering what the cost of array bounds checks are anyway.

Our modern processors have deep pipelines, multiple instruction dispatch, speculative execution, branch predictors and whatever else.

It seems likely then that array bounds checks can be lost in the noise or had for free.

If your array access is to some random element the cost of fetching it from main memory can be hundreds of times more than that of an array bounds check. Which becomes hardly noticeable.

1 Like

In Rust, iterators are not really special w.r.t. the compiler "knowing about them" (except for the IntoIterator trait which is required for for…in syntax to work). So a slight correction to that statement is that it's not the compiler that knows how to write more efficient code, but the person who implemented the iterator.

For example, here's the implementation of Iterator for <[T]>::Iter. It uses a ton of unsafe, and in particular, next_unchecked!(), which is how it avoids the additional bounds check.

4 Likes

Thanks for the clarification. As you see it's not always clear to me what magic is down to the compiler and what magic is down to the library authors.

Perhaps I should spend some time coding without std/core.

Thanks for all your replies !

So, in conclusion, iter (sometimes) can be slightly faster because it avoids double bounds check, hardly any optimization other than that (except for some iterators that implement certain trait, as @quinedot mentioned above). Iter is not guaranteed to be faster (even maybe a little bit slower in some case) but definitely less code and better readability.

Getting curious I added some cargo bench tests to asingingbird's expample code like so:

#[cfg(test)]
mod tests {
    use test::{Bencher, black_box};
    use super::VEC_SIZE;
    use super::*;

    #[bench]
    fn bench_use_iter(b: &mut Bencher) {
        let mut vec = vec![0; VEC_SIZE];
        b.iter(|| {
            black_box(use_iter(&mut vec));
        });
    }

    #[bench]
    fn bench_use_for(b: &mut Bencher) {
        let mut vec = vec![0; VEC_SIZE];
        b.iter(|| {
            black_box(use_for(&mut vec));
        });
    }

    #[bench]
    fn bench_use_while(b: &mut Bencher) {
        let mut vec = vec![0; VEC_SIZE];
        b.iter(|| {
            black_box(use_while(&mut vec));
        });
    }
}

The results are not so conclusive as to whether for, while or iter is quicker:

$ cargo bench
   Compiling forwhileiter v0.1.0 (/home/zicog/forwhileiter)
    Finished bench [optimized] target(s) in 0.90s
     Running target/release/deps/forwhileiter-d01e0792b2ae0893

running 3 tests
test tests::bench_use_for   ... bench:   9,613,405 ns/iter (+/- 1,967,807)
test tests::bench_use_iter  ... bench:   9,143,740 ns/iter (+/- 985,055)
test tests::bench_use_while ... bench:   9,001,185 ns/iter (+/- 216,612)

test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out

$ cargo bench
    Finished bench [optimized] target(s) in 0.01s
     Running target/release/deps/forwhileiter-d01e0792b2ae0893

running 3 tests
test tests::bench_use_for   ... bench:   8,879,995 ns/iter (+/- 155,249)
test tests::bench_use_iter  ... bench:   9,018,180 ns/iter (+/- 116,579)
test tests::bench_use_while ... bench:   8,931,535 ns/iter (+/- 177,088)

test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out

$ cargo bench
    Finished bench [optimized] target(s) in 0.01s
     Running target/release/deps/forwhileiter-d01e0792b2ae0893

running 3 tests
test tests::bench_use_for   ... bench:   9,061,865 ns/iter (+/- 190,441)
test tests::bench_use_iter  ... bench:   9,069,900 ns/iter (+/- 167,675)
test tests::bench_use_while ... bench:   9,265,630 ns/iter (+/- 451,000)

test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out

Things get interesting when one reduces the size of the Vec. Say only a million elements:

test tests::bench_use_for   ... bench:     669,440 ns/iter (+/- 71,928)
test tests::bench_use_iter  ... bench:     483,301 ns/iter (+/- 31,536)
test tests::bench_use_while ... bench:     674,707 ns/iter (+/- 21,256)

Or only a thousand elements:

test tests::bench_use_for   ... bench:         671 ns/iter (+/- 12)
test tests::bench_use_iter  ... bench:         164 ns/iter (+/- 3)
test tests::bench_use_while ... bench:         650 ns/iter (+/- 12)

The iterator is doing very well at these sizes.

I can't help thinking the effects of caches dominate at large sizes.

Benchmarking is hard.

It's not just the cost of the bound check, but also the lost optimizations. For example in asingingbird's original code, user_iter's loop is automatically vectorized by LLVM but use_for's and use_while's loops are not. Interestingly in use_while inling vec.len() in the while condition let LLVM remove the bound checks, however still doesn't auto vectorize the loop. Compiler Explorer

2 Likes

Does not look like less code to me:

The for loop solution has the least characters at 32"

for i in 0..vec.len(){vec[i]=i}

The iter solution has much more at 49:

vec.iter_mut().enumerate().for_each(|(i,v)|*v=i);

As an old C hand I would argue about the readability as well.

The iter solution introduces three method calls that the reader needs to be familiar with. It introduces that | ... | syntax. What's that about? Ah, it's yet another function call. None of this is at all obvious. It's even less obvious that most of that apparent calling gets optimized away,

Meanwhile anyone coming to Rust from a wide range of other languages, Java, C#, Javascript, going all the way back to ALGOL, Pascal, Ada,. PL/M, (that's most of the worlds programmers) could have a good guess at what that the for loop does with a high probability of being right.

All this functional programming style is new and exotic stuff that most have never seen. As I have said here before, I write my Rust as simply as possible, so that my non-Rust using colleagues stand some chance of understanding it even if they don't know enough Rust to work on it. I don't want to frighten them away from Rust :slight_smile:

That makes sense to me. I have seen use of iter allowing the compiler to vectorize code where otherwise it did not.

However in this case it's not clear if, or when, use of iter produces faster results. See benchmarks above.

Another interesting example is VecDeque. Its iterator implementation is faster (at least when cache misses are not the bottleneck) than a normal loop even if bound checks are removed due to how the next element is calculated. Even more interesting, consuming the iterator with for_each is faster than using for because it is written in a way that let LLVM auto vectorizing the loop while for doesn't due to the wrapping behaviour of VecDeque. To get the same result without for_each you'll need to split the loop in two by calling VecDeque::as(_mut)_slices

Curiosity continuing I tried the benchmark on the ARM processor on the Raspberry Pi 4, using 64 bit mode. For ten million elements iter does not help at all:

test tests::bench_use_for   ... bench:  24,295,083 ns/iter (+/- 66,382)
test tests::bench_use_iter  ... bench:  24,277,390 ns/iter (+/- 72,666)
test tests::bench_use_while ... bench:  24,298,114 ns/iter (+/- 83,775)

Again things look much better for iter when one gets down to only a thousand elements:

test tests::bench_use_for   ... bench:       1,407 ns/iter (+/- 2)
test tests::bench_use_iter  ... bench:         477 ns/iter (+/- 0)
test tests::bench_use_while ... bench:       1,406 ns/iter (+/- 1)

Somewhere around five hundred thousand elements iter begins to lose it's magic:

test tests::bench_use_for   ... bench:   1,018,592 ns/iter (+/- 182,787)
test tests::bench_use_iter  ... bench:     989,294 ns/iter (+/- 94,995)
test tests::bench_use_while ... bench:     985,342 ns/iter (+/- 31,504)