Why iterating over the iterator is 3x slower

While benchmarking creating a vector from an iterator, I noticed the following performance difference:

use criterion::{criterion_group, criterion_main, Criterion};

fn vec1<T, I: Iterator<Item = T>>(iterator: I) -> Vec<T> {
    iterator.collect()
}

// I know this should be unsafe and the iterator should really be `TrustedLen`, but for this example it is enough
fn vec2<T, I: Iterator<Item = T>>(iterator: I) -> Vec<T> {
    let (_, upper) = iterator.size_hint();
    let len = upper.expect("trusted_len_iter requires an upper limit");

    let mut buffer = Vec::with_capacity(len);
    let mut dst = buffer.as_mut_ptr();
    unsafe {
        for item in iterator {
            std::ptr::write(dst, item);
            dst = dst.add(1);
        }
    }
    unsafe { buffer.set_len(len) };
    buffer
}

fn add_benchmark(c: &mut Criterion) {
    let values = 0..1026;

    c.bench_function("vec", |b| {
        b.iter(|| assert_eq!(vec1(values.clone().into_iter())[1025], 1025))
    });

    c.bench_function("vec_2", |b| {
        b.iter(|| assert_eq!(vec2(values.clone().into_iter())[1025], 1025))
    });
}

criterion_group!(benches, add_benchmark);
criterion_main!(benches);
vec                     time:   [117.76 ns 117.91 ns 118.06 ns]
vec_2                   time:   [347.36 ns 348.16 ns 349.10 ns] 

I am a bit puzzled by this behavior.

Sorry, my mistake - I misread the code; I don't see a Vec iterator there. libstd has some specialization-based optimizations, and in that case - collecting a fresh Vec's into_iter - it would just take back the Vec allocation and return a vector without any allocation or copying. The .collect() then manages to do its work very cheaply, and that's probably the explanation.

These are implementation details, and can change in the future of course.

For what it's worth vec2 is twice as fast as vec1 on Apple Silicon:

vec                     time:   [199.49 ns 199.57 ns 199.66 ns]
vec_2                   time:   [98.734 ns 98.785 ns 98.843 ns]

As for specialization - I don't think there is a specialization for collecting a range iterator into a Vec so I don't see a reason why vec2 should be slower on x86-64.

vec_2 is also faster on my x86_64 laptop:

vec                     time:   [184.18 ns 185.58 ns 187.05 ns]                
vec_2                   time:   [116.24 ns 117.37 ns 118.69 ns]                  

I am sorry, this was indeed a problem on my end: I now observe the same difference. I probably ran the benchmark on a different version of the code by mistake.

I am still puzzled, though: shouldn't this code result in the same performance? Shouldn't .collect() specialize via the iterator's TrustedLen?

I further simplified the benchmark so it no longer needs to clone the range, and marked vec1 as #[inline], which as far as I can see should result in the same behavior:

use criterion::{criterion_group, criterion_main, Criterion};

#[inline]
fn vec1<T, I: Iterator<Item = T>>(iterator: I) -> Vec<T> {
    iterator.collect()
}

// I know this should be unsafe and the iterator should really be `TrustedLen`, but for this example it is enough
fn vec2<T, I: Iterator<Item = T>>(iterator: I) -> Vec<T> {
    let (_, upper) = iterator.size_hint();
    let len = upper.expect("trusted_len_iter requires an upper limit");

    let mut buffer = Vec::with_capacity(len);
    let mut dst = buffer.as_mut_ptr();
    unsafe {
        for item in iterator {
            std::ptr::write(dst, item);
            dst = dst.add(1);
        }
    }
    unsafe { buffer.set_len(len) };
    buffer
}

fn add_benchmark(c: &mut Criterion) {
    let values = 0..1026;
    let values = values.collect::<Vec<_>>();

    c.bench_function("vec", |b| {
        b.iter(|| assert_eq!(vec1(values.iter().map(|x| x * 2))[1025], 2 * 1025))
    });

    c.bench_function("vec_2", |b| {
        b.iter(|| assert_eq!(vec2(values.iter().map(|x| x * 2))[1025], 2 * 1025))
    });
}

criterion_group!(benches, add_benchmark);
criterion_main!(benches);

.collect() on a TrustedLen iterator uses this specialized extend impl: https://github.com/rust-lang/rust/blob/1.53.0/library/alloc/src/vec/spec_extend.rs#L22-L56

The only real difference between this code and yours is that it uses a SetLenOnDrop guard so it won't leak items if the iterator panics. It looks like this might add a bit of overhead to the loop.

As far as I can see, SetLenOnDrop is used to avoid updating the length in memory on every iteration. Interestingly enough, doing exactly that does not seem to slow anything down:

fn vec3<T, I: Iterator<Item = T>>(iterator: I) -> Vec<T> {
    let (_, upper) = iterator.size_hint();
    let len = upper.expect("trusted_len_iter requires an upper limit");

    let mut buffer = Vec::with_capacity(len);
    let mut dst = buffer.as_mut_ptr();
    unsafe {
        for item in iterator {
            std::ptr::write(dst, item);
            dst = dst.offset(1);
            buffer.set_len(buffer.len() + 1);
        }
    }
    buffer
}

vec3() is just as fast for me and still twice as fast as the stdlib implementation.

Edit: Not sure what issue SetLenOnDrop worked around exactly.

SetLenOnDrop worked around an aliasing issue - the compiler must see that the element store and the length store don't alias, otherwise that's a problem for the code generation of the loop.

This problem has been rather long lived, depends on the context of the surrounding code, and needs some investigation to see if it's finally gone. We've seen it recently in similarly shaped code in ndarray. I was hoping mutable noalias would take care of it, finally (but don't know).
