Why is `Iter` slower than `loop` in this situation?

I'm trying to implement a custom iterator on a struct, as follows:

pub struct MovingWindow<'a, T, N> {
    slice: &'a [T],
    n: N,
    count: usize,
    init_size: usize,
}

impl<'a, 'b, T> Iterator for MovingWindow<'a, T, &'b Vec<usize>> {
    type Item = &'a [T];

    fn next(&mut self) -> Option<Self::Item> {
        if self.count <= self.init_size {
            let i = self.count;
            let win = self.n[i - 1];
            let start_i = if i < win { 0usize } else { i - win };
            // println!("i: {i}, win: {win}, start_i: {start_i}");
            let ret = Some(&self.slice[start_i..i]);
            self.count += 1;
            ret
        } else {
            None
        }
    }
}

pub trait Rolling {
    type T;
    fn rolling<N>(&self, n: N) -> MovingWindow<Self::T, N>;
}

impl<N> Rolling for [N] {
    type T = N;
    fn rolling<K>(&self, n: K) -> MovingWindow<Self::T, K> {
        MovingWindow { slice: self, n, count: 1, init_size: self.len() }
    }
}

The iterator's Item is a slice whose length depends on the MovingWindow's field n (a Vec<usize>).
I also have the following trait on [f32], which does the same thing as above but using a plain loop:

pub trait RollWindow: AsRef<[f32]> {
    fn roll_window_sum(&self, window: &[usize]) -> Vec<f32> {
        let data = self.as_ref();
        let mut res = vec![f32::NAN; data.len()];
        for (i, win) in window.iter().enumerate() {
            let start_i = if i + 1 < *win { 0 } else { i + 1 - win };
            let data_part = &data[start_i..i + 1];
            res[i] = data_part.iter().sum::<f32>();
        }
        res
    }
}

impl RollWindow for [f32] {}
impl RollWindow for Vec<f32> {}

I tested the performance of both methods as follows:

let data1 = vec![1f32; 100_000_000];
let data2 = vec![10usize; 100_000_000];

t!(data1.rolling(&data2).map(|x| x.iter().sum::<f32>()).collect::<Vec<f32>>()); // test the iter method

t!(data1.roll_window_sum(&data2)); // test the loop method

This is the playground

I use both iterators (the standard ones, not ones I implement myself) and loops a lot, and I've never found a significant performance difference between them. What am I missing in the implementation above that makes the iterator slow?

Do you use the release mode?


Yes, I tested it in release mode.

I would be suspicious of performing a benchmark this way. You may want to look into cargo-bench or some other benchmarking framework. These will run the test many, many times to try to average out the ambient system noise, and avoid the effect of "cold start" where the CPU cache may not be filled.

Not sure whether you'll see a large difference, but a quick first step could be to reverse the order of the tests, and run it a few times.


This could be down to indexing into the &Vec<usize> being slower than walking it with a proper iterator. You could try storing a std::iter::Enumerate<std::slice::Iter<'_, usize>> in your iterator instead of the &Vec<usize>, to see if that removes the difference in performance.


Could it have to do with the collect? In the second function you preallocate the output.

Have you compared the assembly from both implementations?

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.