How to calculate rolling(x).mean() as fast as possible?

hey, guys.
in python we can use

import pandas as pd
a = pd.Series([10,15,12,13,1,5,4,1,5,6,4,1,54,5,4,1,5,4,5,6,4,4])
mean = a.rolling(3).mean()
print(mean)

but in rust we may implement it like

let mut array = vec![]
loop {
    if let Some(i) = mm.recv() {

     // as fast as possible 
      array.push(*i as f64)
      if len(array) > 3{
       array.remove(0)
     }
     let mean = array.sum::<f64)() / 3.0;
     dbg!(mean);

   }
}

it's easily to know that the array is a heap
and the allocation will happen when i remove element and push element .

notice:
the element was pushed one by one

so anyone know how to calculate it as fast as possible?

You may want to look at <[T]>::windows().

just save space

If it is too large, the speed may be slowed down.

and look at here , data is not preloaded.

I don't understand what your problem is. Please rephrase it more clearly. Here are some facts:

  • the performance of windows() doesn't depend on the window size or the length of the original slice; it uses pointer arithmetic internally, so creating it and advancing it is essentially free.
  • windows() doesn't consume dynamically allocated memory. It's O(1) space.
  • If your data is of length N, there is no way you can get faster than O(n) in calculating windowed averages. Pandas' implementation isn't faster than that, either. Pandas' Series data type can't store N elements in less than O(N) space, either.

Here's a complete demonstration that's as fast as it gets using the standard library. For small window sizes, it's likely to be auto-vectorized, too.

3 Likes

.windows() is enough to reproduce your python example. Maybe it's oversimplified one. Can you give us another example which is more similar with your real code?

sorry, i have changed the question. and maybe you can understand it .

i use an array to calculate the mean(). I wonder if it can be made faster.

The python example is just to show rolling(x).mean().

A couple of comments:

  1. First of all, you are using a Vector, not an array (the distinction is important, since you are concerned about heap allocation).
  2. Second, with your current implementation, the pushing and removal of the elements will never cause the vector to expand or reallocate beyond the 4th iteration: it will have 4 elements when pushing, then it will have its 0-th element removed, so it will now have 3 elements and a capacity of (at least) 4. This means that upon subsequent iterations, it won't ever need to reallocate anymore.
  3. Next, modern computers make extensive use of caching, prefetching, and speculative execution. This means that for a reasonable small number of elements (say, 16 f32s or 8 f64s that fit in a typical cache line of 64 bytes), linearly scanning a flat array and moving its elements one at a time is very efficient. It in fact likely has a better constant factor than whatever clever algorithm we might be able to come up with.
  4. It's still possible to avoid the dynamic allocation altogether and just use an actual array (as opposed to a Vec) of 3 elements. Better yet, use ArrayVec.
  5. You can also use VecDeque so that pushing the new element and removing the oldest one is actually technically O(1).
1 Like

Your code in python is not equivalent to that in Rust, which is really confusing and annoying.

I guess each element is gained at a time, like from an iterator or steam. If you know the denominator at compile time, you can use array with const generics:

struct Mean<const N: usize> {
    arr: [f64; N],
    pos: usize,
}

impl<const N: usize> Mean<N> {
    fn new() -> Self {
        Mean {
            arr: [0.0; N],
            pos: 0,
        }
    }

    fn push(&mut self, value: f64) {
        self.arr[self.pos] = value;
        self.pos = (self.pos + 1) % N;
    }
    fn calc(&self) -> f64 {
        self.arr.iter().sum::<f64>() / N as f64
    }
}

Thank you very much for your pointers

[image]

To do this as fast as possible:

  1. Don't calculate the sum anew each time. Keep the sum of the last N entries around; when you get a new entry, subtract off the oldest entry and add in the new one.

  2. Use an integer type (if possible) rather than f64 for the values. Your example is all integers.

  3. Make N a generic parameter, and then where possible use a power of 2 for N. You'll be doing i = (i+1) %N to calculate the index into your storage array, and if you do it right with a power of 2 the % will become an AND, which is a single-cycle op. For example, if N = 4 then x % N is just x & 3.

  4. If you are using an integer type and just want the average as an integer, use (sum + N/2) / N to calculate it. If N is a power of 2, the divide will just be a right shift. For example, if N = 4, then it becomes (sum + 2) >> 2.

  5. If you are using an integer type and want a little more accuracy than an integer mean, then don't divide by anything and interpret the sum as having an implicit fractional part. For example, 7 + 4 + 2 + 6 = 19, so the mean of those 4 numbers is 19 4ths. It's hard to beat not doing anything when it comes to performance!

  6. Use an array inside your struct (rather than a Vec or VecDeque) to hold the values so that it's more likely to be in the cache.

  7. If possible, preload the sum and the array so that you can avoid having to have an if-statement check to see whether you have collected N entries.

1 Like

Subtracting the oldest and adding the newest is indeed the fastest way, but it also accumulates floating point errors as the window moves, making the result less precise, especially if the data is large.