Rust rolling average significantly slower than pandas pd.DataFrame.rolling

I have the following snippet of code that produces the rolling mean of a vector. It performs significantly worse (on my local macOS laptop, roughly 100x worse) than the equivalent pandas code. I'm using cargo run --release to build/run the Rust code.

import pandas as pd
import numpy as np

df = pd.DataFrame({"pos": np.ones(1_000_000)})

df.pos.rolling(288*7).mean()

And the Rust version:

use std::time::Instant;


fn mean_slice(window: &[f64]) -> f64 {
    let sum: f64 = window.iter().sum();
    sum / (window.len() as f64)
}


pub fn main() {
    let niter = 2;
    let window_size = 288*7;
    let pos: Vec<f64> = (0..1_000_000).map(|_| 1.0).collect();
    let d0 = Instant::now();
    for _ in 0..niter {
        let d1 = Instant::now();
        let vec: Vec<f64> = pos.windows(window_size).map(|window| mean_slice(window)).collect();
        let delta_inner = Instant::now() - d1;
        println!("delta_inner={:?}", delta_inner);
        println!("vec.sum()={}", vec.iter().sum::<f64>());
    }
    let delta = Instant::now() - d0;
    println!("delta={:?}, delta per loop={:?}", delta, delta / (niter as u32));
}

(Playground)

It may help to add -Ctarget-cpu=native to rustflags.
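For example (a sketch; the flag can equivalently go under a `rustflags` key in `.cargo/config.toml`):

```shell
# Let the compiler use every instruction-set extension of the build machine.
RUSTFLAGS="-C target-cpu=native" cargo run --release
```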

Also note that println! is exceptionally slow in Rust, and probably slower than all the calculations that you're doing. So far every thread about Rust being slower than Python boiled down to println! taking 99% of the time.
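If you do need a lot of output, locking stdout once and buffering it avoids most of that per-call cost. A minimal sketch (`write_report` is a name I made up for illustration):

```rust
use std::io::{BufWriter, Write};

// Write all lines through one buffered, locked handle instead of calling
// println!, which locks and flushes stdout on every single call.
fn write_report(mut out: impl Write) -> std::io::Result<()> {
    for i in 0..5 {
        writeln!(out, "line {}", i)?;
    }
    Ok(())
}

fn main() {
    let stdout = std::io::stdout();
    // One lock + one buffer for the whole loop; flushed when dropped.
    let out = BufWriter::new(stdout.lock());
    write_report(out).unwrap();
}
```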


I think I figured out what's going on here. df.rolling(n).mean() is exceptionally fast because it is not using the algorithm from my Rust snippet. The windowed-mean algorithm I show is O(n*m) (where m is the window size), whereas the pandas implementation is O(n): it maintains a running sum, adding each element as it enters the window and subtracting it as it leaves.

To get a sense of what pandas is doing, see here
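The core trick can be sketched in a few lines of plain f64 Rust (a minimal sketch of the idea, not the pandas code itself):

```rust
// Rolling mean via a running sum: add the element entering the window,
// subtract the one leaving it, so the whole pass is O(n) instead of O(n*m).
fn rolling_mean(arr: &[f64], window: usize) -> Vec<f64> {
    if window == 0 || arr.len() < window {
        return Vec::new();
    }
    let mut sum: f64 = arr[..window].iter().sum();
    let mut out = Vec::with_capacity(arr.len() - window + 1);
    out.push(sum / window as f64);
    for i in window..arr.len() {
        sum += arr[i] - arr[i - window];
        out.push(sum / window as f64);
    }
    out
}

fn main() {
    println!("{:?}", rolling_mean(&[1.0, 2.0, 3.0, 4.0], 2)); // [1.5, 2.5, 3.5]
}
```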


I implemented a rolling mean similar to the one used in pandas. Out of the box, it is already about four times faster than the pandas implementation. In principle I would turn this into an Iterator, but for now it just returns a Vec (I'm definitely open to input on how to improve this!):

use std::iter::Sum; 

fn mean_slice<'a, T>(window: &'a [T]) -> Option<T>
    where T: num::Float + Sum<&'a T>
{
    let sum: T = window.iter().sum();
    num::cast::<usize, T>(window.len()).map(|len| sum / len)
}

fn mean_rolling<'a, T>(arr: &'a [T], window_size: usize) -> Vec<T>
    where T: num::Float + Sum<&'a T>
{

    let window_size_t: T = num::cast::<usize, T>(window_size).unwrap();

    let mut mean_prev_opt: Option<(T, T)> = None;
    let mut res: Vec<T> = Vec::with_capacity(arr.len() - window_size + 1);

    for window in arr.windows(window_size).into_iter() {
        let mean = match mean_prev_opt {
            None => mean_slice(window).unwrap(),
            Some((prev_mean, prev)) => {
                let next = window.last().unwrap();
                prev_mean + (*next - prev) / window_size_t
            }
        };
        let prev = window.first().unwrap();
        res.push(mean);
        mean_prev_opt = Some((mean, *prev))
    }
    res
}

fn main() {
    let z_score_window_size = 288*7;
    let pos: Vec<f64> = (0..1_000_000).map(|_| 1.0).collect();
    let _res: Vec<f64> = mean_rolling(pos.as_slice(), z_score_window_size);
}

You don't need the into_iter() in the for loop, and you should remove the Option around the previous value by special-casing the first iteration instead of checking if it's the first iteration upon every iteration. That will likely increase optimizability a bit.
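Applying those two suggestions to an f64-only version might look like this (a sketch; `mean_rolling_f64` is my name, not from the post above):

```rust
// The first window is computed once before the loop, so there is no
// Option check (and no match) on every iteration.
fn mean_rolling_f64(arr: &[f64], window_size: usize) -> Vec<f64> {
    if window_size == 0 || arr.len() < window_size {
        return Vec::new();
    }
    let w = window_size as f64;
    let mut res = Vec::with_capacity(arr.len() - window_size + 1);
    // Special-cased first window.
    let mut mean = arr[..window_size].iter().sum::<f64>() / w;
    res.push(mean);
    // Incremental update: add the entering element, drop the leaving one.
    for i in window_size..arr.len() {
        mean += (arr[i] - arr[i - window_size]) / w;
        res.push(mean);
    }
    res
}

fn main() {
    println!("{:?}", mean_rolling_f64(&[1.0, 2.0, 3.0, 4.0], 2)); // [1.5, 2.5, 3.5]
}
```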

If you don't want a full iterator (as it's more complex to write), you can spare allocations by writing to a mutable slice instead of pushing to a vector. This should be easy enough to implement.
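A sketch of that slice-based variant, with the caller owning the storage (`rolling_mean_into` is a hypothetical name):

```rust
// Caller supplies the output slice, so the function allocates nothing.
// `out` must hold exactly arr.len() - window_size + 1 elements.
fn rolling_mean_into(arr: &[f64], window_size: usize, out: &mut [f64]) {
    assert!(window_size > 0 && arr.len() >= window_size);
    assert_eq!(out.len(), arr.len() - window_size + 1);
    let w = window_size as f64;
    let mut sum: f64 = arr[..window_size].iter().sum();
    out[0] = sum / w;
    for i in window_size..arr.len() {
        sum += arr[i] - arr[i - window_size];
        out[i - window_size + 1] = sum / w;
    }
}

fn main() {
    let arr = [1.0, 2.0, 3.0, 4.0];
    let mut out = [0.0f64; 3];
    rolling_mean_into(&arr, 2, &mut out);
    println!("{:?}", out); // [1.5, 2.5, 3.5]
}
```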


Here's what I came up with (f64 only). It uses a const generic for the window size, which lets you avoid heap allocations at the cost of flexibility.

pub struct RollingAvg<const N: usize, I: ?Sized = dyn Iterator<Item = f64>> {
    sum: f64,
    next_idx: usize,
    window: [f64; N],
    source: I,
}

impl<const N: usize, I> Iterator for RollingAvg<N, I>
where
    I: ?Sized + Iterator<Item = f64>,
{
    type Item = f64;
    fn next(&mut self) -> Option<f64> {
        let next = self.source.next()?;
        self.sum += next;
        self.sum -= std::mem::replace(&mut self.window[self.next_idx], next);
        self.next_idx += 1;
        if self.next_idx == N {
            self.next_idx = 0;
        }
        Some(self.sum / (N as f64))
    }
}

impl<const N: usize, I> RollingAvg<N, I>
where
    I: Iterator<Item = f64>,
{
    pub fn new(mut source: I) -> Option<Self> {
        let mut window = [0.0; N];
        let mut sum = 0.0;
        for i in 0..N {
            window[i] = source.next()?;
            sum += window[i];
        }
        Some(RollingAvg {
            window,
            sum,
            source,
            next_idx: 0,
        })
    }
}

fn main() {
    let res: Vec<_> = RollingAvg::<{ 288 * 7 }, _>::new((0..1_000_000).map(|_| 1.0))
        .unwrap()
        .collect();
    dbg!(&res[0..100]);
}

Note that the faster algorithm may drift away from the true value over time, since floating-point addition isn't associative: (a + x) - a doesn't necessarily equal x.
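A tiny demonstration of that rounding effect (values assume f64; the gap between adjacent f64s near 1e17 is 16, so adding 1.0 is lost entirely):

```rust
fn main() {
    let a = 1e17_f64;
    let x = 1.0_f64;
    // a + x rounds back to a, so the "recovered" value is 0.0, not 1.0.
    let recovered = (a + x) - a;
    assert_ne!(recovered, x);
    println!("recovered = {}", recovered); // prints: recovered = 0
}
```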

Nice investigation!

It's always cool to hear how someone read through the source code for something, realised their initial assumptions about it were wrong, and then could understand the algorithm so well they were able to reimplement it themselves.