I have the following snippet of code that produces the rolling mean of a vector. It performs significantly worse (on my local macOS laptop, I'm seeing 100x worse performance) than some equivalent pandas code. I'm using cargo run --release to build/run my Rust code.

import pandas as pd
import numpy as np
df = pd.DataFrame({"pos": np.ones(1_000_000)})
df.pos.rolling(288*7).mean()

use std::time::Instant;

fn mean_slice(window: &[f64]) -> f64 {
    let sum: f64 = window.iter().sum();
    sum / (window.len() as f64)
}

pub fn main() {
    let niter = 2;
    let window_size = 288 * 7;
    let pos: Vec<f64> = (0..1_000_000).map(|_| 1.0).collect();
    let d0 = Instant::now();
    for _ in 0..niter {
        let d1 = Instant::now();
        let vec: Vec<f64> = pos.windows(window_size).map(mean_slice).collect();
        let delta_inner = Instant::now() - d1;
        println!("delta_inner={:?}", delta_inner);
        println!("vec.sum()={}", vec.iter().sum::<f64>());
    }
    let delta = Instant::now() - d0;
    println!("delta={:?}, delta per loop={:?}", delta, delta / (niter as u32));
}

It may help to add -Ctarget-cpu=native to rustflags.
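One way to do that without touching any config files (the exact flag spelling below is the standard rustc codegen option; adjust to taste):

```shell
RUSTFLAGS="-C target-cpu=native" cargo run --release
```

Alternatively, put it under `[build] rustflags` in `.cargo/config.toml` so it applies to every build.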

Also note that println! is exceptionally slow in Rust, and probably slower than all the calculations that you're doing. So far every thread about Rust being slower than Python boiled down to println! taking 99% of the time.
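If printing does turn out to dominate, the usual mitigation is to lock stdout once and buffer the writes, instead of paying a lock plus an unbuffered write per println! call. A sketch (write_report is a hypothetical helper name, generic over Write so it's easy to test against a byte buffer):

```rust
use std::io::{self, BufWriter, Write};

// Push all lines through one buffered, locked handle rather than
// calling println! once per line (which re-locks stdout each time).
fn write_report<W: Write>(mut out: W, n: usize) -> io::Result<()> {
    for i in 0..n {
        writeln!(out, "line {}", i)?;
    }
    out.flush() // surfaces any write error; drop would also flush
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    write_report(BufWriter::new(stdout.lock()), 5)
}
```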

I think I figured out what's going on here. df.rolling(n).mean() is exceptionally fast because it is not using the same algorithm I demonstrate in my Rust snippet. The windowed mean algorithm I show is O(n*m) (where m is the size of the window), whereas the pandas implementation keeps a running sum, which is O(n): each element is added once when it enters the window and subtracted once when it leaves.

I implemented a rolling mean similar to the one used in pandas. Out of the box, this is already 4 times faster than the pandas implementation. In principle I would turn this into an Iterator, but for now it just returns a Vec (I'm definitely open to input on how to improve this!):

use std::iter::Sum;

fn mean_slice<'a, T>(window: &'a [T]) -> Option<T>
where
    T: num::Float + Sum<&'a T>,
{
    let sum: T = window.iter().sum();
    num::cast::<usize, T>(window.len()).map(|len| sum / len)
}

fn mean_rolling<'a, T>(arr: &'a [T], window_size: usize) -> Vec<T>
where
    T: num::Float + Sum<&'a T>,
{
    let window_size_t: T = num::cast::<usize, T>(window_size).unwrap();
    let mut mean_prev_opt: Option<(T, T)> = None;
    let mut res: Vec<T> = Vec::with_capacity(arr.len() - window_size + 1);
    for window in arr.windows(window_size).into_iter() {
        let mean = match mean_prev_opt {
            None => mean_slice(window).unwrap(),
            Some((prev_mean, prev)) => {
                let next = window.last().unwrap();
                prev_mean + (*next - prev) / window_size_t
            }
        };
        let prev = window.first().unwrap();
        res.push(mean);
        mean_prev_opt = Some((mean, *prev));
    }
    res
}

fn main() {
    let z_score_window_size = 288 * 7;
    let pos: Vec<f64> = (0..1_000_000).map(|_| 1.0).collect();
    let res: Vec<f64> = mean_rolling(pos.as_slice(), z_score_window_size);
    // Use the result so the optimizer can't discard the whole computation.
    println!("res.len()={}", res.len());
}

You don't need the into_iter() in the for loop, and you should remove the Option around the previous value by special-casing the first window before the loop instead of branching on the Option in every iteration. That will likely improve optimizability a bit.
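A sketch of that restructuring, specialized to f64 for brevity (mean_rolling_nobranch is a hypothetical name; the running-sum idea is the same as in the snippet above):

```rust
fn mean_rolling_nobranch(arr: &[f64], window_size: usize) -> Vec<f64> {
    assert!(window_size > 0 && window_size <= arr.len());
    let w = window_size as f64;
    let mut res = Vec::with_capacity(arr.len() - window_size + 1);

    // First window: a plain sum, computed once before the loop,
    // so no per-iteration branch on "is this the first window?".
    let mut mean = arr[..window_size].iter().sum::<f64>() / w;
    res.push(mean);

    // Every later window: one add, one subtract.
    for i in window_size..arr.len() {
        mean += (arr[i] - arr[i - window_size]) / w;
        res.push(mean);
    }
    res
}

fn main() {
    let pos = vec![1.0_f64; 1_000_000];
    let res = mean_rolling_nobranch(&pos, 288 * 7);
    println!("res.len()={}", res.len());
}
```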

If you don't want a full iterator (as it's more complex to write), you can spare allocations by writing to a mutable slice instead of pushing to a vector. This should be easy enough to implement.
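For instance, a variant that fills a caller-provided slice, so a buffer can be reused across calls (mean_rolling_into is a hypothetical name, again specialized to f64):

```rust
// Same running-sum mean, but writing into `out` instead of
// allocating and pushing into a fresh Vec on every call.
fn mean_rolling_into(arr: &[f64], window_size: usize, out: &mut [f64]) {
    assert!(window_size > 0 && window_size <= arr.len());
    assert_eq!(out.len(), arr.len() - window_size + 1);
    let w = window_size as f64;
    let mut mean = arr[..window_size].iter().sum::<f64>() / w;
    out[0] = mean;
    for i in window_size..arr.len() {
        mean += (arr[i] - arr[i - window_size]) / w;
        out[i - window_size + 1] = mean;
    }
}

fn main() {
    let pos = vec![1.0_f64; 1_000];
    let mut out = vec![0.0; pos.len() - 10 + 1];
    mean_rolling_into(&pos, 10, &mut out);
    println!("out[0]={}", out[0]);
}
```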

Note that the faster algorithm may drift away from the true value over time, since floating-point arithmetic isn't associative: (a + x) - a doesn't necessarily equal x.
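A minimal demonstration of that failure mode (the particular values are just illustrative):

```rust
fn main() {
    let a = 1e16_f64;
    let x = 1.0_f64;
    // 1e16 + 1.0 rounds back to 1e16 (the spacing between adjacent
    // f64 values near 1e16 is 2.0), so the 1.0 is lost entirely.
    let round_trip = (a + x) - a;
    println!("(a + x) - a = {}", round_trip);
}
```

In a long-running sum this kind of rounding accumulates, which is why the running-sum mean can slowly diverge from the per-window mean.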

It's always cool to hear how someone read through the source code for something, realised their initial assumptions about it were wrong, and then could understand the algorithm so well they were able to reimplement it themselves.