Very different number of Cycles using Perf

Hello,

As many of you know, I am working with large matrices, for example 16384 x 16384.

When I run my program with a matrix, the execution takes x seconds. I run it again, and it takes x + 20 seconds. I run it again and it takes x - 5 seconds, and so on.

Using perf, I realised that the number of cycles varies greatly between runs (the fewer the cycles, the less time my program takes). Since the matrices are large, I parallelize the algorithm with Rayon.

Do you know what causes this difference? I'm running the same matrix without modifying anything and without recompiling (always a release build with maximum optimization), and I get totally different times.

My matrix has the following form:

const MAX: usize = 16384;
const BLOCK_SIZE: usize = 128;
const BLOCK_ELEMS: usize = MAX / BLOCK_SIZE;

type Array<T> = [T; BLOCK_SIZE];
type Matrix<T> = [Array<T>; BLOCK_SIZE];
type Matrix2<T> = [Matrix<T>; BLOCK_ELEMS];
type BlockMatrix<T> = [Matrix2<T>; BLOCK_ELEMS];
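For scale, here is a small sketch that works out the memory footprint of these types for f32. The helper names block_bytes and total_bytes are only for illustration and are not part of the original program. It only queries type sizes, so nothing is actually allocated.

```rust
use std::mem::size_of;

const MAX: usize = 16384;
const BLOCK_SIZE: usize = 128;
const BLOCK_ELEMS: usize = MAX / BLOCK_SIZE;

type Array<T> = [T; BLOCK_SIZE];
type Matrix<T> = [Array<T>; BLOCK_SIZE];
type Matrix2<T> = [Matrix<T>; BLOCK_ELEMS];
type BlockMatrix<T> = [Matrix2<T>; BLOCK_ELEMS];

/// Size in bytes of one BLOCK_SIZE x BLOCK_SIZE block of f32 (hypothetical helper).
fn block_bytes() -> usize {
    size_of::<Matrix<f32>>()
}

/// Size in bytes of the whole blocked matrix of f32 (hypothetical helper).
fn total_bytes() -> usize {
    size_of::<BlockMatrix<f32>>()
}

fn main() {
    println!("one block:    {} KiB", block_bytes() / 1024);
    println!("whole matrix: {} MiB", total_bytes() / (1024 * 1024));
}
```

Each block is 128 x 128 x 4 bytes = 64 KiB and the whole matrix is 16384 x 16384 x 4 bytes = 1 GiB, which is far larger than typical CPU caches.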

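And I create it like this:

let matrix: Box<BlockMatrix<f32>> = unsafe {
    let layout = std::alloc::Layout::new::<BlockMatrix<f32>>();
    let ptr = std::alloc::alloc_zeroed(layout) as *mut BlockMatrix<f32>;
    // alloc_zeroed returns null on failure; abort instead of boxing a null pointer.
    if ptr.is_null() {
        std::alloc::handle_alloc_error(layout);
    }
    // All-zero bytes are a valid f32 (0.0), so the zeroed memory is fully initialized.
    Box::from_raw(ptr)
};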

Thank you!

Rayon's work-stealing is not deterministic -- there's explicit randomness in choosing which thread to try stealing from, and implicit runtime randomness in multithreading if threads do steal from the same queue.
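To illustrate the general effect, here is a minimal sketch that uses only the standard library. It is not Rayon's work-stealing algorithm: it swaps in a much simpler scheme where threads claim items from a shared atomic counter, and the function name run_dynamic is made up for this example. The point is only that the result is the same on every run while the split of work between threads depends on timing.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Splits `items` work items across `workers` threads via a shared counter.
/// Returns (sum of all item indices, number of items each thread processed).
fn run_dynamic(items: usize, workers: usize) -> (u64, Vec<usize>) {
    let next = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let next = Arc::clone(&next);
            thread::spawn(move || {
                let mut sum = 0u64;
                let mut count = 0usize;
                loop {
                    // Claim the next unprocessed item; which thread gets it depends on timing.
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= items {
                        break;
                    }
                    sum += i as u64;
                    count += 1;
                }
                (sum, count)
            })
        })
        .collect();

    let mut total = 0u64;
    let mut counts = Vec::with_capacity(workers);
    for handle in handles {
        let (sum, count) = handle.join().expect("worker thread panicked");
        total += sum;
        counts.push(count);
    }
    (total, counts)
}

fn main() {
    for run in 0..3 {
        let (total, counts) = run_dynamic(100_000, 4);
        println!("run {run}: total = {total}, items per thread = {counts:?}");
    }
}
```

The total is identical across runs, but the per-thread counts usually differ from one run to the next, so per-thread timings and cache behaviour can differ too.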

How long is your total run? A few seconds of variance seems surprising, but hopefully that's small compared to the full time.

Well, sometimes the execution takes 20 seconds, other times 16 seconds, other times 35 or 45 seconds, and so on.
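To put a number on that spread, here is a small sketch that summarizes the four timings quoted above. The function name summarize is hypothetical, and the input is just those four values, not a new measurement.

```rust
/// Returns (min, max, mean) of a non-empty slice of timings in seconds.
fn summarize(times: &[f64]) -> (f64, f64, f64) {
    let min = times.iter().copied().fold(f64::INFINITY, f64::min);
    let max = times.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let mean = times.iter().sum::<f64>() / times.len() as f64;
    (min, max, mean)
}

fn main() {
    // Wall-clock times reported for four runs of the same binary and input.
    let times = [20.0, 16.0, 35.0, 45.0];
    let (min, max, mean) = summarize(&times);
    println!("min = {min} s, max = {max} s, mean = {mean} s");
    println!("slowest / fastest = {:.2}x", max / min);
}
```

The slowest run takes about 2.8 times as long as the fastest, so the variance is not small compared to the full run time.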

I understand what you're saying, but my algorithm doesn't have much data dependence.

Could using Rc and Weak help?