Hello,

As many of you know, I am working with large matrices, for example 16384 x 16384.

When I run my program with a matrix, the execution takes x seconds. I run it again, and it takes x + 20 seconds. I run it again and it takes x - 5 seconds, and so on.
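
A minimal sketch of how the run-to-run spread could be measured from inside the program, assuming the computation can be wrapped in a closure. The names `time_runs`, `spread` and `work` are hypothetical and only for illustration; this is not the actual program.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `runs` times and returns the wall-clock duration of each run.
fn time_runs<F: FnMut()>(runs: usize, mut work: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}

/// Returns (fastest, slowest) of the measured durations, or None if there are none.
fn spread(times: &[Duration]) -> Option<(Duration, Duration)> {
    let min = times.iter().min()?;
    let max = times.iter().max()?;
    Some((*min, *max))
}
```

Calling `time_runs(10, || { /* matrix computation here */ })` and then `spread(&times)` gives the fastest and slowest run, which makes the variation easier to report than individual timings. If the result of the computation is otherwise unused, passing it through `std::hint::black_box` helps keep the compiler from optimising the work away.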

Using perf, I realised that the number of cycles varies greatly between runs (the fewer the cycles, the less time my program takes). Since the matrices are large, I parallelise the algorithm with Rayon.

Do you know what causes this difference? I'm running the same matrix without modifying anything and without recompiling, just running it again (always a release build with maximum optimization), and it gives me totally different times.

My matrix has the following form:

```
const MAX: usize = 16384;
const BLOCK_SIZE: usize = 128;
const BLOCK_ELEMS: usize = MAX / BLOCK_SIZE;
type Array<T> = [T; BLOCK_SIZE];
type Matrix<T> = [Array<T>; BLOCK_SIZE];
type Matrix2<T> = [Matrix<T>; BLOCK_ELEMS];
type BlockMatrix<T> = [Matrix2<T>; BLOCK_ELEMS];
```
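
For scale, here is a small sketch of what these definitions imply for `f32` elements. The helpers `matrix_bytes` and `block_index` are hypothetical names added for illustration, and the mapping assumes the outermost index selects the block row and the next one the block column, which the definitions above do not state.

```rust
use std::mem::size_of;

const MAX: usize = 16384;
const BLOCK_SIZE: usize = 128;
const BLOCK_ELEMS: usize = MAX / BLOCK_SIZE;
type Array<T> = [T; BLOCK_SIZE];
type Matrix<T> = [Array<T>; BLOCK_SIZE];
type Matrix2<T> = [Matrix<T>; BLOCK_ELEMS];
type BlockMatrix<T> = [Matrix2<T>; BLOCK_ELEMS];

/// Total size in bytes of the whole blocked matrix for f32 elements.
fn matrix_bytes() -> usize {
    size_of::<BlockMatrix<f32>>()
}

/// Maps a global (row, col) to (block_row, block_col, row_in_block, col_in_block).
/// Assumption: outer index = block row, second index = block column.
fn block_index(row: usize, col: usize) -> (usize, usize, usize, usize) {
    (
        row / BLOCK_SIZE,
        col / BLOCK_SIZE,
        row % BLOCK_SIZE,
        col % BLOCK_SIZE,
    )
}
```

With these constants, `BLOCK_ELEMS` is 128, each 128 x 128 block of `f32` occupies 64 KiB, and `matrix_bytes()` returns 1 GiB (2^30 bytes) for the full matrix.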

And I create it as follows:

```
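let matrix: Box<BlockMatrix<f32>> = unsafe {
    // Zeroed heap allocation; all-zero bits are a valid f32 (0.0).
    let layout = std::alloc::Layout::new::<BlockMatrix<f32>>();
    let ptr = std::alloc::alloc_zeroed(layout) as *mut BlockMatrix<f32>;
    // alloc_zeroed returns null on failure, and Box::from_raw on null is undefined behaviour.
    if ptr.is_null() {
        std::alloc::handle_alloc_error(layout);
    }
    Box::from_raw(ptr)
};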
```

Thank you!