Hey everyone - I was playing around with Rust and decided to do a small comparison of naively multiplying two matrices together using the standard O(n^3) algorithm. The results surprised me and am wondering if anyone has any insight into the following:

- The naive algorithm in Rust was nearly 2.5x slower than C++ (and yes, I am using a release build). The matrices were 1000x1000 and I'm on an old Thinkpad, so the times were on the order of seconds (e.g., 6.89s for Rust, 2.68 for C++).
- If instead of updating the matrix like
`m[i][j] += a[i][k] * b[k][j]`

, I use a temporary variable to hold the sum and then do`m[i][j] = s`

Rust is only 1.5x slower than C++. Making this change in the C++ code does not materially impact the runtime. So this is quite curious for Rust. - If I take the transpose of the second matrix to allow me to use iterators then Rust's performance is on par with C++ (using the same transpose algorithm in C++ as well).

What is it about iterating over the matrix in rust that is causing this performance discrepancy? It is clear Rust *can* have the same performance as C++ (and C by the way, the times for C and C++ were nearly the same), as shown by the transpose algorithm. I don't really care about multiplying matrices as I can use Eigen in C++ or n-algebra, etc. in Rust. What I am more interested in is what is it about the naive multiplication algorithm in Rust that is causing the performance difference. Is there a clean way to take a column slice that I don't see? Why does the temporary help so much?

Naive algorithm with temporary update

```
type MatrixF64 = Vec<Vec<f64>>;
fn mul(x: &MatrixF64, y: &MatrixF64) -> MatrixF64 {
let mut result = vec![vec![0.0; x.len()]; y[0].len()];
for i in 0..x.len() {
for j in 0..y.len() {
let mut s = 0.0;
for k in 0..x.len() {
s += x[i][k] * y[k][j]; // faster than updating result[i][j] directly
}
result[i][j] = s;
}
}
result
}
```

Transpose algorithm - can use iterators and has better cache locality

```
fn mul_alt(x: &MatrixF64, y: &MatrixF64) -> MatrixF64 {
let mut result = vec![vec![0.0; x.len()]; y[0].len()];
let mut y_t = vec![vec![0.0; x.len()]; y[0].len()];
for i in 0..y_t.len() {
for j in 0..y_t[0].len() {
y_t[i][j] = y[j][i];
}
}
for i in 0..x.len() {
for j in 0..y_t.len() {
result[i][j] = x[i].iter()
.zip(y_t[j].iter())
.map(|(&a, &b)| a*b)
.sum::<f64>();
}
}
return result;
}
```