Looking for help understanding Rust's performance vs C++

Hey everyone - I was playing around with Rust and decided to do a small comparison of naively multiplying two matrices together using the standard O(n^3) algorithm. The results surprised me, and I am wondering if anyone has any insight into the following:

  1. The naive algorithm in Rust was nearly 2.5x slower than C++ (and yes, I am using a release build). The matrices were 1000x1000 and I'm on an old Thinkpad, so the times were on the order of seconds (e.g., 6.89s for Rust, 2.68s for C++).
  2. If, instead of updating the matrix with m[i][j] += a[i][k] * b[k][j], I use a temporary variable to hold the sum and then do m[i][j] = s, Rust is only 1.5x slower than C++. Making this change in the C++ code does not materially affect the runtime, so this is quite curious for Rust.
  3. If I take the transpose of the second matrix, which lets me use iterators, then Rust's performance is on par with C++ (using the same transpose algorithm in C++ as well).

What is it about iterating over the matrix in Rust that is causing this performance discrepancy? It is clear Rust can match the performance of C++ (and C, by the way - the times for C and C++ were nearly the same), as the transpose algorithm shows. I don't really care about multiplying matrices per se, as I can use Eigen in C++ or nalgebra, etc. in Rust. What I am more interested in is what it is about the naive multiplication algorithm in Rust that causes the performance difference. Is there a clean way to take a column slice that I am not seeing? Why does the temporary help so much?

Naive algorithm with temporary update

type MatrixF64 = Vec<Vec<f64>>;

fn mul(x: &MatrixF64, y: &MatrixF64) -> MatrixF64 {
    // x is n x m and y is m x p, so the result is n x p
    let mut result = vec![vec![0.0; y[0].len()]; x.len()];
    for i in 0..x.len() {
        for j in 0..y[0].len() {
            let mut s = 0.0;
            for k in 0..y.len() {
                s += x[i][k] * y[k][j]; // faster than updating result[i][j] directly
            }
            result[i][j] = s;
        }
    }
    result
}

Transpose algorithm - can use iterators and has better cache locality

fn mul_alt(x: &MatrixF64, y: &MatrixF64) -> MatrixF64 {
    let mut result = vec![vec![0.0; y[0].len()]; x.len()];
    // transpose y so that each of its columns becomes a contiguous row
    let mut y_t = vec![vec![0.0; y.len()]; y[0].len()];
    for i in 0..y_t.len() {
        for j in 0..y_t[0].len() {
            y_t[i][j] = y[j][i];
        }
    }
    for i in 0..x.len() {
        for j in 0..y_t.len() {
            // dot product of row i of x with row j of y_t (i.e., column j of y)
            result[i][j] = x[i].iter()
                .zip(y_t[j].iter())
                .map(|(&a, &b)| a * b)
                .sum::<f64>();
        }
    }
    result
}
4 Likes

My first guess is bounds checks. In Rust, accessing a Vec via [] always performs a bounds check so it can panic on out-of-bounds access (this prevents reading invalid/unowned memory, something C++ does not care about at all). There is also a possibility that your C++ compiler handles some optimization better, but I don't think so (the iterator version is equal in speed, and iterators don't need any bounds checks). In general, direct indexing should be avoided - it is neither optimal nor idiomatic.

To check whether I guessed right, just change the indexing to get_unchecked(idx) calls, which don't perform bounds checks (and are therefore unsafe).

6 Likes

Switching to get_unchecked does indeed even out the performance of Rust with C++. It is somewhat troubling, though: what is someone supposed to do if they can't avoid direct indexing? I have always had the impression that one should avoid using unsafe, but this is a case where you really can't avoid it (unless you want to allocate a new array).

Again, I know there are the nalgebra and ndarray crates for this type of stuff, but it seems like a hard sell to use Rust in a numerical setting if you have to write code along the lines of:

for i in 0..x.len() {
  for j in 0..y.len() {
    for k in 0..x.len() {
      unsafe {
        result[i][j] += x.get_unchecked(i).get_unchecked(k) * y.get_unchecked(k).get_unchecked(j);
      }
    }
  }
}

I also have no problem with using iterators - I think they are more elegant than indexed loops - but it seems there are many settings where using an iterator is either not possible or is clunky (like this one). In this case, the bounds checking introduces quite a bit of overhead.

1 Like

You can "hoist" bounds checks out of the loop by using slices, as long as LLVM can follow that the slice len and index variable are related:

let len = vec.len();
let slice = &vec[0..len];
for i in 0..len {
    let elem = slice[i]; // no bounds check: LLVM knows i < len == slice.len()
}
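
For example, here is a sketch of that trick applied to the naive mul above (assuming square n x n matrices; mul_hoisted is just an illustrative name):

fn mul_hoisted(x: &MatrixF64, y: &MatrixF64) -> MatrixF64 {
    let n = x.len();
    let mut result = vec![vec![0.0; n]; n];
    for i in 0..n {
        // one bounds check here proves row.len() == n
        let row = &x[i][..n];
        for j in 0..n {
            let mut s = 0.0;
            for k in 0..n {
                // row[k] should need no per-iteration check once LLVM sees
                // k < n == row.len(); y[k][j] is still checked, because the
                // column access spans separate inner Vecs
                s += row[k] * y[k][j];
            }
            result[i][j] = s;
        }
    }
    result
}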

And unsafe {get_unchecked()} is fine. You can (with care) build a safe matrix multiplication function on top of it, so the risk doesn't spread.

14 Likes

The problem / question at hand is:

what should the function do on ill-formed inputs?

  • (a pair of Vec<Vec<f64>> can have any shape)

There are two answers to this question:

  • Undefined Behavior,

  • defined (but garbage, like a panic) behavior

Rust forces the programmer and the code to be very explicit about this choice: if a function cannot handle all inputs, then it must be marked unsafe, which allows it to follow the dangerous C/C++ style. On the other hand, defined behavior, no matter how ill-formed the "matrices" may be, requires runtime checks on the indexing, which comes at a performance cost.

The best solution to this (zero-cost safety) is having a whole type be inhabited only by valid matrices, i.e., having a type-level invariant. This is what ndarray, nalgebra and the like do.
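
For illustration, a minimal sketch of such a type-level invariant over the existing representation (ValidMatrix is a made-up name, not the ndarray/nalgebra design):

struct ValidMatrix(Vec<Vec<f64>>);

impl ValidMatrix {
    fn new(m: Vec<Vec<f64>>) -> Option<ValidMatrix> {
        // one shape check at the boundary: every row has the same length,
        // so algorithms taking a ValidMatrix never have to re-validate it
        let cols = m.first().map_or(0, |row| row.len());
        if m.iter().all(|row| row.len() == cols) {
            Some(ValidMatrix(m))
        } else {
            None
        }
    }
}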

8 Likes

Note that this verbosity issue can be resolved by building an unsafe_indexing! macro that 1/wraps a code block in unsafe and 2/replaces every occurrence of something[i] with something.get_unchecked(i). For all I know, there might even already be a crate for that.
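
A far simpler variant than that code-walking macro is also possible (idx! is a hypothetical name, and it only covers a single 2D access):

macro_rules! idx {
    ($m:expr, $i:expr, $j:expr) => {
        // the safety contract is on the caller: $i and $j must be in bounds
        unsafe { *$m.get_unchecked($i).get_unchecked($j) }
    };
}

// in the hot loop: s += idx!(x, i, k) * idx!(y, k, j);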

On the other hand, Rust makes unchecked indexing unsafe and noisy for a reason. Every call to get_unchecked() introduces a possible memory safety bug, which Rust is supposed to prevent.

Which is why the safe solution proposed by @kornel is usually favored, unless one is really operating inside of a hot loop where even the "outer" bound check already has unacceptable overhead.

Let's cross fingers that someday, rustc and LLVM will learn to do this bound check optimization automatically, instead of having us programmers write weird slicing incantations in order to benefit from it.

1 Like

Very interesting approach - I will need to play around with this some. It seems this would be simple for accessing the rows of the matrix, but not as easy for accessing the columns of a matrix. Then again, I can use something like x.iter().enumerate() to get the rows of matrix x and the corresponding index.

Or even simpler: iterate with decreasing indices: for i in (0 .. x.len()).rev()

1 Like

Also, note that if optimal performance is desired Vec<Vec<f64>> is almost always a bad choice. That's because, as @Yandros pointed out, this type is "too general" and allows each matrix column to have a different length, which in turn forces a safe program to perform O(N) sanity checks before processing the matrix. Furthermore even if you force each vector to have the correct length through careful encapsulation, the inner vectors will cause more memory allocations and reduce cache locality.

To avoid these problems, popular matrix libraries implement NxM matrices (N rows, M columns) as a single Vec<f64> of length N*M, and translate the 2D indexing internally as mat[i][j] -> mat.data[j*N + i].
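
As a sketch of that layout (the names are illustrative, not the internals of any particular crate):

struct ColMajorMatrix {
    data: Vec<f64>, // rows * cols elements in one contiguous allocation
    rows: usize,
    cols: usize,
}

impl ColMajorMatrix {
    fn zeros(rows: usize, cols: usize) -> Self {
        ColMajorMatrix { data: vec![0.0; rows * cols], rows, cols }
    }

    // element (i, j): column j is contiguous, starting at offset j * rows
    fn get(&self, i: usize, j: usize) -> f64 {
        self.data[j * self.rows + i]
    }
}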

In some cases, performance can be increased further by adding some padding at the end of each matrix column, so that e.g. its length is a multiple of the CPU's SIMD vector length.

(NOTE: The above assumes a column-major, Fortran-like matrix layout. If using a row-major, C-like matrix layout, every statement about rows and columns must be transposed...)
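
For the padding, a sketch of the usual round-up arithmetic (assuming, say, 4 f64 lanes for 256-bit SIMD):

fn padded_len(rows: usize, lanes: usize) -> usize {
    // round the column length up to the next multiple of the SIMD width
    (rows + lanes - 1) / lanes * lanes
}

// padded_len(1000, 4) == 1000, padded_len(1001, 4) == 1004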

10 Likes

Agreed completely with the comments about Vec<Vec<f64>>. I guess this is really more an issue of indexing than of data representation, and the take-away is simply: don't index if you can avoid it. It seems @HadrienG's suggestion is the most ergonomic, but not particularly ideal.

1 Like

Is it really the issue? For your particular example - matrix multiplication - cache-aware code gives a ~1.7x speedup (according to Ulrich Drepper's famous article), and using the CPU's vector extensions gives ~10x according to my tests.
So is there any point in not using a dedicated matrix crate that may implement all these kinds of optimizations?

2 Likes

I fully agree that @HadrienG has the best solution in this thread so far, but if I've learned one thing about performance-tuning it is that there is a practically-infinite slippery slope of improvements.
We're not even using SIMD yet!
I know I'm not good enough to make those improvements myself, but with Cargo, I don't need to :smile:

As @Dushistov mentioned: you're aware of ndarray and nalgebra as 'even better' solutions. Why not use those?

Rust's dependency management superpowers shouldn't be underestimated as a power multiplier in the numerical ecosystem.
While the likes of Fortran, C/C++ (and even Python) might have better libraries available (for now :relaxed: ), none of those languages make integrating those libraries as pain-free as Rust does, which allows us to catch up at a frankly insane rate. I mean, Rust 1.0 is ~4 years old, and we're already going toe-to-toe with languages that have a multi-decade head-start!

Edit to add: If you're serious about numerical perf, you'll also want to keep an eye on the "Const Generics" RFC and its tracking issue.
That will allow array dimensions as generic params, so you can suddenly explain to the compiler that an <M,N> matrix can only be multiplied with an <N,P> one. Or that an Array<5> is a different type from an Array<6>.
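
As a sketch of what that could buy us (using syntax along the lines of the RFC, which is not implemented yet, so this is illustrative only):

struct Matrix<const R: usize, const C: usize> {
    data: [[f64; C]; R],
}

// the inner dimensions must agree at compile time - a Matrix<2, 3> simply
// cannot be passed where a Matrix<3, 2> is expected
fn mul<const R: usize, const K: usize, const C: usize>(
    a: &Matrix<R, K>,
    b: &Matrix<K, C>,
) -> Matrix<R, C> {
    let mut out = Matrix { data: [[0.0; C]; R] };
    for i in 0..R {
        for j in 0..C {
            for k in 0..K {
                out.data[i][j] += a.data[i][k] * b.data[k][j];
            }
        }
    }
    out
}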

4 Likes

The point I was trying to make was that indexing is dramatically slower (due to bounds checking). It makes sense that using [] is slower, but naively one might not expect it. Even looking through the book, I don't see it explicitly spelled out that using [] (or get) is bounds-checked and has a cost (admittedly, I only skimmed it quickly). Someone writing numerical code that iterates through a collection of data might not realize there is overhead there that they wouldn't have in C++.

I think the take away is to spend time designing the backing data structure to avoid these kinds of issues.

TBH, I can see the OP's point here. "Use libraries" is not a complete answer, because 1/ someone needs to implement those libraries someday, and 2/ if you dig deep enough, there will always be a corner case that your library of choice does not implement for you.

3 Likes

I am merely using matrix-matrix multiplication as an example. If I actually needed to multiply two arrays together I would use ndarray or nalgebra. This was more about exploring indexing in Rust and why it is slow compared to C++ (bounds checking). I think someone that is not super familiar with Rust would be surprised by the performance hit.

I really like Rust - cargo is awesome, it's performant, and it's a fun language to work with. But if I were starting a large project doing numerical work and told everyone "By the way, don't use indexing in the code base because it is slow. Instead, use this macro or wrap your code in unsafe, etc.", I think it would be a hard sell.

3 Likes

I suppose that is because it is not always slow.
Obviously, to operate on a 1000x1000 matrix you need at least 10^6 index accesses, and even if a bounds check could be implemented in a single assembly instruction, it would still slow down your program (a million extra instructions is not a joke).

But each new version of rustc and each new version of LLVM may fix your particular code: even without iterators, the compiler may understand that in a particular case the bounds check is unnecessary, because it always passes, and simply remove it.

So it is impossible to say that in the general case you get worse performance with bounds checking.
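
For instance, here is a well-known pattern where the optimizer can already remove the check today, because the loop bound is the slice's own length:

fn sum(v: &[f64]) -> f64 {
    let mut s = 0.0;
    for i in 0..v.len() {
        s += v[i]; // LLVM can prove i < v.len(), so no panic branch remains
    }
    s
}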

2 Likes

It's been too long since I last looked, but I would hope it is still the case that iterators are encouraged; and with that comes the implication that indexing is not the idiomatic way.

This is really a case where we're dealing with conflicting requirements.

On one hand, one hard design requirement for Rust is that it must be impossible to trigger undefined behavior using safe code. This includes the buffer over- and underflow opportunities opened by unchecked indexing, which in the wrong context (e.g. web server) can be major security issues.

On the other hand, many numerical algorithms rely on complicated matrix indexing patterns that compilers are very bad at optimizing. This makes array bound checks unpleasant in numerics, to the point where people may sometimes prefer risking UB with unchecked indexing.

As Rust is a general purpose language, and not a numerics-specific one like Fortran and Julia, it has to tread very carefully here. Which is why we have safe bound-checked indexing by default, and a scary way to disable bound checking for the brave souls who know what they are doing, and are ready to take the risk of proving the memory-safety of their code manually.

Hopefully, as compiler optimizations improve, the scary way will be needed less and less often. But I don't think that the need for it will ever fully go away.

11 Likes

Yes - it does state that iterators are encouraged, but as I mentioned in several replies, there are cases where using iterators is not feasible or is clunky. In those cases indexing may be your best bet. Part of what I was looking for was input on how to handle these cases, and @kornel gave a good option I was unaware of.

Oh, I sympathize :heart:
A coworker of mine did a project that needs to hook into the decompression of a file and extract info along the way. He planned on C++; I advocated a Rust approach (because who starts a greenfield project in C these days, and he has been interested in Rust for a while now :slight_smile: )

After some prototyping, we quickly learned that decompressing in the C-world uses an unholy amount of pointer arithmetic and indexing into byte arrays, which makes the documented algorithms basically impossible to do in Rust.
He ended up writing it in C++14/17 after all, interfacing intimately with zlib. This was the only practical solution, because neither of us have the time (or skills..) to invent a decompressor using a "rustic", iterator-and-slices-based algorithm (though I'm certain it can be done!)

Edit to add: my colleague would like to clarify that he uses "C++14/17 with classes, lambdas, a full test-suite and everything", not "just C"

3 Likes