Performance question

Im a college kid learning Rust and diving into implementing high performance math.

I tried implementing Gaussian Elimination in both Rust and C++, The Rust code is ~4x slower on my computer (using target-cpu=native / march=native)

RUST: https://gist.github.com/c81671777f34ef85854699316331d352
(with par_azip as shown in the gist it beats cpp, but thats using 16 cores vs 1)

CPP: https://gist.github.com/1e924211b0a2d7c05b9585928b8d5dbf

I'm mostly curious if there is something i could be doing to make it autovectorize as gcc is able to for c++

Timing: (1000x1000
80ms for parallel rust (ryzen 5800x)
400ms for serial rust
100ms for c++ (not parallel)

Project setup for running everything. i dont use an external tool for benchmarking, just use std::chrono in c++ and tempus_fugit in rust

You could loook at the generated LLVM IR/assembly to figure out whether there's something really obvious that Rust or LLVM does inefficiently. LLVM is in itself capable of autovectorization, so if that is not happening, there's probably something hindering that optimization step.

When performing matrix and vector operations, one of the most frequent causes of slowdown is array bounds checking. I'm not familiar enough with ndarray to know if this is indeed what is happening here, though.

1 Like

ndarray is much more flexible than a matrix flattened in an array, so it's possible that you're paying for that. You could try with a 1:1 translation, although I guess bound checks will still slow down the rust version but maybe not that much.

ps: AFAIK restrict is not valid C++. I guess you meant C?

Did you try both Clang and GCC for the C version? They often auto-vectorize code in different ways, your issue could be also related to LLVM if you are comparing the performance against GCC.

oh yes, i used it from c++ by defining out the restrict

Honestly I was hoping that with zip and iterators and whatnot I could be competitive with c++, I was hoping for zero cost abstractions.

Maybe I should look for other array libraries

I did try clang, it takes about the same amount of time, a little slower, but not to the extent of rust

1 Like

Do you need to use stable Rust or you can go nightly?

We are just a step away from min const generics, which could help a bit. I am not sure you could get all the abstractions given by ndarray, but at least you could see the performance using a very similar implementation to C with a single thread approach. It is hard to estimate if multithreading is an optimization or a pessimization in this case.

using par_azip (in certain places, not too much) made it faster, in fact it was competitive with gcc, but that didnt use all cores so its not really comparable

i am using nightly, because this isnt for a product, just playing around with different languages and writing performant code.

do you mean using const generic arrays instead of ndarray? I could do that but then it would just be a direct translation of C/C++. I wanted to see if it could be done in a higher level way using iterators

Exactly!

You can still use iterator features (like some interesting ones in nightly) in order to have a partial gain-gain in both performance and expressiveness.

As you can imagine, const generics is a very desired feature because it allows creating code that is both performant and expressive when you have sizes at compile-time.

But maybe you need dynamic sizes for the matrices, in this case const generics won't help you a lot. Even in this case, I think that it would be interesting to implement the C algorithm just using &[f32]s and iterators in order to see which performance you can reach.

Here's a translation 1:1, would be nice to compare it to ndarray to see if ndarray's the problem.

And here's another version with iterators that should be autovectorized. It isn't pretty nor bound-checks free, but it should get the job done.

Edit: switched the iterator version to chunks_exact instead of chunks and added a couple more assertions at the start (should suppress some assertions in the loops)

1 Like

What timing, I just got finished testing my own implementation using slices for the matrix, although i use a lot of iterators

This is competitive with gcc in serial

now to see if i can easily parallelize it with rayon

The indexing solution didnt autovectorize, I suppose because of the bounds checking on every []. But the iterator solution has the same speed as gcc

1 Like

IMO parallelizing this is pretty easy, just slap par_ wherever you can (example). The problem is likely the opposite, where not to parallelize because it could slow things down (for example for short loops the overhead of parallelizing could probably cover the gains).

Btw, how much "competitive" are we talking about?

For solving Ax=A
where A is a 1000x1000 matrix

both gcc and rust take 80ms (in my not super scientific benchmarking setup)

on a ryzen 5800x

Bounds checks are not necessarily unconditional. You can do the check yourself and then the optimiser won't necessarily check it again.

For example in case of Compiler Explorer :

  • foo(): there are 3 bounds checks (and 3 panics)
  • bar(): there are no bounds checks or panics except the manual one

Unfortunately that doesn't always work. I was trying to remove other bound checks from the iterator version and I noticed that LLVM wasn't optimizing even simple repeated asserts. Proof (notice the asserts at lines 10 and 19)

2 Likes

Sure, I didn't try to suggest that they're never generated, I was replying to a message that was suggesting that bounds are checked on every [] (it seems to be quite a common claim).

As for why those duplicate asserts are present in the generated code, I have no idea, I guess MIR could show something useful.

Nice work - on the ndarray and slice implementation. I think we could copy your code to the ndarray bug tracker and maybe use it for some performance investigation. I'm not sure what can be done exactly - help is also wanted.

  • ndarray is more general, handles arrays of a more general memory layout
  • multidimensional indexing/slicing is also bounds checked and the chance of it optimizing out is unfortunately low
  • is there too much "noise" in the ndarray methods (like, for each operation it tries to find a fast path for contiguous arrays and falls back for other arrays and so on - this has a code size and complexity cost)?
  • low level algorithms like this should not be implemented using ndarray but using a library call - blas function or similar

And as always, you might want to check out the nalgebra crate and see how it compares.

Hi, i agree for real use it shouldn't be written in ndarray, although I was curious how well it could optimize

I tried to use zip everywhere, almost no item indexing to not have bounds checking in the hot part of the loop, so considering the delta from the autovectorized version, i think the difference is just that it uses scalar instructions only. So LLVM is unable to see through the high level abstractions to vectorize

I also tried to use dot() to replace some of the inner loops (initially i avoided it to not incur any allocations) and hoped that would compile to blas calls or vectorized dot products. And it didn't seem to have much impact.

Do you want me to open an issue?