Performance question

I'm a college student learning Rust and diving into implementing high-performance math.

I tried implementing Gaussian elimination in both Rust and C++. The Rust code is ~4x slower on my computer (using target-cpu=native / -march=native).

(With par_azip, as shown in the gist, it beats C++, but that's using 16 cores vs. 1.)


I'm mostly curious whether there's something I could be doing to make it autovectorize the way GCC does for the C++ version.

Timing (1000x1000 matrix, Ryzen 5800X):

  • 80 ms for parallel Rust
  • 400 ms for serial Rust
  • 100 ms for C++ (not parallel)

Project setup for running everything: I don't use an external tool for benchmarking, just std::chrono in C++ and tempus_fugit in Rust.

You could look at the generated LLVM IR/assembly to figure out whether there's something really obvious that Rust or LLVM is doing inefficiently. LLVM is itself capable of autovectorization, so if that isn't happening, something is probably hindering that optimization step.
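For reference, one way to dump the optimized IR and assembly with cargo (output file names depend on your crate name and a build hash):

```shell
# Build with native CPU features and emit LLVM IR + assembly for inspection.
RUSTFLAGS="-C target-cpu=native" cargo rustc --release -- --emit=llvm-ir,asm

# The output lands in target/release/deps/ as <crate>-<hash>.ll and <crate>-<hash>.s
```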

When performing matrix and vector operations, one of the most frequent causes of slowdown is array bounds checking. I'm not familiar enough with ndarray to know whether that's what's happening here, though.


ndarray is much more flexible than a matrix flattened into an array, so it's possible that you're paying for that flexibility. You could try a 1:1 translation, although I'd guess bounds checks will still slow down the Rust version, though maybe not by much.
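A 1:1 translation along those lines might look like the following sketch (my own hypothetical layout: row-major n x n matrix in a flat slice; forward elimination only, no pivoting):

```rust
// Forward elimination on a row-major n x n matrix stored in a flat slice,
// with the right-hand side in `b`. No partial pivoting, for brevity.
fn forward_eliminate(a: &mut [f64], b: &mut [f64], n: usize) {
    assert_eq!(a.len(), n * n);
    assert_eq!(b.len(), n);
    for k in 0..n {
        for i in (k + 1)..n {
            // Scale factor that zeroes out column k of row i.
            let factor = a[i * n + k] / a[k * n + k];
            for j in k..n {
                a[i * n + j] -= factor * a[k * n + j];
            }
            b[i] -= factor * b[k];
        }
    }
}
```

The `i * n + j` indexing is exactly what the C version does, so this isolates ndarray's overhead from everything else.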

PS: AFAIK restrict is not valid C++; I guess you meant C?

Did you try both Clang and GCC for the C version? They often auto-vectorize code in different ways; your issue could also be related to LLVM if you're comparing the performance against GCC.

Oh yes, I used it from C++ by #define-ing the restrict away.

Honestly, I was hoping that with zip, iterators, and whatnot I could be competitive with C++; I was hoping for zero-cost abstractions.

Maybe I should look for other array libraries

I did try Clang; it takes about the same amount of time. A little slower, but not to the extent of Rust.


Do you need to use stable Rust, or can you go nightly?

We are just a step away from min const generics, which could help a bit. I'm not sure you could get all the abstractions ndarray gives you, but at least you could see the performance of a single-threaded implementation very similar to the C one. It's hard to estimate whether multithreading is an optimization or a pessimization in this case.

Using par_azip (in certain places, not too much) made it faster; in fact it was competitive with GCC, but the C++ version didn't use all the cores, so it's not really comparable.

I am using nightly; this isn't for a product, just playing around with different languages and writing performant code.

Do you mean using const-generic arrays instead of ndarray? I could do that, but then it would just be a direct translation of the C/C++. I wanted to see if it could be done in a higher-level way using iterators.


You can still use iterator features (including some interesting ones on nightly) to get a partial win-win in both performance and expressiveness.

As you can imagine, const generics are a highly desired feature because they let you write code that is both performant and expressive when sizes are known at compile time.
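To make that concrete, here's a minimal const-generic sketch (the type and method are hypothetical; min const generics was a nightly feature at the time of this thread):

```rust
// A fixed-size square matrix whose dimension is a compile-time constant,
// so the backing storage is a plain stack array with no indirection.
struct Matrix<const N: usize> {
    data: [[f64; N]; N],
}

impl<const N: usize> Matrix<N> {
    // Build the N x N identity matrix.
    fn identity() -> Self {
        let mut data = [[0.0; N]; N];
        for i in 0..N {
            data[i][i] = 1.0;
        }
        Matrix { data }
    }
}
```

Because N is known at compile time, the optimizer can fully unroll or vectorize loops over it, much like a C array of fixed size.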

But maybe you need dynamic sizes for the matrices; in that case const generics won't help you much. Even then, I think it would be interesting to implement the C algorithm using just &[f32]s and iterators to see what performance you can reach.

Here's a 1:1 translation; it would be nice to compare it to ndarray to see whether ndarray is the problem.

And here's another version with iterators that should be autovectorized. It isn't pretty, nor is it bounds-check free, but it should get the job done.

Edit: switched the iterator version to chunks_exact instead of chunks and added a couple more assertions at the start (which should suppress some checks in the loops).
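The core trick in an iterator version is writing the row update without any indexing. A sketch of what that inner loop might look like (function name and signature are my own, not from the gist):

```rust
// row_i[j] -= factor * row_k[j] for all j, expressed with iterators so the
// only bounds check is the single length assertion up front. With equal
// lengths proven, LLVM can vectorize the loop body.
fn axpy_row(row_i: &mut [f64], row_k: &[f64], factor: f64) {
    assert_eq!(row_i.len(), row_k.len());
    for (x, &y) in row_i.iter_mut().zip(row_k.iter()) {
        *x -= factor * y;
    }
}
```

The elimination loop then just calls this on each row below the pivot, with the slices obtained via `split_at_mut` or `chunks_exact_mut`.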


What timing! I just finished testing my own implementation using slices for the matrix, although I use a lot of iterators.

This is competitive with GCC in serial.

Now to see if I can easily parallelize it with Rayon.

The indexing solution didn't autovectorize, I suppose because of the bounds checking on every []. But the iterator solution matches GCC's speed.


IMO parallelizing this is pretty easy: just slap par_ wherever you can (example). The harder problem is deciding where not to parallelize, since for short loops the overhead of spawning work can outweigh the gains.
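The parallel split is over the rows below the pivot. Rayon's `par_chunks_mut` does this ergonomically; here is a dependency-free sketch of the same idea using only std scoped threads (the function, the `Vec<f64>`-per-row layout, and the thread count are my own assumptions):

```rust
// One pivot step of elimination: the trailing rows are split into chunks and
// updated on separate threads. Each row update is independent, so no
// synchronization beyond the scope join is needed.
fn eliminate_step(rows: &mut [Vec<f64>], pivot: &[f64], k: usize) {
    let n_threads = 4; // hypothetical; rayon would size this for you
    let chunk_size = rows.len().div_ceil(n_threads).max(1);
    std::thread::scope(|s| {
        for chunk in rows.chunks_mut(chunk_size) {
            s.spawn(move || {
                for row in chunk.iter_mut() {
                    let factor = row[k] / pivot[k];
                    for j in k..row.len() {
                        row[j] -= factor * pivot[j];
                    }
                }
            });
        }
    });
}
```

This also shows why short loops hurt: near the bottom of the matrix there are only a few short rows left, and the per-step spawn/join cost dominates.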

By the way, how competitive are we talking?

For solving Ax = b, where A is a 1000x1000 matrix, both GCC and Rust take 80 ms (in my not-super-scientific benchmarking setup) on a Ryzen 5800X.

Bounds checks are not necessarily unconditional. You can do the check yourself and then the optimiser won't necessarily check it again.

For example, in this Compiler Explorer snippet:

  • foo(): there are 3 bounds checks (and 3 panics)
  • bar(): there are no bounds checks or panics except the manual one
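The pattern behind that bar() example looks something like this (my own sketch, not the Compiler Explorer code itself):

```rust
// A single up-front length assertion lets LLVM prove all three later
// indexing operations are in bounds, so their individual checks (and the
// associated panic paths) are optimized away.
fn sum3_checked(xs: &[f64]) -> f64 {
    assert!(xs.len() >= 3); // the one manual check
    xs[0] + xs[1] + xs[2]   // no further bounds checks in the generated code
}
```

Without the assertion, each of the three indexings keeps its own check and panic branch, which is what blocks vectorization in longer loops.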

Unfortunately that doesn't always work. I was trying to remove other bounds checks from the iterator version and noticed that LLVM wasn't optimizing away even simple repeated asserts. Proof (notice the asserts at lines 10 and 19).


Sure, I didn't mean to suggest that they're never generated; I was replying to a message suggesting that bounds are checked on every [] (which seems to be quite a common claim).

As for why those duplicate asserts are present in the generated code, I have no idea; I guess MIR could show something useful.

Nice work on the ndarray and slice implementations. I think we could copy your code to the ndarray bug tracker and maybe use it for some performance investigation. I'm not sure exactly what can be done; help is also wanted.

  • ndarray is more general: it handles arrays with more general memory layouts
  • multidimensional indexing/slicing is also bounds checked, and the chance of it optimizing out is unfortunately low
  • is there too much "noise" in the ndarray methods (e.g. for each operation it tries to find a fast path for contiguous arrays and falls back for other layouts, which has a code-size and complexity cost)?
  • low-level algorithms like this should not be implemented with ndarray but via a library call (a BLAS function or similar)

And as always, you might want to check out the nalgebra crate and see how it compares.

Hi, I agree that for real use it shouldn't be written in ndarray, although I was curious how well it could optimize.

I tried to use zip everywhere, with almost no item indexing, to keep bounds checks out of the hot part of the loop. So, considering the delta from the autovectorized version, I think the difference is just that it uses scalar instructions only; LLVM is unable to see through the high-level abstractions to vectorize.

I also tried using dot() to replace some of the inner loops (initially I avoided it to not incur any allocations) and hoped it would compile to BLAS calls or vectorized dot products, but it didn't seem to have much impact.
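One possible reason a dot product doesn't help much, sketched below with plain slices (my own function, not ndarray's dot()): an f64 reduction has a fixed evaluation order, so without reassociation LLVM can't use packed adds even when the multiplies are easy to vectorize.

```rust
// Allocation-free dot product over slices. The length assertion removes
// per-element bounds checks, but the sequential f64 sum still limits how
// much of this loop LLVM can turn into SIMD without manual chunking.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

A common workaround is accumulating into several partial sums (e.g. over `chunks_exact(4)`) and adding them at the end, which gives the compiler the reassociation it needs.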

Do you want me to open an issue?