Vectorisation for min/max sorting

Hi - I'm playing with optimised sort routines. I have compared code generated from some C++ and rust that both compute a 32 element sorting network. The C++ code benchmarks at about 1/3 faster than the rust code (approx 70ns vs 100ns). The assembly for the C++ is using vectorised ops, where as the rust is not.

c++: https://rust.godbolt.org/z/prJRnu rust: https://rust.godbolt.org/z/mQxUgk

The main difference that I can see is that the C++ code 'cheats' by providing a hand-rolled implementation of min/max. Rustc is ending up producing something that ultimately looks pretty similar to the hand-rolled assembly, but in the rust case it isn't then going on to vectorising it.

My target is i7-8700, skylark, msse4.2. I'm compiling with opitmization level 3.

So my question is two-fold. Why is the c++ code faster? Why is rustc not able to achieve this?

Please clean your code before doing a comparison. In fact, there are multiple differences between your C++ code and Rust code.

  • In C++, inKey is copied to dstKeys but in Rust, the operation is all computed on a &mut [u32; 32] without copy.
  • C++ uses the signed int but Rust uses unsigned u32.
  • Not affecting the assembly but:
    • inVals and dstVals in C++ are unused.
    • bool WithValues template argument is also unused.
    • Unnecessary function indirection (_SortingNetwork_16 and SortingNetwork_16).

In fact, SSE instructions are only used for memory copy and that is why those instructions are not seen in the Rust assembly. You are misled!

I cleaned the codes for you https://rust.godbolt.org/z/zxfX2M. The assemblies are more similar now.

Also, did you benchmark with clang? Rust's backend is LLVM. A more similar result is expected.

Then, the difference seems all about instruction scheduling and register allocation. That kind of optimization is also affected by the target CPU model. Put -march=skylake for C++ and -C target-cpu=skylake for Rust to see the difference.

3 Likes

Thanks! That's really helpful. I benchmarked the c++ using g++ sort.cpp -O2 -o sort -msse4 which I assume uses whatever the default c++ coolchain is on my system. I benchmarked the rust with the standard nightly toolchain, opt-level3, lto=thin, debug-assertions=false, overflow-checks=false. So it may be entirely down to the c++ not going through llvm for code-gen.

Sorry for the noob errors - this is the first project where I've ever needed to shave of usecs, so it's the first time I've spent any time looking at the generated assembly. I'm used to working on the JVM, so this is all a bit of a learning experience for me :wink:

1 Like
  1. g++ is the c++ compiler of gcc, not clang.
  2. You are compiling the c++ code with -msse4. For rustc you would need to use -Ctarget-feature=+sse4.
1 Like