Hi - I'm playing with optimised sort routines. I have compared the code generated from some C++ and some Rust that both compute a 32-element sorting network. The C++ code benchmarks at about a third faster than the Rust code (roughly 70 ns vs 100 ns). The assembly for the C++ uses vectorised ops, whereas the Rust does not.
The main difference that I can see is that the C++ code 'cheats' by providing a hand-rolled implementation of min/max. Rustc ends up producing something that ultimately looks pretty similar to that hand-rolled assembly, but in the Rust case it doesn't then go on to vectorise it.
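For context, the compare-exchange primitive at the core of the network is essentially this shape in the Rust version (a minimal sketch rather than my exact code; the `u32` element type and the function name are just for illustration):

```rust
/// One compare-exchange step of the sorting network: afterwards the
/// element at `i` holds the smaller value and the element at `j` the larger.
#[inline(always)]
fn compare_exchange(v: &mut [u32; 32], i: usize, j: usize) {
    let (a, b) = (v[i], v[j]);
    v[i] = a.min(b);
    v[j] = a.max(b);
}
```

The full 32-element network is just a fixed sequence of these calls with hard-coded index pairs, so in principle the whole thing should be open to vectorisation.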
My target is an i7-8700 (Skylake) with SSE4.2 (-msse4.2). I'm compiling at optimization level 3.
So my question is two-fold: why is the C++ code faster, and why is rustc not able to achieve this?
Also, did you benchmark the C++ with clang? Rust's backend is LLVM, so I'd expect the results to be more similar.
If so, the remaining difference seems to be all about instruction scheduling and register allocation, and that kind of optimization is also affected by the target CPU model. Pass -march=skylake to the C++ compiler and -C target-cpu=skylake to rustc and see whether it makes a difference.
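Concretely, something like this (assuming a sort.cpp source file and an ordinary Cargo project; RUSTFLAGS is just one way to pass the codegen flag):

```
g++ -O3 -march=skylake sort.cpp -o sort
RUSTFLAGS="-C target-cpu=skylake" cargo build --release
```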
Thanks! That's really helpful. I benchmarked the C++ using `g++ sort.cpp -O2 -o sort -msse4`, which I assume uses whatever the default C++ toolchain is on my system. I benchmarked the Rust with the standard nightly toolchain, opt-level = 3, lto = "thin", debug-assertions = false, overflow-checks = false. So it may be entirely down to the C++ not going through LLVM for code generation.
Sorry for the noob errors - this is the first project where I've ever needed to shave off microseconds, so it's the first time I've spent any time looking at the generated assembly. I'm used to working on the JVM, so this is all a bit of a learning experience for me.