Wherein we wrangle both autovectorized and stdsimd code to get more performance in a certain benchmarks game…
I think div_and_add doesn’t need #[inline(never)] if you use f64x2 (the SIMD ones).
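For reference, a minimal sketch of what that could look like with the stable SSE2 intrinsics (the real div_and_add lives in the benchmark source, so the function signature and arithmetic here are made up for illustration):

```rust
// A hypothetical f64x2-style div_and_add built on std::arch SSE2 intrinsics.
// The name and the exact computation are assumptions, not the benchmark code.
#[cfg(target_arch = "x86_64")]
fn div_and_add(num: [f64; 2], den: [f64; 2], add: [f64; 2]) -> [f64; 2] {
    use std::arch::x86_64::*;
    // SSE2 is part of the x86_64 baseline, so no #[target_feature] is
    // required and the intrinsics inline cleanly without #[inline(never)].
    unsafe {
        let n = _mm_loadu_pd(num.as_ptr());
        let d = _mm_loadu_pd(den.as_ptr());
        let a = _mm_loadu_pd(add.as_ptr());
        let r = _mm_add_pd(_mm_div_pd(n, d), a);
        let mut out = [0.0f64; 2];
        _mm_storeu_pd(out.as_mut_ptr(), r);
        out
    }
}

#[cfg(target_arch = "x86_64")]
fn main() {
    // (2/2 + 1, 9/3 + 1) = (2, 4)
    let r = div_and_add([2.0, 9.0], [2.0, 3.0], [1.0, 1.0]);
    assert_eq!(r, [2.0, 4.0]);
    println!("{:?}", r);
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```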
Nice! Is the full code available anywhere? Namely:
One interesting wrinkle here is that the _mm_extract_epi8 intrinsic isn’t inlined, so the resulting assembly has a
function call where a single instruction should be.
IIRC, if a vendor intrinsic isn’t inlined, then you probably have a bug with your
target_feature configuration somewhere.
I think here https://github.com/TeXitoi/benchmarksgame-rs
It’s a good deal faster than our current entry and should perform about the same as the gcc entry if the benefits translate linearly to the benchmarksgame machine.
Rust #6 seems about 7 seconds slower than the fastest Rust program?
a whopping 27% faster
Rust #5 seems about 1/3 of a second faster than the next fastest Rust program.
how this version stacks up on the benchmarksgame server
Rust #4 seems to show no result?
@leonardo Good catch – I’ll benchmark to see if it makes any difference. Edit: Holy [insert expletive here]! This makes the code run roughly twice as fast!
@burntsushi You were right – I added the correct
#[cfg(..)]s in my code and the result was even faster. Thank you!
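For anyone curious, the pattern looks roughly like this (the function names are made up for illustration): with #[cfg(target_feature = ..)], the SIMD path only exists in binaries built with the feature enabled, e.g. via RUSTFLAGS="-C target-feature=+sse4.1" or -C target-cpu=native.

```rust
// Compile-time dispatch: which backend() exists is decided entirely
// by the target-feature flags the binary was built with.
#[cfg(target_feature = "sse4.1")]
fn backend() -> &'static str {
    "sse4.1"
}

#[cfg(not(target_feature = "sse4.1"))]
fn backend() -> &'static str {
    "scalar fallback"
}

fn main() {
    println!("using the {} code path", backend());
}
```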
@igouy I found out the issue with both fannkuch_redux and n_body. In the former case I was inadvertently using an SSE4.1 instruction, which your CPU probably lacks. In the latter case, the f64x4 autovectorization relies on the AVX instruction set to vectorize properly. I suspect your server lacks that one as well.
I have sent you an updated version of the fannkuch_redux benchmark and will look into finding a better version for n_body.
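To illustrate the n_body point, four-lane f64 arithmetic of roughly this shape is what LLVM autovectorizes into single AVX instructions when built with -C target-feature=+avx; without AVX it falls back to SSE2 halves or scalar code. The F64x4 alias and the add function are just a sketch, not the benchmark source:

```rust
// A plain [f64; 4] loop that the autovectorizer can turn into one
// 256-bit vaddpd when AVX is enabled at compile time.
type F64x4 = [f64; 4];

fn add(a: F64x4, b: F64x4) -> F64x4 {
    let mut r = [0.0; 4];
    for i in 0..4 {
        r[i] = a[i] + b[i];
    }
    r
}

fn main() {
    let r = add([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]);
    assert_eq!(r, [5.0; 4]);
    println!("{:?}", r);
}
```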
Do you know if the hardware/CPU is mentioned somewhere? No AVX or SSE4.1 would make it more than 10 years old, right?
That Q6600 CPU does indeed predate the introduction of SSE4.1 in Penryn and is a little over 10 years old.
On my machine this led to a nice speedup – however on @igouy’s server it slowed down the whole thing considerably. We’ll have to back out that last change.
Which other programs use SIMD for fannkuch-redux?
There’s a single-threaded version from the packed_simd project; however, that’s slower than the current rayon-based versions on 4 cores, even on a Skylake.
As far as I know, my version is the first that uses both SIMD and multicore, though the C or FORTRAN versions might enjoy some autovectorization.
So, on second-thoughts, let’s not start another spiral of rewriting fannkuch-redux programs (this time to use SIMD).
The benchmarks game tasks that already have programs which use SIMD are still fair game.
IIRC, for many years one claim has been that Rust needed SIMD to compete on n-body, with the counter-claim that it was really all about LLVM loop unrolling.