SIMD help with code slower in Release mode


I've been experimenting with a bit of SIMD in Rust and I'm seeing an odd result.

When I run my code in debug mode, the SIMD function is faster. But when I run it in release mode, the normal function is faster.

Can someone explain why that is? Is it because normal_add is vectorized by the compiler? Or am I not doing this correctly at all?

Debug mode is not a good indicator of speed at all, to the point that it's counter-productive to even look at it. In debug mode all the zero-cost abstractions are not zero cost, plus there are extra assertions at almost every point.

SystemTime is not valid for measuring how long things take. Instant is less invalid, but in general such one-shot hand-rolled tests will give you skewed results:

    let now = SystemTime::now();
    normal_add(&mut arr, 5);
    let t2 = now.elapsed().unwrap().as_nanos();

    let now = SystemTime::now();
    normal_add(&mut arr, 5);
    let t3 = now.elapsed().unwrap().as_nanos();

reports that normal_add is 13% faster than… normal_add.

Use Bencher instead, which will run the code repeatedly and measure it properly.
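If you do want a hand-rolled measurement while you set up a proper benchmark harness, at minimum warm up, repeat the run many times, take the best time, and use std::hint::black_box to stop the optimizer from deleting the work. A minimal sketch, assuming a normal_add like the one in the question (adding a scalar to every element of an i32 slice):

```rust
use std::hint::black_box;
use std::time::Instant;

// Assumed stand-in for the normal_add from the question:
// adds `x` to every element of the slice.
fn normal_add(arr: &mut [i32], x: i32) {
    for v in arr.iter_mut() {
        *v += x;
    }
}

fn main() {
    let mut arr = vec![1i32; 1 << 20];
    // Take the minimum over many runs instead of a single shot;
    // the first iterations also serve as a warm-up.
    let mut best_ns = u128::MAX;
    for _ in 0..100 {
        let now = Instant::now();
        normal_add(black_box(&mut arr), black_box(5));
        best_ns = best_ns.min(now.elapsed().as_nanos());
    }
    println!("best: {best_ns} ns");
}
```

A dedicated benchmark crate still does this better (statistics, outlier detection), but even this removes most of the "13% faster than itself" noise.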

Try looking at the assembly output, but remember to add optimization flags! -C opt-level=2

In your case it looks like you get a bounds check on every [i+…] access, which completely ruins optimization. You should treat indexing with [i] in Rust as slow and problematic. Fixing the off-by-one error of if i+8 < len to if i+7 < len helps LLVM eliminate some bounds checks, but OTOH for chunk in arr.chunks_exact_mut(8) is shorter, simpler, and eliminates the problem completely.
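For illustration, a chunks_exact_mut version might look like this (a sketch; the element type and the exact signature of the function in the question are assumptions):

```rust
// Bounds-check-free version, assuming the function operates on an
// i32 slice as in the question.
fn chunked_add(arr: &mut [i32], x: i32) {
    let mut chunks = arr.chunks_exact_mut(8);
    // Each chunk is known to be exactly 8 elements long, so LLVM can
    // drop the bounds checks and vectorize the inner loop.
    for chunk in &mut chunks {
        for v in chunk {
            *v += x;
        }
    }
    // chunks_exact_mut leaves a tail of fewer than 8 elements,
    // which has to be handled separately.
    for v in chunks.into_remainder() {
        *v += x;
    }
}
```

The key point is that the chunk length is a compile-time constant, so the indices can never be out of bounds.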


Thanks for your suggestions. I've updated my code and here is the assembly output.

Interesting to see the bencher at work as well; I hadn't used it before.


One more thing you could try is align_to and write bigger chunks of data at a time.
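A sketch of what that could look like; the element type (i32) and the block width (8 lanes) are assumptions, not from the original post:

```rust
// View the aligned middle of the slice as 8-lane blocks so the
// optimizer can use full-width vector loads and stores.
fn aligned_add(arr: &mut [i32], x: i32) {
    // Safety: [i32; 8] has the same layout as 8 consecutive i32s,
    // so reinterpreting the aligned middle section is sound.
    let (head, middle, tail) = unsafe { arr.align_to_mut::<[i32; 8]>() };
    // Unaligned head and tail are handled element by element.
    for v in head.iter_mut().chain(tail.iter_mut()) {
        *v += x;
    }
    // The middle is processed one fixed-size block at a time.
    for block in middle {
        for v in block {
            *v += x;
        }
    }
}
```

Note that align_to_mut is unsafe and only promises that the middle part is maximally aligned; head and tail may be non-empty, so they still need the scalar fallback.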
