SIMD help with code slower in Release mode


I've been trying to play with a bit of SIMD in Rust and I'm seeing a bit of an odd result.

When I run my code in debug mode, the SIMD function is faster. But when I run it in release mode, the normal function is faster.

Can someone explain why that is? Is it because normal_add is vectorized by the compiler? Or am I not doing this correctly at all?

Debug mode is not a good indicator of speed at all, to the point that it's counter-productive to even look at it. In debug mode the zero-cost abstractions are not zero-cost, and there are extra assertions at almost every point.

SystemTime is not suitable for measuring how long things take, because it is not monotonic. Instant is better, but in general such one-shot hand-rolled tests will give you skewed results:

    let now = SystemTime::now();
    normal_add(&mut arr, 5);
    let t2 = now.elapsed().unwrap().as_nanos();

    let now = SystemTime::now();
    normal_add(&mut arr, 5);
    let t3 = now.elapsed().unwrap().as_nanos();

reports that normal_add is 13% faster than… normal_add.

Use Bencher instead, which will run and measure tests better.
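If a benchmark harness is not at hand, a hand-rolled measurement can at least be made less misleading. The sketch below is only an illustration, not a replacement for a proper harness: it uses Instant, does a warm-up run, repeats the measurement and keeps the best time, and passes values through std::hint::black_box so the optimizer cannot drop the work. The body of normal_add, the i32 element type, and the helper name best_of are assumptions, since the original code is not shown in the thread.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Stand-in for the function under test (assumed signature and element type).
fn normal_add(arr: &mut [i32], n: i32) {
    for x in arr.iter_mut() {
        *x = x.wrapping_add(n);
    }
}

/// Hypothetical helper: runs `f` `runs` times and returns the fastest run.
fn best_of<F: FnMut()>(runs: u32, mut f: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        f();
        best = best.min(start.elapsed());
    }
    best
}

fn main() {
    let mut arr = vec![0i32; 1 << 16];

    // Warm-up run so caches and page faults do not land in the first sample.
    normal_add(&mut arr, 5);

    let best = best_of(100, || {
        // black_box hides the inputs from the optimizer.
        normal_add(black_box(&mut arr[..]), black_box(5));
    });

    // Keep the result observable so the loop is not optimized away.
    black_box(&arr);
    println!("best of 100 runs: {} ns", best.as_nanos());
}
```

Taking the minimum rather than a single sample reduces noise from scheduling and cache effects, but it still says nothing about variance, which is one reason a real harness is preferable.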

Try looking at the assembly output. It's nice on, but remember to add optimization flags! -C opt-level=2

In your case it looks like you get a bounds check on every [i+…] access, which completely ruins optimization. You should treat indexing with [i] in Rust as slow and problematic. Fixing the off-by-one error by changing if i+8 < len to if i+7 < len helps LLVM eliminate some bounds checks, but on the other hand for chunk in arr.chunks_exact_mut(8) is shorter, simpler, and eliminates the problem completely.
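To illustrate the chunks_exact_mut suggestion, here is a minimal sketch. It assumes the elements are i32 and that the operation is "add n to every element"; the function name chunked_add is made up for this example, and the original code may differ.

```rust
/// Adds `n` to every element, walking the slice in chunks of eight.
/// Assumption: `i32` elements; the original code is not shown in the thread.
fn chunked_add(arr: &mut [i32], n: i32) {
    let mut chunks = arr.chunks_exact_mut(8);

    // Every chunk is known to hold exactly 8 elements, so the inner loop
    // does not need a per-element bounds check against the outer slice.
    for chunk in &mut chunks {
        for x in chunk.iter_mut() {
            *x = x.wrapping_add(n);
        }
    }

    // Whatever did not fit into a full chunk (fewer than 8 elements).
    for x in chunks.into_remainder() {
        *x = x.wrapping_add(n);
    }
}

fn main() {
    let mut data: Vec<i32> = (0..19).collect();
    chunked_add(&mut data, 5);
    assert_eq!(data[0], 5);
    assert_eq!(data[18], 23);
    println!("{:?}", data);
}
```

There is no manual i+7 < len comparison left to get wrong: the iterator handles the full chunks and into_remainder hands back the leftover tail.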


Thanks for your suggestions. I've updated my code and here is the assembly output.

Interesting to see the bencher work as well, had not used it before.

One more thing you could try is align_to and write bigger chunks of data at a time.

So I tried the last suggestion as well and there wasn't much of a difference. I think SIMD might be better if there were complex calculations involved, but from what I'm seeing, it is not giving me any worthwhile speedup.

Hi, this seemed a bit weird so I tried to do a few enhancements and the SIMD version goes ≈ 17 times faster on my computer (automatic vectorization seems to be on vacation).
Here are the changes:

  • I didn't index into the chunk; I used get_unchecked. The compiler might optimize the bounds check away, but this is a micro-benchmark, so let's assume it doesn't.
  • mem::transmute makes a copy and we don't need ownership, so I used pointer casting instead.
  • Again, indexing isn't a good idea; I used ptr::copy_nonoverlapping to avoid any bounds check.
  • @kornel suggested align_to_mut, but you don't want to align to 32 bits (the array is probably already aligned that way); you want it aligned to 256 bits.
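The 256-bit alignment point can be sketched without any platform-specific intrinsics. The example below is not the playground code: it assumes i32 elements and uses a hypothetical Block type (eight i32 lanes with 32-byte alignment) as a portable stand-in for a 256-bit vector type, leaving the actual vectorization to the compiler.

```rust
/// Hypothetical stand-in for a 256-bit vector: eight `i32` lanes,
/// 32 bytes in size and 32-byte (256-bit) aligned.
#[repr(C, align(32))]
struct Block([i32; 8]);

/// Adds `n` to every element, handling the 256-bit aligned middle part
/// one `Block` at a time. Assumption: `i32` elements.
fn aligned_add(arr: &mut [i32], n: i32) {
    // SAFETY: every bit pattern is a valid `i32`, and `Block` is exactly
    // eight `i32`s with no padding, so reinterpreting the aligned middle
    // of the slice as `Block`s is sound.
    let (head, body, tail) = unsafe { arr.align_to_mut::<Block>() };

    // Elements before the first 32-byte boundary.
    for x in head.iter_mut() {
        *x = x.wrapping_add(n);
    }
    // Aligned blocks: a fixed-size inner loop the compiler can vectorize.
    for block in body.iter_mut() {
        for x in block.0.iter_mut() {
            *x = x.wrapping_add(n);
        }
    }
    // Leftover elements after the last full block.
    for x in tail.iter_mut() {
        *x = x.wrapping_add(n);
    }
}

fn main() {
    let mut data: Vec<i32> = (0..100).collect();
    aligned_add(&mut data, 5);
    assert_eq!(data[0], 5);
    assert_eq!(data[99], 104);
    println!("first: {}, last: {}", data[0], data[99]);
}
```

Calling align_to_mut::<i32>() on an i32 slice would split nothing useful, since the data already has that alignment; asking for a 32-byte aligned type is what produces a middle part suitable for aligned 256-bit loads and stores. Note that align_to_mut is allowed to return a shorter middle than the best possible split, so the head and tail loops are required for correctness.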

Here's the code in the playground.


Now that you mention that we have to align to 256 bits, it actually makes sense.

You've all helped me understand not only SIMD but also Rust better. Thank you so much.
