Debug mode is not a good indicator of speed at all, to the point it's counter-productive to even look at it. In debug mode the zero-cost abstractions are not zero cost, plus there are extra assertions at almost every point.
SystemTime is not valid for measuring the duration of things. Instant is less invalid, but in general such one-shot hand-rolled tests will give you skewed results:
let now = SystemTime::now();
normal_add(&mut arr, 5);
let t2 = now.elapsed().unwrap().as_nanos();
let now = SystemTime::now();
normal_add(&mut arr, 5);
let t3 = now.elapsed().unwrap().as_nanos();
reports that normal_add is 13% faster than… normal_add.
Use Bencher instead, which will run and measure tests better.
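A benchmark harness takes care of warm-up, repetition, and statistics for you. Purely to illustrate why repetition matters, here is a minimal std-only sketch that takes the median of many Instant measurements instead of one SystemTime reading. The body and signature of normal_add are stand-ins (the thread does not show them), and median_time and time_normal_add are hypothetical helper names. This is not a replacement for a real harness.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Stand-in for the thread's `normal_add`; i32 elements and wrapping
/// addition are assumptions, the real function is not shown.
fn normal_add(arr: &mut [i32], n: i32) {
    for x in arr.iter_mut() {
        *x = x.wrapping_add(n);
    }
}

/// Hypothetical helper: runs `f` `samples` times and returns the median
/// duration, which is less sensitive to outliers than a single reading.
fn median_time<F: FnMut()>(samples: usize, mut f: F) -> Duration {
    assert!(samples > 0, "need at least one sample");
    let mut times: Vec<Duration> = (0..samples)
        .map(|_| {
            let start = Instant::now(); // monotonic clock, unlike SystemTime
            f();
            start.elapsed()
        })
        .collect();
    times.sort();
    times[times.len() / 2]
}

/// Hypothetical helper: times `normal_add` on a buffer of `len` elements.
/// `black_box` discourages the optimizer from removing the measured work.
fn time_normal_add(len: usize, samples: usize) -> Duration {
    let mut arr = vec![1i32; len];
    median_time(samples, || {
        normal_add(black_box(&mut arr[..]), black_box(5));
    })
}
```

Calling time_normal_add(1 << 20, 101) in a release build gives one median figure; comparing two such medians is still rough, but it is far less noisy than comparing two single runs.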
Try looking at the assembly output. It's nice on https://rust.godbolt.org, but remember to add optimization flags! -C opt-level=2
In your case it looks like you get a bounds check on every [i+…] access, which completely ruins optimization. You should treat indexing with [i] in Rust as slow and problematic. Fixing the off-by-one error of if i+8 < len to if i+7 < len helps LLVM eliminate some bounds checks, but OTOH for chunk in arr.chunks_exact_mut(8) is shorter, simpler, and eliminates the problem completely.
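A minimal sketch of the chunks_exact_mut(8) approach, assuming the array holds i32 values and that adding a constant to every element is the goal (chunked_add is a hypothetical name, and wrapping addition is an assumption to keep overflow behaviour the same in debug and release):

```rust
/// Adds `n` to every element, processing eight elements per chunk.
/// No `[i]` indexing appears, so there are no per-access bounds checks
/// for the optimizer to prove away.
fn chunked_add(arr: &mut [i32], n: i32) {
    let mut chunks = arr.chunks_exact_mut(8);
    // Each `chunk` is known to have exactly 8 elements.
    for chunk in &mut chunks {
        for x in chunk.iter_mut() {
            *x = x.wrapping_add(n);
        }
    }
    // Fewer than 8 elements may be left over at the end.
    for x in chunks.into_remainder() {
        *x = x.wrapping_add(n);
    }
}
```

The fixed chunk length is what gives LLVM a chance to vectorize the inner loop; whether it actually does is worth confirming in the assembly output.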
So I tried the last suggestion as well and there wasn't much of a difference. I think SIMD might be better if there were complex calculations involved, but from what I'm seeing, it is not giving me any worthwhile speedup.
Hi, this seemed a bit weird so I tried to do a few enhancements and the SIMD version goes ≈ 17 times faster on my computer (automatic vectorization seems to be on vacation).
Here's the changes:
I didn't use indexing in the chunk, I used get_unchecked. The compiler might optimize the bounds check away, but we are making a micro benchmark so let's assume it doesn't
mem::transmute makes a copy and we don't need ownership so let's use pointer casting
again indexing isn't a good idea, I used ptr::copy_nonoverlapping to avoid any bounds check
@kornel suggested align_to_mut but you don't want to align to 32 bits, the array is probably already aligned this way, you want it aligned to 256 bits
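To make the 256-bit alignment point concrete, here is a hedged sketch of align_to_mut with a 32-byte-aligned block type. Block and aligned_add are hypothetical names, the i32 element type is an assumption, and the inner loop is plain scalar code standing in for the SIMD load/add/store, so this shows only the alignment split and not the actual benchmark code.

```rust
/// Hypothetical 256-bit block: eight i32 lanes, aligned to 32 bytes.
#[repr(C, align(32))]
struct Block([i32; 8]);

/// Adds `n` to every element, handling the 32-byte-aligned middle of the
/// slice as whole blocks and the unaligned edges element by element.
fn aligned_add(arr: &mut [i32], n: i32) {
    // SAFETY: `Block` is exactly eight i32 values with no padding, and
    // every bit pattern is a valid i32, so viewing aligned runs of i32
    // as `Block` is sound.
    let (head, body, tail) = unsafe { arr.align_to_mut::<Block>() };

    // `head` and `tail` hold whatever does not fit an aligned block.
    for x in head.iter_mut().chain(tail.iter_mut()) {
        *x = x.wrapping_add(n);
    }

    // Every `block` starts on a 32-byte boundary, which is what an
    // aligned 256-bit load or store would require.
    for block in body.iter_mut() {
        for lane in block.0.iter_mut() {
            *lane = lane.wrapping_add(n);
        }
    }
}
```

The result is the same however the slice happens to be split, since align_to_mut is allowed to put more elements in head and tail than strictly necessary.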