It seems like this has been discussed before, but I just wanted to report my renewed surprise.
I’m starting some performance testing of Rust for use in HPC, specifically investigating the ability of rustc+LLVM to auto-vectorize code, and the results on the simplest test surprised me.
Please see these simple benchmarks at this Gist.
Here are the results for VEC_SIZE=1000, building with target-cpu=native on a MacBook Pro 2017:
test bench_vector_add ... bench: 2,255 ns/iter (+/- 384)
test bench_vector_add_slice ... bench: 1,225 ns/iter (+/- 148)
test bench_vector_add_unsafe ... bench: 1,763 ns/iter (+/- 219)
test bench_vector_add_zip ... bench: 88 ns/iter (+/- 14)
test bench_vector_add_zip_collect ... bench: 114 ns/iter (+/- 35)
For VEC_SIZE=10000:
test bench_vector_add ... bench: 22,244 ns/iter (+/- 4,870)
test bench_vector_add_slice ... bench: 12,504 ns/iter (+/- 3,199)
test bench_vector_add_unsafe ... bench: 17,587 ns/iter (+/- 2,059)
test bench_vector_add_zip ... bench: 2,441 ns/iter (+/- 463)
test bench_vector_add_zip_collect ... bench: 2,590 ns/iter (+/- 325)
Sure enough, looking at the assembly, bench_vector_add_zip has loop unrolling and vector operations, but bench_vector_add is bogged down in bounds checks and processes the arrays one element at a time. bench_vector_add_unsafe was the most surprising: it seems to impose minimal abstractions for the compiler to see through, yet it still performs poorly. Using the "slice trick" I saw elsewhere gave me a small performance boost, but it still falls well short of the izip-based versions (bench_vector_add_zip and bench_vector_add_zip_collect).
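For anyone reading without the Gist open, here is a rough sketch of the five variants being compared. The Gist itself isn't reproduced in this post, so the function names, signatures, the f32 element type, and the bodies below are illustrative assumptions, not the exact benchmarked code. In particular, the unsafe version here uses get_unchecked, and the zip versions use the standard library's zip rather than itertools.

```rust
// Hypothetical reconstructions of the benchmarked variants.
// The real Gist may differ in types, signatures, and details.

/// Indexed loop over Vecs: every `a[i]`, `b[i]`, `c[i]` is bounds-checked
/// and may panic if the lengths differ.
pub fn vector_add(a: &Vec<f32>, b: &Vec<f32>, c: &mut Vec<f32>) {
    for i in 0..a.len() {
        c[i] = a[i] + b[i];
    }
}

/// "Slice trick": re-slice everything to one common length up front, so the
/// loop bound and all three slice lengths are visibly the same value.
pub fn vector_add_slice(a: &[f32], b: &[f32], c: &mut [f32]) {
    let n = a.len().min(b.len()).min(c.len());
    let (a, b, c) = (&a[..n], &b[..n], &mut c[..n]);
    for i in 0..n {
        c[i] = a[i] + b[i];
    }
}

/// Indexed loop with the bounds checks removed by hand.
pub fn vector_add_unsafe(a: &[f32], b: &[f32], c: &mut [f32]) {
    let n = a.len().min(b.len()).min(c.len());
    for i in 0..n {
        // SAFETY: i < n, and n is no larger than the length of any slice.
        unsafe {
            *c.get_unchecked_mut(i) = *a.get_unchecked(i) + *b.get_unchecked(i);
        }
    }
}

/// Iterator version: no indexing at all, so there is nothing to bounds-check.
pub fn vector_add_zip(a: &[f32], b: &[f32], c: &mut [f32]) {
    for ((x, y), z) in a.iter().zip(b.iter()).zip(c.iter_mut()) {
        *z = x + y;
    }
}

/// Iterator version that allocates and returns the output.
pub fn vector_add_zip_collect(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b.iter()).map(|(x, y)| x + y).collect()
}

fn main() {
    // Small integer-valued inputs so every sum is exact in f32.
    let a: Vec<f32> = (0..8).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..8).map(|i| (2 * i) as f32).collect();
    let expected = vector_add_zip_collect(&a, &b);

    let mut c = vec![0.0f32; a.len()];
    vector_add(&a, &b, &mut c);
    assert_eq!(c, expected);

    let mut c = vec![0.0f32; a.len()];
    vector_add_slice(&a, &b, &mut c);
    assert_eq!(c, expected);

    let mut c = vec![0.0f32; a.len()];
    vector_add_unsafe(&a, &b, &mut c);
    assert_eq!(c, expected);

    let mut c = vec![0.0f32; a.len()];
    vector_add_zip(&a, &b, &mut c);
    assert_eq!(c, expected);

    println!("all variants agree");
}
```

All five compute the same result. The sketch only shows how the loops are shaped, and whether a given variant actually vectorizes will depend on the compiler version and flags.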
So, to get to the point: it is very surprising that the (subjectively) simplest version of this code performs the worst in these benchmarks. Is there any reason that bench_vector_add can't, with sufficient compiler work, be optimized to essentially the same performance as bench_vector_add_zip? Or is there something fundamental to bounds checking that breaks vectorization? This very much seems like a sharp edge to me.