Surprising Auto-Vectorization/Optimization Behavior


#1

It seems like this has been discussed before, but I just wanted to report my renewed surprise.

I’m starting some work doing some performance testing of rust for use in HPC, specifically investigating rustc+LLVM’s ability to auto vectorize code, and the results on the simplest test surprised me.

Please see these simple benchmarks at this Gist.

Here are the results for VEC_SIZE=1000, building with target-cpu=native on a MacBook Pro 2017:

test bench_vector_add             ... bench:       2,255 ns/iter (+/- 384)
test bench_vector_add_slice       ... bench:       1,225 ns/iter (+/- 148)
test bench_vector_add_unsafe      ... bench:       1,763 ns/iter (+/- 219)
test bench_vector_add_zip         ... bench:          88 ns/iter (+/- 14)
test bench_vector_add_zip_collect ... bench:         114 ns/iter (+/- 35)

For VEC_SIZE=10000:

test bench_vector_add             ... bench:      22,244 ns/iter (+/- 4,870)
test bench_vector_add_slice       ... bench:      12,504 ns/iter (+/- 3,199)
test bench_vector_add_unsafe      ... bench:      17,587 ns/iter (+/- 2,059)
test bench_vector_add_zip         ... bench:       2,441 ns/iter (+/- 463)
test bench_vector_add_zip_collect ... bench:       2,590 ns/iter (+/- 325)

Sure enough, looking at the assembly, bench_vector_add_zip has loop unrolling and vector operations, but bench_vector_add is punting around with bounds checking, and is processing the arrays one element at a time. bench_vector_add_unsafe was the most surprising since it seems like it imposes minimal abstractions for the compiler to see through, but still performs poorly. Using the “slice trick” I saw elsewhere got me a little performance boost, but performance still suffers compared to the izip versions of this code.

So let’s get to the point: it is very surprising behavior that the (subjectively) simplest version of this code performs the worst in these benchmarks. Is there any reason that bench_vector_add can’t be, with sufficient compiler work, optimized to essentially parity with bench_vector_add_zip? Or is there something fundamental to the bounds checking that breaks vectorization? This very much seems like a sharp edge to me.


#2

EDIT: It seems there is an optimizer regression in current nightlies. Compare the ASM for this playground example, which I wrote for a previous topic on a similar matter, in beta and nightly builds.

I would also be curious about this one. I don’t understand why, for example, rustc doesn’t manage to extract bounds checks out of vector index loops automatically. I guess that for some reason, the optimizer doesn’t manage to prove that a vector’s length will not change during iteration (which would be why using a constant-length slice improves performance), but why doesn’t it, and can it be changed?

The first rule of code optimization is to have no observable effect on the code’s behaviour. If your code may panic at iteration N of the loop, then the optimizer is not allowed to vectorize because it could start the processing of iteration N+1 before the panic, changing the behaviour of the program with respect to the scalar version.


#3

Hm… I wonder if I should just roll my own benchmarking that works on stable. I’ll have to look into this claim that it is an optimizer regression and submit a bug report.


#4

As for

If the time of the panic is well defined, I can imagine a solution where the compiler generates two code paths - one where the bounds are verified up front and the fast code is executed, and a second code path where the code runs slowly and eventually panics.


#5

You’re totally right. I installed bluss’ bencher for stable and re-ran my benchmarks, getting these results:

VEC_SIZE=1000

test bench_vector_add             ... bench:         109 ns/iter (+/- 33)
test bench_vector_add_slice       ... bench:         105 ns/iter (+/- 33)
test bench_vector_add_unsafe      ... bench:         115 ns/iter (+/- 11)
test bench_vector_add_zip         ... bench:         450 ns/iter (+/- 55)
test bench_vector_add_zip_collect ... bench:         630 ns/iter (+/- 67)

VEC_SIZE=10000

test bench_vector_add             ... bench:       2,413 ns/iter (+/- 532)
test bench_vector_add_slice       ... bench:       2,443 ns/iter (+/- 561)
test bench_vector_add_unsafe      ... bench:       2,419 ns/iter (+/- 324)
test bench_vector_add_zip         ... bench:       5,431 ns/iter (+/- 980)
test bench_vector_add_zip_collect ... bench:       6,366 ns/iter (+/- 1,719)

The results are certainly interesting relative to their nightly versions - the equivalence is essentially stable:bench_vector_add == stable:bench_vector_add_slice == stable:bench_vector_add_unsafe == nightly:bench_vector_add_zip.

I’ll definitely file a bug with these results.


#6

It looks like there is already an issue filed for this problem: https://github.com/rust-lang/rust/issues/46863