How to see auto-vectorization in action?

When I was processing video frames in C (for example, OR-ing them with masks), I got a huge performance boost by switching from RGB pixels to RGBA pixels, simply because the pixels then aligned with 4-byte boundaries and the processing could be vectorized.

Now with Rust I'm trying to get a feeling for when I can rely on optimizations and when I need assembly code. So I was trying a few examples to see vectorization in action and I found no vectorization at all.

pub fn main() {
    let mut v = vec![1; 1024];
    process_it(&mut v);
    println!("{}", sum(&v)); // Just to make sure v is not discarded
}

fn sum(buf: &Vec<u32>) -> u32 {
    let mut s = 0;
    for i in buf {
        s += *i as u32;
    }
    s
}

Filling a Vec of u32:

fn process_it(v: &mut Vec<u32>) {
    for x in 0..1024 {
        v[x] = 2;
    }
}

Resulting assembly:

        push    rax
        xor     ecx, ecx
        mov     rsi, qword ptr [rdi + 16]
        cmp     rsi, rcx
        jbe     .LBB3_2
        lea     rax, [rcx + 1]
        mov     rdx, qword ptr [rdi]
        mov     dword ptr [rdx + 4*rcx], 2
        mov     rsi, qword ptr [rdi + 16]
        cmp     rsi, rax
        jbe     .LBB3_3
        mov     rax, qword ptr [rdi]
        mov     dword ptr [rax + 4*rcx + 4], 2
        add     rcx, 2
        cmp     rcx, 1024
        jne     .LBB3_1
        pop     rax
        mov     rax, rcx
        lea     rdx, [rip + .L__unnamed_2]
        mov     rdi, rax
        call    qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]

There are two optimization remarks:

note: optimization remark for slp-vectorizer at /rustc/1836e3b42a5b2f37fd79104eedbe8f48a5afdee6/src/liballoc/ Stores SLP vectorized with cost -1 and with tree size 2
note: optimization analysis for loop-vectorize at /rustc/1836e3b42a5b2f37fd79104eedbe8f48a5afdee6/src/libcore/iter/ loop not vectorized: loop control flow is not understood by vectorizer

Godbolt link

If I'm reading this correctly (my knowledge of assembly is rather limited), the loop was unrolled by a factor of 2, but nothing was vectorized.

Further examples I tried were...

The same but with a Vec of u8:

fn process_it(v: &mut Vec<u8>) {
    for x in 0..1024 {
        v[x] = 2;
    }
}

Godbolt link

Adding the second half of the Vec to the first half, because I believe addition can be vectorized as well:

fn process_it(v: &mut Vec<u32>) {
    for x in 0..512 {
        v[x] += v[x + 512];
    }
}

Again, the loop was unrolled by 2 but nothing was vectorized.

Godbolt link

Is there any working example of vectorization? Maybe I'm using the wrong compiler options? But these are no different from the ones that cargo build uses, are they?

What happens if you index out of bounds? The indexing will panic, and that possible panic breaks auto-vectorization. Adding this line should improve your codegen:

let slice = &mut v[..1024];

Add this to the beginning of each function, then only operate on the slice. You should see auto-vectorization. If that doesn't work, use iterators.

(You'll notice that in your first example, where you used iterators, there was auto-vectorization.)
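For illustration, here is a minimal sketch of the slicing suggestion applied to the fill and add examples from the question. The function names are made up for this sketch, and whether the loops actually get vectorized depends on the compiler version and flags, so it is worth checking the generated assembly rather than assuming.

```rust
/// Hypothetical variant of the question's fill example.
pub fn fill_with_slice(v: &mut Vec<u32>) {
    // Single up-front bounds check: this panics if v has fewer than 1024 elements.
    let slice = &mut v[..1024];
    // The slice length is now known to be exactly 1024, so the compiler
    // can prove that every index below is in range.
    for x in 0..1024 {
        slice[x] = 2;
    }
}

/// Hypothetical variant of the question's "add second half to first half" example.
pub fn add_halves_with_slice(v: &mut Vec<u32>) {
    // Same idea: one check here instead of one per loop iteration.
    let slice = &mut v[..1024];
    for x in 0..512 {
        slice[x] += slice[x + 512];
    }
}
```

Calling both functions in that order on `vec![1u32; 1024]` leaves the first 512 elements at 4 and the remaining 512 at 2.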


That works!

Why is that? Is it because the compiler knows that the slice has already been bounds-checked, but only remembers that within one function?

I don't see it, what do you mean?


This function is auto-vectorized

The next() method for slice iteration will do pointer math to get a reference to the next element and return None (which is identical to null because of the nullable pointer optimisation) if you reach the end of the slice.

That means the bounds check and the "keep iterating" check are done in the same operation (comparing the Option<&T> to null), so we only need to pay the price of the check once. The panic path is skipped completely, because unlike when indexing, we've already either stopped iterating or guaranteed the pointer from next() is valid.
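As a rough sketch of what the iterator approach could look like for the examples in this thread (the function names are invented here, and `split_at_mut` is just one way to borrow both halves at once, not necessarily what anyone above had in mind):

```rust
/// Hypothetical iterator version of the fill example.
pub fn fill_with_iter(v: &mut [u32]) {
    // No indexing, so there is no per-element bounds check or panic path.
    for x in v.iter_mut() {
        *x = 2;
    }
}

/// Hypothetical iterator version of "add the second half to the first half".
pub fn add_halves_with_iter(v: &mut [u32]) {
    let mid = v.len() / 2;
    // split_at_mut yields two non-overlapping mutable slices,
    // so both halves can be iterated at the same time.
    let (first, second) = v.split_at_mut(mid);
    for (a, b) in first.iter_mut().zip(second.iter()) {
        *a += *b;
    }
}

/// Hypothetical iterator version of the question's `sum`.
pub fn sum_with_iter(buf: &[u32]) -> u32 {
    buf.iter().sum()
}
```

Taking `&mut [u32]` rather than `&mut Vec<u32>` is a design choice of this sketch: a `&mut Vec<u32>` argument coerces to a slice automatically, and the functions then work for any length instead of a hard-coded 1024.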
