How to see auto-vectorization in action?

AndreKR · May 7, 2020, 12:48pm

When I was processing video frames in C (for example, OR-ing them with masks), I found that I got a huge performance boost by switching from RGB pixels to RGBA pixels, just because they then aligned to 4-byte boundaries and the processing could be vectorized.

Now with Rust I'm trying to get a feeling for when I can rely on optimizations and when I need assembly code. So I was trying a few examples to see vectorization in action and I found no vectorization at all.

pub fn main() {
    let mut v = vec![1; 1024];
    process_it(&mut v);
    println!("{}", sum(&v)); // Just to make sure v is not discarded
}

fn sum(buf: &Vec<u32>) -> u32 {
    let mut s = 0;
    for i in buf {
        s += *i as u32;
    }
    s
}

Filling a Vec of u32:

#[inline(never)]
fn process_it(v: &mut Vec<u32>) {
    for x in 0..1024 {
        v[x] = 2
    }
}

Resulting assembly:

example::process_it:
        push    rax
        xor     ecx, ecx
.LBB3_1:
        mov     rsi, qword ptr [rdi + 16]
        cmp     rsi, rcx
        jbe     .LBB3_2
        lea     rax, [rcx + 1]
        mov     rdx, qword ptr [rdi]
        mov     dword ptr [rdx + 4*rcx], 2
        mov     rsi, qword ptr [rdi + 16]
        cmp     rsi, rax
        jbe     .LBB3_3
        mov     rax, qword ptr [rdi]
        mov     dword ptr [rax + 4*rcx + 4], 2
        add     rcx, 2
        cmp     rcx, 1024
        jne     .LBB3_1
        pop     rax
        ret
.LBB3_2:
        mov     rax, rcx
.LBB3_3:
        lea     rdx, [rip + .L__unnamed_2]
        mov     rdi, rax
        call    qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
        ud2

There are two optimization remarks:

note: optimization remark for slp-vectorizer at /rustc/1836e3b42a5b2f37fd79104eedbe8f48a5afdee6/src/liballoc/vec.rs:1824:9: Stores SLP vectorized with cost -1 and with tree size 2
note: optimization analysis for loop-vectorize at /rustc/1836e3b42a5b2f37fd79104eedbe8f48a5afdee6/src/libcore/iter/range.rs:211:9: loop not vectorized: loop control flow is not understood by vectorizer

Godbolt link

If I'm reading this correctly (my knowledge of assembly is rather limited) the loop was unrolled by a factor of 2, but nothing was vectorized.

Further examples I tried were...

The same but with a Vec of u8:

#[inline(never)]
fn process_it(v: &mut Vec<u8>) {
    for x in 0..1024 {
        v[x] = 2
    }
}

Godbolt link

Summing the second half of the Vec to the first half, because I believe addition can be vectorized as well:

#[inline(never)]
fn process_it(v: &mut Vec<u32>) {
    for x in 0..512 {
        v[x] += v[x+512]
    }
}

Again, the loop was unrolled by 2 but nothing was vectorized.

Godbolt link

Is there any working example of vectorization? Maybe I'm using the wrong compiler options? But these are no different from the ones that cargo build uses, are they?

RustyYato · May 7, 2020, 12:54pm

What happens if you go out of bounds? The indexing will panic, and that possible panic is what breaks auto-vectorization. Adding this line should improve your codegen:

let slice = &mut v[..1024];

Add this to the beginning of each function, then only operate on the slice. You should see auto-vectorization. If that doesn't work, use iterators.

(You'll notice that in your first example, where you used iterators, there was auto-vectorization.)
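As a minimal sketch of that suggestion applied to the fill and add-halves examples from the question: the function names `fill_with_twos` and `add_second_half` are made up for illustration, and whether the loops really get vectorized depends on the compiler version and optimization level, so it is worth checking the generated assembly.

```rust
// Hypothetical names; a sketch of re-slicing once at the top of the function.

#[inline(never)]
pub fn fill_with_twos(v: &mut Vec<u32>) {
    // Single up-front bounds check: panics here if v.len() < 1024.
    let slice = &mut v[..1024];
    // The slice length is now known to be exactly 1024, so the optimizer
    // can usually prove every index below is in range.
    for x in 0..1024 {
        slice[x] = 2;
    }
}

#[inline(never)]
pub fn add_second_half(v: &mut Vec<u32>) {
    // Same idea: one check up front instead of one per access.
    let slice = &mut v[..1024];
    for x in 0..512 {
        slice[x] += slice[x + 512];
    }
}
```

Both functions still panic on a Vec shorter than 1024 elements, but the panic can only happen once, before the loop starts, rather than inside it.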

AndreKR · May 7, 2020, 1:07pm

That works!

Why is that? Because the compiler knows that slice has already been bounds-checked but it only remembers that within one function?

I don't see it, what do you mean?

RustyYato · May 7, 2020, 1:19pm

Exactly

This function is auto-vectorized

Michael-F-Bryan · May 7, 2020, 4:26pm

The next() method for slice iteration will do pointer math to get a reference to the next element and return None (which is identical to null because of the nullable pointer optimisation) if you reach the end of the slice.

That means the bounds check and the "keep iterating" check are done in the same operation (comparing the Option<&T> to null), so we only need to pay the price of the check once. The panic path is skipped completely, because unlike when indexing, we've already either stopped iterating or guaranteed the pointer from next() is valid.
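To illustrate, here is a rough iterator-based sketch of the two loops from the question. The function names are hypothetical, and using `split_at_mut` plus `zip` for the add-halves case is just one way to write it, not something from the posts above. There is no indexing, so there is no panic path inside the loops.

```rust
// Hypothetical names; a sketch of the iterator style, not the original code.

#[inline(never)]
pub fn fill_with_twos_iter(v: &mut [u32]) {
    // iter_mut() yields each element exactly once; no index, no bounds check.
    for x in v.iter_mut() {
        *x = 2;
    }
}

#[inline(never)]
pub fn add_second_half_iter(v: &mut [u32]) {
    // Split into two non-overlapping halves so one can be mutated
    // while the other is read.
    let mid = v.len() / 2;
    let (first, second) = v.split_at_mut(mid);
    // zip stops at the shorter of the two iterators.
    for (a, b) in first.iter_mut().zip(second.iter()) {
        *a += *b;
    }
}
```

Taking `&mut [u32]` instead of `&mut Vec<u32>` also lets the same functions accept arrays and sub-slices; a `&mut Vec<u32>` coerces to it automatically at the call site.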

