When I was processing video frames in C (for example, OR-ing them with masks), I found that I got a huge performance boost by switching from RGB pixels to RGBA pixels, simply because they were then aligned on 4-byte boundaries and the processing could be vectorized.
Now, with Rust, I'm trying to get a feeling for when I can rely on optimizations and when I need assembly code. So I tried a few examples to see vectorization in action, and found no vectorization at all.
pub fn main() {
    let mut v = vec![1; 1024];
    process_it(&mut v);
    println!("{}", sum(&v)); // Just to make sure v is not discarded
}

fn sum(buf: &Vec<u32>) -> u32 {
    let mut s = 0;
    for i in buf {
        s += *i as u32;
    }
    s
}
Filling a Vec of u32:
#[inline(never)]
fn process_it(v: &mut Vec<u32>) {
    for x in 0..1024 {
        v[x] = 2
    }
}
Resulting assembly:
example::process_it:
        push    rax
        xor     ecx, ecx
.LBB3_1:
        mov     rsi, qword ptr [rdi + 16]
        cmp     rsi, rcx
        jbe     .LBB3_2
        lea     rax, [rcx + 1]
        mov     rdx, qword ptr [rdi]
        mov     dword ptr [rdx + 4*rcx], 2
        mov     rsi, qword ptr [rdi + 16]
        cmp     rsi, rax
        jbe     .LBB3_3
        mov     rax, qword ptr [rdi]
        mov     dword ptr [rax + 4*rcx + 4], 2
        add     rcx, 2
        cmp     rcx, 1024
        jne     .LBB3_1
        pop     rax
        ret
.LBB3_2:
        mov     rax, rcx
.LBB3_3:
        lea     rdx, [rip + .L__unnamed_2]
        mov     rdi, rax
        call    qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
        ud2
There are two optimization remarks:
note: optimization remark for slp-vectorizer at /rustc/1836e3b42a5b2f37fd79104eedbe8f48a5afdee6/src/liballoc/vec.rs:1824:9: Stores SLP vectorized with cost -1 and with tree size 2
note: optimization analysis for loop-vectorize at /rustc/1836e3b42a5b2f37fd79104eedbe8f48a5afdee6/src/libcore/iter/range.rs:211:9: loop not vectorized: loop control flow is not understood by vectorizer
If I'm reading this correctly (my knowledge of assembly is rather limited), the loop was unrolled by a factor of 2, but nothing was vectorized.
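For comparison, here is a sketch of the same fill written against a slice iterator instead of an index. In the assembly above, each store is preceded by a `cmp`/`jbe` pair that jumps to `panic_bounds_check`, so every iteration has an extra exit. The assumption here is that this is the control flow the loop-vectorize remark refers to. The name `process_it_iter` is hypothetical, and whether this version actually vectorizes depends on the compiler version and optimization level, so the generated assembly would need to be checked.

```rust
// Hypothetical variant of process_it (not from the original code).
// It takes a slice and walks it with an iterator, so there is no
// v[x] index expression and therefore no per-element bounds check.
#[inline(never)]
pub fn process_it_iter(v: &mut [u32]) {
    for x in v.iter_mut() {
        *x = 2;
    }
}
```

It would be called as `process_it_iter(&mut v)`, or as `process_it_iter(&mut v[..1024])` to keep the original behaviour of touching exactly the first 1024 elements (with a single bounds check up front).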
Further examples I tried were...
The same but with a Vec of u8:
#[inline(never)]
fn process_it(v: &mut Vec<u8>) {
    for x in 0..1024 {
        v[x] = 2
    }
}
Adding the second half of the Vec to the first half, because I believe addition can be vectorized as well:
#[inline(never)]
fn process_it(v: &mut Vec<u32>) {
    for x in 0..512 {
        v[x] += v[x + 512]
    }
}
Again, the loop was unrolled by 2, but nothing was vectorized.
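The addition case can be sketched in the same index-free style. This is a minimal sketch, not a verified fix: the name `add_halves` is hypothetical, it splits at the midpoint of the slice rather than at the hard-coded 512, and the generated assembly would still need to be inspected to see whether the loop is vectorized.

```rust
// Hypothetical variant of the "add second half to first half" example.
// split_at_mut yields two non-overlapping mutable halves, and zip walks
// them in lockstep without indexing (so no per-element bounds checks).
#[inline(never)]
pub fn add_halves(v: &mut [u32]) {
    let mid = v.len() / 2;
    let (lo, hi) = v.split_at_mut(mid);
    for (a, b) in lo.iter_mut().zip(hi.iter()) {
        // Same `+=` as the original; like the original, this panics on
        // overflow in builds with overflow checks enabled.
        *a += *b;
    }
}
```

For the 1024-element Vec used above, `add_halves(&mut v)` touches the same elements as the indexed loop over `0..512`.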
Is there any working example of vectorization? Maybe I'm using the wrong compiler options? But these are no different from the ones that cargo build uses, are they?