Why is this simple code not autovectorized by LLVM?




A floating-point reduction loop (a sum) can’t be vectorized even if the bounds checks are optimized out, because additions can’t be reordered under the default rules for floating-point operations: floating-point addition is not associative, so a different order can change the result.
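To make the reordering problem concrete, here is a small sketch (the function names are made up for illustration). The first function adds strictly in order, as the scalar loop must. The second keeps two independent partial sums, which is roughly the order a 2-lane vectorized reduction would use.

```rust
/// Sums strictly left to right, the order a plain scalar loop has to preserve.
pub fn sum_in_order(xs: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for &x in xs {
        sum += x;
    }
    sum
}

/// Sums with two independent accumulators (even and odd positions),
/// imitating the order of a 2-lane vectorized reduction.
pub fn sum_two_lanes(xs: &[f32]) -> f32 {
    let mut lanes = [0.0f32; 2];
    for pair in xs.chunks(2) {
        for (lane, &x) in lanes.iter_mut().zip(pair) {
            *lane += x;
        }
    }
    lanes[0] + lanes[1]
}
```

For the input `[1.0e8, 1.0, -1.0e8, 1.0]` the in-order sum is 1.0 while the two-lane sum is 2.0, because adding 1.0 to 1.0e8 is lost to rounding in f32. The compiler is not allowed to change the result like that on its own.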

It was also mentioned here.


Hm… I removed the reduction, but there is still no vectorization.


Yes, there are still bounds checks in the loop unfortunately. I think zip specialization should make sure using .zip() works fine here. It’s not a given even then (apparently) but here are two versions that work: (playground link)

// Safe version: iterates with zip instead of indexing.
pub fn dot_zip(x: &[f32], y: &[f32], z: &[f32], out: &mut [f32]) {
    for (&a, (&b, (&c, d))) in
        x.iter().zip(y.iter().zip(z.iter().zip(out))) {
        *d = a * a + b * b + c * c;
    }
}

use std::cmp::min;
// `unsafe`-using version for comparison
pub fn dot_unchecked(x: &[f32], y: &[f32], z: &[f32], out: &mut [f32]) {
    // The shortest slice bounds the loop, so every index below is in range.
    let mut len = min(x.len(), y.len());
    len = min(len, min(z.len(), out.len()));
    for i in 0..len {
        unsafe {
            let a = *x.get_unchecked(i);
            let b = *y.get_unchecked(i);
            let c = *z.get_unchecked(i);
            let d = out.get_unchecked_mut(i);
            *d = a * a + b * b + c * c;
        }
    }
}
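For completeness, a third variant that is not from the playground link: a sketch that keeps plain indexing but first re-slices every input to the common length. The name `dot_sliced` is made up. The assumption is that the optimizer can then prove `i < len` for every slice and drop the checks; that usually works, but it is worth confirming in the asm output.

```rust
/// Safe indexed variant: every slice is first cut to the common length,
/// so each index in `0..len` is provably in bounds for all four slices.
pub fn dot_sliced(x: &[f32], y: &[f32], z: &[f32], out: &mut [f32]) {
    let len = x.len().min(y.len()).min(z.len()).min(out.len());
    let (x, y, z) = (&x[..len], &y[..len], &z[..len]);
    let out = &mut out[..len];
    for i in 0..len {
        out[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
    }
}
```

Elements of `out` past the common length are left untouched, the same as in the two versions above.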

How to look for bounds checks? (In case someone wonders)

Take the isolated code. Use release mode and the ASM button. There are all these panic cases that bounds checks can jump to in the code:

	leaq	panic_bounds_check_loc.4(%rip), %rdi
	movq	%r9, %rdx
	callq	_ZN4core9panicking18panic_bounds_check17h7d966cc89f07df40E@PLT
	callq	_ZN4core5slice20slice_index_len_fail17hf63c0fc1cb19cea8E@PLT
	movq	%rcx, %rsi
	callq	_ZN4core5slice20slice_index_len_fail17hf63c0fc1cb19cea8E@PLT

Ok but are they used in the loop?

This section looks like the main loop:

	.p2align	4, 0x90
.LBB0_4:
	cmpq	%r9, %rsi
	jae	.LBB0_8
	movss	(%rax,%rsi,4), %xmm0
	movss	(%rdx,%rsi,4), %xmm1
	movss	(%r8,%rsi,4), %xmm2
	mulss	%xmm2, %xmm2
	mulss	%xmm0, %xmm0
	mulss	%xmm1, %xmm1
	addss	%xmm0, %xmm1
	addss	%xmm2, %xmm1
	movss	%xmm1, (%rcx,%rsi,4)
	incq	%rsi
	cmpq	%rdi, %rsi
	jb	.LBB0_4

It looks like the main loop because its start is aligned and it ends with an instruction that jumps back to the top: jb .LBB0_4.

The code at the start:

    cmpq    %r9, %rsi
    jae    .LBB0_8

is a bounds check. jae = jump if above or equal. So it’s testing rsi < r9 and jumping to .LBB0_8 (one of the panic cases from before) if the test fails.
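At the source level, that compare-and-jump pair corresponds to the check hidden inside `x[i]`. A rough sketch of what indexing does, written out by hand (`index_checked` is a made-up helper, not how the standard library spells it):

```rust
/// Hand-written equivalent of `xs[i]`: compare the index with the length
/// and branch to the panic path if it is out of range.
pub fn index_checked(xs: &[f32], i: usize) -> f32 {
    if i >= xs.len() {
        // This branch plays the role of the `jae` to the panic block.
        panic!(
            "index out of bounds: the len is {} but the index is {}",
            xs.len(),
            i
        );
    }
    // SAFETY: `i < xs.len()` was checked just above.
    unsafe { *xs.get_unchecked(i) }
}
```

A branch like this inside the loop body, one that can leave the loop through a panic, is what stands in the way of vectorization here.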


To return to the original question, using “fast math” semantics, it does vectorize:


#![feature(core_intrinsics)] // nightly only
use std::intrinsics::{fadd_fast as fadd, fmul_fast as fmul};

pub fn dot3_fast(x: &[f32], y: &[f32], z: &[f32]) -> f32 {
    let mut sum = 0.;
    for (&a, (&b, &c)) in x.iter().zip(y.iter().zip(z)) {
        // The fast-math intrinsics allow the additions to be reordered.
        unsafe {
            sum = fadd(sum, fadd(fmul(a, a), fadd(fmul(b, b), fmul(c, c))));
        }
    }
    sum
}

Compiler output (asm) looks like this when allowing avx: https://gist.github.com/bluss/bf695d2405a4ea36fc5ab78609e57809

Edit: Ok, given the number of hearts this comment has:

This needs design and a path to stable Rust. We can’t use intrinsics in stable Rust; they aren’t even type checked!
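Until then, one workaround that works on stable Rust is to do the reordering by hand. This is a sketch, not something measured in this thread: it keeps several independent accumulators so the optimizer is free to map them onto vector lanes. The name `dot3_lanes` and the lane count of 8 are arbitrary choices, and the result can differ slightly from the strictly in-order sum.

```rust
/// Sum of a*a + b*b + c*c over three slices, accumulated in LANES
/// independent partial sums so that no single addition chain spans the loop.
pub fn dot3_lanes(x: &[f32], y: &[f32], z: &[f32]) -> f32 {
    const LANES: usize = 8; // arbitrary choice for this sketch
    let len = x.len().min(y.len()).min(z.len());
    let (x, y, z) = (&x[..len], &y[..len], &z[..len]);

    let mut acc = [0.0f32; LANES];
    let mut xc = x.chunks_exact(LANES);
    let mut yc = y.chunks_exact(LANES);
    let mut zc = z.chunks_exact(LANES);
    for ((a, b), c) in (&mut xc).zip(&mut yc).zip(&mut zc) {
        for i in 0..LANES {
            acc[i] += a[i] * a[i] + b[i] * b[i] + c[i] * c[i];
        }
    }

    // Combine the partial sums, then handle the elements that did not
    // fill a whole chunk.
    let mut sum: f32 = acc.iter().sum();
    let rest = xc.remainder().iter().zip(yc.remainder()).zip(zc.remainder());
    for ((&a, &b), &c) in rest {
        sum += a * a + b * b + c * c;
    }
    sum
}
```

Whether this actually vectorizes still has to be checked in the asm output, the same way as for the bounds checks above.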




I have a related question. How can I create an AoS (array of structures) view of SoA (structure of arrays) data? I think this approach can speed up vectorized code and keep a good level of abstraction, so I can use my functions like dot(a: Vector3, b: Vector3) inside a loop and still get good autovectorized code.

For example, I have this data:

struct Vector3soa {
    x: Vec<f32>,
    y: Vec<f32>,
    z: Vec<f32>,
}

I need something like this as a view of my data:

struct Vector3<'a> {
    x: &'a f32,
    y: &'a f32,
    z: &'a f32,
}
let mut Vector3aosView: Vec<Vector3> = Vec::new();

But I don’t understand how to deal with mutability.
Sorry for my English.


Why do you want references there?


Because I think this tells LLVM that my data is correctly aligned, and it helps avoid unneeded load operations…
But maybe I am wrong.
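One possible shape for such a view, as a sketch only (all type and method names below are made up for illustration): read access hands out small `Copy` structs by value, and only the mutable view holds references. Because the three `Vec`s are separate fields, they can be borrowed mutably at the same time, which takes care of the mutability question.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Vector3 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

/// Mutable view of one element; one reference into each column.
pub struct Vector3Mut<'a> {
    pub x: &'a mut f32,
    pub y: &'a mut f32,
    pub z: &'a mut f32,
}

pub struct Vector3Soa {
    pub x: Vec<f32>,
    pub y: Vec<f32>,
    pub z: Vec<f32>,
}

impl Vector3Soa {
    /// Read-only AoS view: yields each element by value.
    pub fn iter(&self) -> impl Iterator<Item = Vector3> + '_ {
        self.x
            .iter()
            .zip(&self.y)
            .zip(&self.z)
            .map(|((&x, &y), &z)| Vector3 { x, y, z })
    }

    /// Mutable AoS view: the three fields are disjoint, so borrowing
    /// all of them mutably at once is accepted by the borrow checker.
    pub fn iter_mut(&mut self) -> impl Iterator<Item = Vector3Mut<'_>> + '_ {
        self.x
            .iter_mut()
            .zip(&mut self.y)
            .zip(&mut self.z)
            .map(|((x, y), z)| Vector3Mut { x, y, z })
    }
}

pub fn dot(a: Vector3, b: Vector3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}
```

Usage would look like `for v in soa.iter_mut() { *v.x *= 2.0; }` for writes and `soa.iter().map(|v| dot(v, v))` for reads. Whether the optimizer vectorizes a given loop over these iterators is not guaranteed and should be checked in the asm.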