SIMD version of function runs slower than normal version

Hello,

I have been experimenting with a few SIMD instructions, and I've once again found myself in an odd situation.

In this case, when I pass two f32 vectors into my SIMD function, it is actually slower than the naive version?!

Another oddity is that when I put my code into the Rust Playground, it shows the expected result of the SIMD version running much faster (kind of) than the naive version:

*Note: I just realized I could have initialized the two vectors as vec![1.0; size]. I've been changing things around, including using f32s instead of f64s.
Using f64s changes nothing; it just takes twice as long.

use std::time::Instant;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;


const SIMD_SIZE : usize = 8;

fn main() {
    let size : usize = (28*28) - 2;

    let mut dbls : Vec<f32> = Vec::with_capacity(size);
    let mut dbls2 : Vec<f32> = Vec::with_capacity(size);
    for _ in 0 .. size {
        dbls.push(1.0);
        dbls2.push(1.0);
    }


    let mut start = Instant::now();
    let mut sum : f32 = 0.0;
    for _ in 0 .. size {
        sum += mult_add_func(size, &mut dbls, &mut dbls2);
    }
    let mut duration = start.elapsed();
    println!("normal took {:?}, for {}", duration, sum);


    start = Instant::now();
    sum = 0.0;
    for _ in 0 .. size {
        sum += simd_mult_add_func(size, &mut dbls, &mut dbls2);
    }
    duration = start.elapsed();
    println!("SIMD took {:?}, for {}", duration, sum);


}

#[inline(never)]
fn mult_add_func(size : usize, dbls : &mut Vec<f32>, dbls2 : &mut Vec<f32>) -> f32 {
    let mut result : f32 = 0.0;
    for j in 0 .. size {
        result += dbls[j] * dbls2[j];
    }
    result
}

#[inline(never)]
fn simd_mult_add_func(size : usize, dbls : &mut Vec<f32>, dbls2 : &mut Vec<f32>) -> f32 {
    let mut result : f32 = 0.0;
    
    let mut accumulator : Vec<f32> = vec![0.0; SIMD_SIZE];
    let boundary : usize = (size / SIMD_SIZE) * SIMD_SIZE;
    let mut j : usize = 0;

    unsafe {
        let c = accumulator.get_unchecked_mut(0);
        let mut simd_c = _mm256_loadu_ps(c);
        while j < boundary {
            let a = dbls.get_unchecked(j); 
            let b = dbls2.get_unchecked(j); 
            let simd_a = _mm256_loadu_ps(a);
            let simd_b = _mm256_loadu_ps(b);
            simd_c = _mm256_fmadd_ps(simd_a, simd_b, simd_c);
            j += SIMD_SIZE;
        }
        _mm256_storeu_ps(accumulator.get_unchecked_mut(0), simd_c);
    }
    for i in 0 .. accumulator.len() {
        result += accumulator[i];
    }
    for i in boundary .. size {
        result += dbls[i] * dbls2[i];
    }
    result
}

(Playground)

Output:

normal took 65.42439ms, for 611524
SIMD took 10.477732ms, for 611524

Errors:

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 0.89s
     Running `target/debug/playground`

When I run it on my machine, it shows this:

This does not make sense to me, especially when I modify the SIMD version to iterate inside the function, instead of calling the function in the loop:

use std::time::Instant;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;


const SIMD_SIZE : usize = 8;

fn main() {
    let size : usize = (28*28) - 2;

    let mut dbls : Vec<f32> = Vec::with_capacity(size);
    let mut dbls2 : Vec<f32> = Vec::with_capacity(size);
    for _ in 0 .. size {
        dbls.push(1.0);
        dbls2.push(1.0);
    }


    let mut start = Instant::now();
    let mut sum : f32 = 0.0;
    for _ in 0 .. size {
        sum += mult_add_func(size, &mut dbls, &mut dbls2);
    }
    let mut duration = start.elapsed();
    println!("normal took {:?}, for {}", duration, sum);


    start = Instant::now();
    sum = 0.0;
    //for _ in 0 .. size {
        sum += simd_mult_add_func(size, &mut dbls, &mut dbls2);
    //}
    duration = start.elapsed();
    println!("SIMD took {:?}, for {}", duration, sum);


}

#[inline(never)]
fn mult_add_func(size : usize, dbls : &mut Vec<f32>, dbls2 : &mut Vec<f32>) -> f32 {
    let mut result : f32 = 0.0;
    for j in 0 .. size {
        result += dbls[j] * dbls2[j];
    }
    result
}

#[inline(never)]
fn simd_mult_add_func(size : usize, dbls : &mut Vec<f32>, dbls2 : &mut Vec<f32>) -> f32 {
    let mut result : f32 = 0.0;
    
    let mut accumulator : Vec<f32> = vec![0.0; SIMD_SIZE];
    let boundary : usize = (size / SIMD_SIZE) * SIMD_SIZE;
    let mut j : usize = 0;
    for _ in 0 .. size {
        unsafe {
            let c = accumulator.get_unchecked_mut(0);
            let mut simd_c = _mm256_loadu_ps(c);
            while j < boundary {
                let a = dbls.get_unchecked(j); 
                let b = dbls2.get_unchecked(j); 
                let simd_a = _mm256_loadu_ps(a);
                let simd_b = _mm256_loadu_ps(b);
                simd_c = _mm256_fmadd_ps(simd_a, simd_b, simd_c);
                j += SIMD_SIZE;
            }
            _mm256_storeu_ps(accumulator.get_unchecked_mut(0), simd_c);
        }
        for i in 0 .. accumulator.len() {
            result += accumulator[i];
        }
        for i in boundary .. size {
            result += dbls[i] * dbls2[i];
        }
    }
    result
}

I get:

I have a Ryzen 5 2400G; it does support these instructions. And as you can see, when the loop is inside the function, I get what I expect.

Can someone help me make sense of this?

Thanks!

Does compiling with RUSTFLAGS='-C target-cpu=native -C target-feature=+avx2' change anything?

What about alignment?
AFAIK SIMD wants at least 128-bit alignment.

BTW, results from Instant are noisy. Use bencher (and benchcmp) or criterion.


Using those flags does not appear to change anything.

I have:
[build]
RUSTFLAGS='-C target-cpu=native -C target-feature=+avx2'

in my Cargo.toml file. I think that's correct, AFAIK.

According to some documentation, and another forum member, the instructions I'm using do not require alignment.

However, there are many other SIMD instructions, including ones I was using before that do.

Also, I think I would be getting terrible results in both cases if that were the issue.

Ok, I'll try that out, and see what results I get from those, thanks!

Can someone try this on their machine?

Does it give the same results?

Your SIMD code does an allocation on each call, which is not a cheap operation in such a micro-benchmark. Overall your code is quite unidiomatic; I would have written it like this. On my PC (AMD 2700X) I get the following result for it (after some CPU warm-up):

normal took 419.563µs, for 611524
SIMD took 131.368µs, for 611524
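The linked playground isn't reproduced in the thread, but as a rough, hypothetical sketch of the kind of rewrite meant here (the function name is illustrative, not the actual linked code), an iterator-based dot product avoids the per-call accumulator allocation and lets LLVM auto-vectorize in release mode:

```rust
/// Iterator-based dot product: no temporary accumulator Vec is
/// allocated per call, and with `--release` plus
/// `-C target-cpu=native` LLVM can typically auto-vectorize this.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let size = (28 * 28) - 2;
    let a = vec![1.0f32; size];
    let b = vec![1.0f32; size];
    println!("dot = {}", dot(&a, &b)); // prints "dot = 782"
}
```

Taking `&[f32]` slices instead of `&mut Vec<f32>` also matches how the original functions actually use their arguments (read-only).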

Your second version simply does less work (size times less, to be exact), so it's no wonder you measure only 10 microseconds. Note that both your functions process the whole slices given to them, not just a part of them.

Also note that, IIRC, your CPU does not have true 256-bit SIMD units; instead they're emulated using 128-bit blocks, so the performance of AVX2 code on it will be approximately equal to SSE2. Plus, for _mm256_fmadd_ps you should enable the fma target feature, otherwise this intrinsic will not be inlined (this should be handled by -C target-cpu=native, but still).

This is technically unsound. accumulator.get_unchecked_mut(0) gives access only to the first element of the vector. If you want the same pointer, but one that is allowed to access all of the vector's elements, you should use accumulator.as_mut_ptr().
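To make that concrete, here is a hedged sketch (the function names are mine, not from the thread) of the same FMA loop with the raw pointers derived from the whole slices via as_ptr()/as_mut_ptr(), and with the avx2/fma features enabled on the inner function:

```rust
// Sketch only: pointers come from as_ptr()/as_mut_ptr() on the
// whole slice, so every offset stays within that pointer's
// provenance, unlike a reference to a single element.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let boundary = (a.len() / 8) * 8;
    let mut simd_c = _mm256_setzero_ps();
    let mut j = 0;
    while j < boundary {
        // Offsets within pointers valid for the full slices.
        let simd_a = _mm256_loadu_ps(a.as_ptr().add(j));
        let simd_b = _mm256_loadu_ps(b.as_ptr().add(j));
        simd_c = _mm256_fmadd_ps(simd_a, simd_b, simd_c);
        j += 8;
    }
    let mut acc = [0.0f32; 8];
    // as_mut_ptr() is valid for all eight elements.
    _mm256_storeu_ps(acc.as_mut_ptr(), simd_c);
    let mut result: f32 = acc.iter().sum();
    for i in boundary..a.len() {
        result += a[i] * b[i];
    }
    result
}

fn dot_fma(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        // SAFETY: feature support was just checked at runtime.
        return unsafe { dot_avx2(a, b) };
    }
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = vec![1.0f32; 782];
    let b = vec![1.0f32; 782];
    println!("{}", dot_fma(&a, &b)); // prints "782"
}
```

The runtime feature check also makes the function safe to call on CPUs without AVX2/FMA, falling back to the scalar loop.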

Is this actually guaranteed to be sound? I would have created a MaybeUninit<[f32; SIMD_SIZE]> and stored the result in it with _mm256_storeu_ps.
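A hedged sketch of that MaybeUninit idea (the helper name is mine): store all eight lanes into uninitialized stack storage, and call assume_init only once the array has been fully written:

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn m256_to_array(v: std::arch::x86_64::__m256) -> [f32; 8] {
    use std::mem::MaybeUninit;
    let mut out = MaybeUninit::<[f32; 8]>::uninit();
    // _mm256_storeu_ps writes all eight lanes, fully
    // initializing the array before assume_init is called.
    std::arch::x86_64::_mm256_storeu_ps(out.as_mut_ptr().cast::<f32>(), v);
    out.assume_init()
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx") {
        let arr =
            unsafe { m256_to_array(std::arch::x86_64::_mm256_set1_ps(2.5)) };
        assert!(arr.iter().all(|&x| x == 2.5));
        println!("ok");
    }
}
```

This avoids both the heap allocation and the question of whether a &mut to one element may be used to write eight.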


From transmute docs:

transmute is semantically equivalent to a bitwise move of one type into another. It copies the bits from the source value into the destination value, then forgets the original. It's equivalent to C's memcpy under the hood, just like transmute_copy.

And from __m256 docs:

This type is the same as the __m256 type defined by Intel, representing a 256-bit SIMD register which internally is consisted of eight packed f32 instances.

In my understanding this means that the transmute is sound.

I dropped your version in and ran it; I got the same results:

So I must be doing something incorrect in terms of the build flags.

I have this in my cargo.toml:

I did not add anything to the source file.

The docs for Cargo mention these flags, but I can't tell whether you have to use them on the command line, go into Windows settings and set an environment variable there, or do what I did (which apparently might not be working).

Did you set those features differently?

I was going off of this blog:

The author uses the same approach near the bottom of the post.
I thought it took a raw pointer to that index and then unsafely assumed that more values are stored contiguously next to it??

Maybe the author needs to have an email sent to him :slight_smile:

Please read the documentation carefully. You will not be able to learn Rust programming in a reasonable time without doing so.

Build options should be specified not in Cargo.toml, but in .cargo/config. See the following link for more information: Configuration - The Cargo Book. There is a proposal to change that, but for now that's the way it is.

So in your case it can look like this (note that target-feature is redundant, since target-cpu=native already enables all features available on your CPU):

[build]
rustflags = ["-C", "target-cpu=native"]

Alternatively you can simply use RUSTFLAGS="-C target-cpu=native" cargo run --release, it's the most commonly used form.


If you mean where he does the following:

let (a, b, c) = (
    x.get_unchecked(idx * 8),
    y.get_unchecked(idx * 8),
    z.get_unchecked_mut(idx * 8),
);

Then yes, that's pretty similar to what you did and in fact should have the exact same problem.

Is it possible you're compiling without the --release flag? It's been mentioned already, but it's one of those mistakes we all make :)). Also, have you confirmed which features your CPU supports? E.g., on a Mac: sysctl -a | grep machdep.cpu.features. In my experience, if the code relies on AVX2 and your CPU doesn't support it, the compiler can opt to emit scalar code (not even SSE).