Hello,

I have been experimenting with a few SIMD instructions, and I've once again found myself in an odd situation.

In this case, when I pass two f32 vectors into my SIMD function, it is actually slower than the naive version?!

Another oddity is that when I place my code into the Rust playground, it shows the expected result, with the SIMD version running much faster (kind of) than the naive version:

Note: I just realized I could have initialized the two vectors as `vec![1.0; size]`. I've been changing things around, including using f32s instead of f64s.

Using f64s changes nothing; it just takes 2x as long.
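For reference, here is a small sketch of the shorter initialization mentioned in the note, next to the push loop used in the code below. The helper names `init_with_push` and `init_with_macro` are made up for this illustration and do not appear in my program.

```rust
/// Fills a vector with 1.0 by pushing in a loop, as the code in this post does.
/// (Hypothetical helper, for illustration only.)
fn init_with_push(size: usize) -> Vec<f32> {
    let mut v: Vec<f32> = Vec::with_capacity(size);
    for _ in 0..size {
        v.push(1.0);
    }
    v
}

/// Builds the same vector with the `vec!` repeat syntax.
/// (Hypothetical helper, for illustration only.)
fn init_with_macro(size: usize) -> Vec<f32> {
    vec![1.0; size]
}
```

Both should produce identical vectors, so this change should not affect the timings being compared.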

```
use std::time::Instant;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

const SIMD_SIZE : usize = 8;

fn main() {
    let size : usize = (28*28) - 2;
    let mut dbls : Vec<f32> = Vec::with_capacity(size);
    let mut dbls2 : Vec<f32> = Vec::with_capacity(size);
    for _ in 0 .. size {
        dbls.push(1.0);
        dbls2.push(1.0);
    }
    let mut start = Instant::now();
    let mut sum : f32 = 0.0;
    for _ in 0 .. size {
        sum += mult_add_func(size, &mut dbls, &mut dbls2);
    }
    let mut duration = start.elapsed();
    println!("normal took {:?}, for {}", duration, sum);
    start = Instant::now();
    sum = 0.0;
    for _ in 0 .. size {
        sum += simd_mult_add_func(size, &mut dbls, &mut dbls2);
    }
    duration = start.elapsed();
    println!("SIMD took {:?}, for {}", duration, sum);
}

#[inline(never)]
fn mult_add_func(size : usize, dbls : &mut Vec<f32>, dbls2 : &mut Vec<f32>) -> f32 {
    let mut result : f32 = 0.0;
    for j in 0 .. size {
        result += dbls[j] * dbls2[j];
    }
    result
}

#[inline(never)]
fn simd_mult_add_func(size : usize, dbls : &mut Vec<f32>, dbls2 : &mut Vec<f32>) -> f32 {
    let mut result : f32 = 0.0;
    let mut accumulator : Vec<f32> = vec![0.0; SIMD_SIZE];
    let boundary : usize = (size / SIMD_SIZE) * SIMD_SIZE;
    let mut j : usize = 0;
    unsafe {
        let c = accumulator.get_unchecked_mut(0);
        let mut simd_c = _mm256_loadu_ps(c);
        while j < boundary {
            let a = dbls.get_unchecked(j);
            let b = dbls2.get_unchecked(j);
            let simd_a = _mm256_loadu_ps(a);
            let simd_b = _mm256_loadu_ps(b);
            simd_c = _mm256_fmadd_ps(simd_a, simd_b, simd_c);
            j += SIMD_SIZE;
        }
        _mm256_storeu_ps(accumulator.get_unchecked_mut(0), simd_c);
    }
    for i in 0 .. accumulator.len() {
        result += accumulator[i];
    }
    for i in boundary .. size {
        result += dbls[i] * dbls2[i];
    }
    result
}
```

Output:

```
normal took 65.42439ms, for 611524
SIMD took 10.477732ms, for 611524
```

Errors:

```
Compiling playground v0.0.1 (/playground)
Finished dev [unoptimized + debuginfo] target(s) in 0.89s
Running `target/debug/playground`
```

When I run it on my machine, it shows this:

This does not make sense to me, especially when I modify the SIMD version to iterate inside the function instead of calling the function in a loop:

```
use std::time::Instant;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

const SIMD_SIZE : usize = 8;

fn main() {
    let size : usize = (28*28) - 2;
    let mut dbls : Vec<f32> = Vec::with_capacity(size);
    let mut dbls2 : Vec<f32> = Vec::with_capacity(size);
    for _ in 0 .. size {
        dbls.push(1.0);
        dbls2.push(1.0);
    }
    let mut start = Instant::now();
    let mut sum : f32 = 0.0;
    for _ in 0 .. size {
        sum += mult_add_func(size, &mut dbls, &mut dbls2);
    }
    let mut duration = start.elapsed();
    println!("normal took {:?}, for {}", duration, sum);
    start = Instant::now();
    sum = 0.0;
    //for _ in 0 .. size {
    sum += simd_mult_add_func(size, &mut dbls, &mut dbls2);
    //}
    duration = start.elapsed();
    println!("SIMD took {:?}, for {}", duration, sum);
}

#[inline(never)]
fn mult_add_func(size : usize, dbls : &mut Vec<f32>, dbls2 : &mut Vec<f32>) -> f32 {
    let mut result : f32 = 0.0;
    for j in 0 .. size {
        result += dbls[j] * dbls2[j];
    }
    result
}

#[inline(never)]
fn simd_mult_add_func(size : usize, dbls : &mut Vec<f32>, dbls2 : &mut Vec<f32>) -> f32 {
    let mut result : f32 = 0.0;
    let mut accumulator : Vec<f32> = vec![0.0; SIMD_SIZE];
    let boundary : usize = (size / SIMD_SIZE) * SIMD_SIZE;
    let mut j : usize = 0;
    for _ in 0 .. size {
        unsafe {
            let c = accumulator.get_unchecked_mut(0);
            let mut simd_c = _mm256_loadu_ps(c);
            while j < boundary {
                let a = dbls.get_unchecked(j);
                let b = dbls2.get_unchecked(j);
                let simd_a = _mm256_loadu_ps(a);
                let simd_b = _mm256_loadu_ps(b);
                simd_c = _mm256_fmadd_ps(simd_a, simd_b, simd_c);
                j += SIMD_SIZE;
            }
            _mm256_storeu_ps(accumulator.get_unchecked_mut(0), simd_c);
        }
        for i in 0 .. accumulator.len() {
            result += accumulator[i];
        }
        for i in boundary .. size {
            result += dbls[i] * dbls2[i];
        }
    }
    result
}
```

I get:

I have a Ryzen 5 2400G, which does support these instructions. And as you can see, when the loop is inside the function, I get what I expect.
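To double-check the claim about instruction support from inside the program, something like the following sketch could be used. It assumes the standard `is_x86_feature_detected!` macro and the `#[target_feature]` attribute; the function names `dot`, `dot_scalar`, and `dot_avx_fma` are hypothetical and are not part of my code above. It is only one possible way to structure this, not necessarily the right fix for the timings.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Plain scalar dot product, used as the fallback. (Hypothetical helper.)
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// AVX + FMA dot product. (Hypothetical helper.)
/// `#[target_feature]` enables these features for this function only, so the
/// intrinsics can be compiled inline even when the crate as a whole is built
/// without `-C target-feature=+avx,+fma`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,fma")]
unsafe fn dot_avx_fma(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let boundary = n / 8 * 8; // largest multiple of 8 that fits
    let mut lanes = [0.0f32; 8];
    // SAFETY: every load reads 8 floats starting at j, with j + 8 <= boundary <= n,
    // and the store writes exactly 8 floats into `lanes`.
    unsafe {
        let mut acc = _mm256_setzero_ps();
        let mut j = 0;
        while j < boundary {
            let va = _mm256_loadu_ps(a.as_ptr().add(j));
            let vb = _mm256_loadu_ps(b.as_ptr().add(j));
            acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb, lane by lane
            j += 8;
        }
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    }
    // Horizontal sum of the 8 lanes, then the scalar tail.
    let mut result: f32 = lanes.iter().sum();
    for i in boundary..n {
        result += a[i] * b[i];
    }
    result
}

/// Uses the SIMD path only if the CPU reports AVX and FMA at runtime.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            // SAFETY: both required CPU features were just detected.
            return unsafe { dot_avx_fma(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

With this, `dot(&dbls, &dbls2)` would take the place of `simd_mult_add_func(size, &mut dbls, &mut dbls2)` in the timing loop, and it would fall back to the scalar loop on a CPU that lacks either feature.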

Can someone help me make sense of this?

Thanks!