AVX2 intrinsics appear to be slower than a scalar loop?

I've posted on Reddit here too -- this is a copy of that post.

Hi all!

I'm working on an open-source project, x86-simd, and am running into some exceptionally confusing benchmark results. I have two functions, both of which add 16 integers to 16 other integers (one using an AVX2 intrinsic, the other using a plain loop), and the loop version appears to be faster. The relevant functions are as follows:

    #[inline(never)]
    pub fn simd_vertical_add(a: u16x16, b: u16x16) -> u16x16 {
        // SAFETY: Assume that we have AVX2 -- it should be checked outside this benchmark function.
        unsafe { u16x16::avx2_vertical_add(a, b) }
    }
    
    #[inline(never)]
    pub fn scalar_vertical_add(a: u16x16, b: u16x16) -> u16x16 {
        let mut result = [0; 16];
    
        for i in 0..16 {
            result[i] = a.as_array_ref()[i] + b.as_array_ref()[i];
        }
    
        u16x16::from_array(result)
    }

(These functions and the rest of the crate are on GitHub here.)

Upon benchmarking these functions with the criterion crate, I found that the first one takes about 7 ns per iteration, whereas the second takes only about 1.5 ns per iteration. This was very confusing to me. Using cargo-show-asm, I found that the compiler was optimizing the loop down to SSE2 SIMD instructions, which did not surprise me. What did surprise me was that the SSE2 version seems to be faster. Below is the llvm-mca output, which I'm using to try to reason about why the SSE2 version is faster than the AVX2 version:

SIMD version:

    cargo asm --mca -M -timeline --bench=simd_sum avx2_vertical_add
       Compiling x86-simd v0.2.0 (C:\Users\****\Documents\Projects\x86-simd)
        Finished `release` profile [optimized] target(s) in 11.07s
    Iterations:        100
    Instructions:      500
    Total Cycles:      249
    Total uOps:        600
    
    Dispatch Width:    4
    uOps Per Cycle:    2.41
    IPC:               2.01
    Block RThroughput: 1.5
    
    
    Instruction Info:
    [1]: #uOps
    [2]: Latency
    [3]: RThroughput
    [4]: MayLoad
    [5]: MayStore
    [6]: HasSideEffects (U)
    
    [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
     1      8     0.33    *                   vmovdqa       ymm0, ymmword ptr [r8]
     1      8     0.33    *                   vpaddw        ymm0, ymm0, ymmword ptr [rdx]
     1      1     0.33           *            vmovdqa       ymmword ptr [rcx], ymm0
     1      1     0.25                  U     vzeroupper
     2      1     0.50                  U     ret
    
    
    Resources:
    [0]   - Zn2AGU0
    [1]   - Zn2AGU1
    [2]   - Zn2AGU2
    [3]   - Zn2ALU0
    [4]   - Zn2ALU1
    [5]   - Zn2ALU2
    [6]   - Zn2ALU3
    [7]   - Zn2Divider
    [8]   - Zn2FPU0
    [9]   - Zn2FPU1
    [10]  - Zn2FPU2
    [11]  - Zn2FPU3
    [12]  - Zn2Multiplier
    
    
    Resource pressure per iteration:
    [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]
    1.00   1.00   1.00   0.50   0.49   0.50   0.51    -     0.33   0.33    -     0.34    -
    
    Resource pressure by instruction:
    [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   Instructions:
    0.21   0.06   0.73    -      -      -      -      -      -      -      -      -      -     vmovdqa      ymm0, ymmword ptr [r8]
    0.56   0.39   0.05    -      -      -      -      -     0.33   0.33    -     0.34    -     vpaddw       ymm0, ymm0, ymmword ptr [rdx]
    0.23   0.55   0.22    -      -      -      -      -      -      -      -      -      -     vmovdqa      ymmword ptr [rcx], ymm0
     -      -      -      -     0.49   0.50   0.01    -      -      -      -      -      -     vzeroupper
     -      -      -     0.50    -      -     0.50    -      -      -      -      -      -     ret
    
    
    Timeline view:
                        0123456789          01
    Index     0123456789          0123456789
    
    [0,0]     DeeeeeeeeER    .    .    .    ..   vmovdqa    ymm0, ymmword ptr [r8]
    [0,1]     D=eeeeeeeeER   .    .    .    ..   vpaddw     ymm0, ymm0, ymmword ptr [rdx]
    [0,2]     D=========eER  .    .    .    ..   vmovdqa    ymmword ptr [rcx], ymm0
    [0,3]     DeE---------R  .    .    .    ..   vzeroupper
    [0,4]     .DeE--------R  .    .    .    ..   ret
    [1,0]     .DeeeeeeeeE-R  .    .    .    ..   vmovdqa    ymm0, ymmword ptr [r8]
    [1,1]     .D=eeeeeeeeER  .    .    .    ..   vpaddw     ymm0, ymm0, ymmword ptr [rdx]
    [1,2]     . D========eER .    .    .    ..   vmovdqa    ymmword ptr [rcx], ymm0
    [1,3]     . DeE--------R .    .    .    ..   vzeroupper
    [1,4]     . DeE--------R .    .    .    ..   ret
    [2,0]     .  DeeeeeeeeER .    .    .    ..   vmovdqa    ymm0, ymmword ptr [r8]
    [2,1]     .  D=eeeeeeeeER.    .    .    ..   vpaddw     ymm0, ymm0, ymmword ptr [rdx]
    [2,2]     .  D=========eER    .    .    ..   vmovdqa    ymmword ptr [rcx], ymm0
    [2,3]     .  DeE---------R    .    .    ..   vzeroupper
    [2,4]     .   DeE--------R    .    .    ..   ret
    [3,0]     .   DeeeeeeeeE-R    .    .    ..   vmovdqa    ymm0, ymmword ptr [r8]
    [3,1]     .   D=eeeeeeeeER    .    .    ..   vpaddw     ymm0, ymm0, ymmword ptr [rdx]
    [3,2]     .    D========eER   .    .    ..   vmovdqa    ymmword ptr [rcx], ymm0
    [3,3]     .    DeE--------R   .    .    ..   vzeroupper
    [3,4]     .    DeE--------R   .    .    ..   ret
    [4,0]     .    .DeeeeeeeeER   .    .    ..   vmovdqa    ymm0, ymmword ptr [r8]
    [4,1]     .    .D=eeeeeeeeER  .    .    ..   vpaddw     ymm0, ymm0, ymmword ptr [rdx]
    [4,2]     .    .D=========eER .    .    ..   vmovdqa    ymmword ptr [rcx], ymm0
    [4,3]     .    .    . DeE---R .    .    ..   vzeroupper
    [4,4]     .    .    . DeE---R .    .    ..   ret
    [5,0]     .    .    . DeeeeeeeeER  .    ..   vmovdqa    ymm0, ymmword ptr [r8]
    [5,1]     .    .    .  DeeeeeeeeER .    ..   vpaddw     ymm0, ymm0, ymmword ptr [rdx]
    [5,2]     .    .    .  D========eER.    ..   vmovdqa    ymmword ptr [rcx], ymm0
    [5,3]     .    .    .  DeE--------R.    ..   vzeroupper
    [5,4]     .    .    .   DeE-------R.    ..   ret
    [6,0]     .    .    .   DeeeeeeeeER.    ..   vmovdqa    ymm0, ymmword ptr [r8]
    [6,1]     .    .    .   D=eeeeeeeeER    ..   vpaddw     ymm0, ymm0, ymmword ptr [rdx]
    [6,2]     .    .    .    D========eER   ..   vmovdqa    ymmword ptr [rcx], ymm0
    [6,3]     .    .    .    DeE--------R   ..   vzeroupper
    [6,4]     .    .    .    DeE--------R   ..   ret
    [7,0]     .    .    .    .DeeeeeeeeER   ..   vmovdqa    ymm0, ymmword ptr [r8]
    [7,1]     .    .    .    .D=eeeeeeeeER  ..   vpaddw     ymm0, ymm0, ymmword ptr [rdx]
    [7,2]     .    .    .    .D=========eER ..   vmovdqa    ymmword ptr [rcx], ymm0
    [7,3]     .    .    .    .DeE---------R ..   vzeroupper
    [7,4]     .    .    .    . DeE--------R ..   ret
    [8,0]     .    .    .    . DeeeeeeeeE-R ..   vmovdqa    ymm0, ymmword ptr [r8]
    [8,1]     .    .    .    . D=eeeeeeeeER ..   vpaddw     ymm0, ymm0, ymmword ptr [rdx]
    [8,2]     .    .    .    .  D========eER..   vmovdqa    ymmword ptr [rcx], ymm0
    [8,3]     .    .    .    .  DeE--------R..   vzeroupper
    [8,4]     .    .    .    .  DeE--------R..   ret
    [9,0]     .    .    .    .   DeeeeeeeeER..   vmovdqa    ymm0, ymmword ptr [r8]
    [9,1]     .    .    .    .   D=eeeeeeeeER.   vpaddw     ymm0, ymm0, ymmword ptr [rdx]
    [9,2]     .    .    .    .   D=========eER   vmovdqa    ymmword ptr [rcx], ymm0
    [9,3]     .    .    .    .    .   DeE----R   vzeroupper
    [9,4]     .    .    .    .    .   DeE----R   ret
    
    
    Average Wait times (based on the timeline view):
    [0]: Executions
    [1]: Average time spent waiting in a scheduler's queue
    [2]: Average time spent waiting in a scheduler's queue while ready
    [3]: Average time elapsed from WB until retire stage
    
          [0]    [1]    [2]    [3]
    0.     10    1.0    1.0    0.3       vmovdqa    ymm0, ymmword ptr [r8]
    1.     10    1.9    0.0    0.0       vpaddw     ymm0, ymm0, ymmword ptr [rdx]
    2.     10    9.5    0.0    0.0       vmovdqa    ymmword ptr [rcx], ymm0
    3.     10    1.0    1.0    7.4       vzeroupper
    4.     10    1.0    1.0    7.0       ret
           10    2.9    0.6    2.9       <total>
    warning: found a return instruction in the input assembly sequence.
    note: program counter updates are ignored.

scalar loop version:

        Finished `release` profile [optimized] target(s) in 0.15s
    Iterations:        100
    Instructions:      700
    Total Cycles:      211
    Total uOps:        800
    
    Dispatch Width:    4
    uOps Per Cycle:    3.79
    IPC:               3.32
    Block RThroughput: 2.0
    
    
    Instruction Info:
    [1]: #uOps
    [2]: Latency
    [3]: RThroughput
    [4]: MayLoad
    [5]: MayStore
    [6]: HasSideEffects (U)
    
    [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
     1      8     0.33    *                   movdqa        xmm0, xmmword ptr [r8]
     1      8     0.33    *                   paddw xmm0, xmmword ptr [rdx]
     1      1     0.33           *            movdqa        xmmword ptr [rcx], xmm0
     1      8     0.33    *                   movdqa        xmm0, xmmword ptr [r8 + 16]
     1      8     0.33    *                   paddw xmm0, xmmword ptr [rdx + 16]
     1      1     0.33           *            movdqa        xmmword ptr [rcx + 16], xmm0
     2      1     0.50                  U     ret
    
    
    Resources:
    [0]   - Zn2AGU0
    [1]   - Zn2AGU1
    [2]   - Zn2AGU2
    [3]   - Zn2ALU0
    [4]   - Zn2ALU1
    [5]   - Zn2ALU2
    [6]   - Zn2ALU3
    [7]   - Zn2Divider
    [8]   - Zn2FPU0
    [9]   - Zn2FPU1
    [10]  - Zn2FPU2
    [11]  - Zn2FPU3
    [12]  - Zn2Multiplier
    
    
    Resource pressure per iteration:
    [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]
    2.00   2.00   2.00   0.50    -      -     0.50    -     0.66   0.67    -     0.67    -
    
    Resource pressure by instruction:
    [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   Instructions:
    0.01   0.97   0.02    -      -      -      -      -      -      -      -      -      -     movdqa       xmm0, xmmword ptr [r8]
    0.50   0.49   0.01    -      -      -      -      -     0.33   0.33    -     0.34    -     paddw        xmm0, xmmword ptr [rdx]
    0.01   0.02   0.97    -      -      -      -      -      -      -      -      -      -     movdqa       xmmword ptr [rcx], xmm0
    0.97   0.02   0.01    -      -      -      -      -      -      -      -      -      -     movdqa       xmm0, xmmword ptr [r8 + 16]
    0.49   0.01   0.50    -      -      -      -      -     0.33   0.34    -     0.33    -     paddw        xmm0, xmmword ptr [rdx + 16]
    0.02   0.49   0.49    -      -      -      -      -      -      -      -      -      -     movdqa       xmmword ptr [rcx + 16], xmm0
     -      -      -     0.50    -      -     0.50    -      -      -      -      -      -     ret
    
    
    Timeline view:
                        0123456789          0
    Index     0123456789          0123456789
    
    [0,0]     DeeeeeeeeER    .    .    .    .   movdqa      xmm0, xmmword ptr [r8]
    [0,1]     D=eeeeeeeeER   .    .    .    .   paddw       xmm0, xmmword ptr [rdx]
    [0,2]     D=========eER  .    .    .    .   movdqa      xmmword ptr [rcx], xmm0
    [0,3]     DeeeeeeeeE--R  .    .    .    .   movdqa      xmm0, xmmword ptr [r8 + 16]
    [0,4]     .DeeeeeeeeE-R  .    .    .    .   paddw       xmm0, xmmword ptr [rdx + 16]
    [0,5]     .D========eER  .    .    .    .   movdqa      xmmword ptr [rcx + 16], xmm0
    [0,6]     .DeE--------R  .    .    .    .   ret
    [1,0]     . DeeeeeeeeER  .    .    .    .   movdqa      xmm0, xmmword ptr [r8]
    [1,1]     . D=eeeeeeeeER .    .    .    .   paddw       xmm0, xmmword ptr [rdx]
    [1,2]     . D=========eER.    .    .    .   movdqa      xmmword ptr [rcx], xmm0
    [1,3]     . DeeeeeeeeE--R.    .    .    .   movdqa      xmm0, xmmword ptr [r8 + 16]
    [1,4]     .  DeeeeeeeeE-R.    .    .    .   paddw       xmm0, xmmword ptr [rdx + 16]
    [1,5]     .  D========eER.    .    .    .   movdqa      xmmword ptr [rcx + 16], xmm0
    [1,6]     .  DeE--------R.    .    .    .   ret
    [2,0]     .   DeeeeeeeeER.    .    .    .   movdqa      xmm0, xmmword ptr [r8]
    [2,1]     .   D=eeeeeeeeER    .    .    .   paddw       xmm0, xmmword ptr [rdx]
    [2,2]     .   D=========eER   .    .    .   movdqa      xmmword ptr [rcx], xmm0
    [2,3]     .   DeeeeeeeeE--R   .    .    .   movdqa      xmm0, xmmword ptr [r8 + 16]
    [2,4]     .    DeeeeeeeeE-R   .    .    .   paddw       xmm0, xmmword ptr [rdx + 16]
    [2,5]     .    D========eER   .    .    .   movdqa      xmmword ptr [rcx + 16], xmm0
    [2,6]     .    DeE--------R   .    .    .   ret
    [3,0]     .    .DeeeeeeeeER   .    .    .   movdqa      xmm0, xmmword ptr [r8]
    [3,1]     .    .D=eeeeeeeeER  .    .    .   paddw       xmm0, xmmword ptr [rdx]
    [3,2]     .    .D=========eER .    .    .   movdqa      xmmword ptr [rcx], xmm0
    [3,3]     .    .DeeeeeeeeE--R .    .    .   movdqa      xmm0, xmmword ptr [r8 + 16]
    [3,4]     .    . DeeeeeeeeE-R .    .    .   paddw       xmm0, xmmword ptr [rdx + 16]
    [3,5]     .    . D========eER .    .    .   movdqa      xmmword ptr [rcx + 16], xmm0
    [3,6]     .    . DeE--------R .    .    .   ret
    [4,0]     .    .  DeeeeeeeeER .    .    .   movdqa      xmm0, xmmword ptr [r8]
    [4,1]     .    .  D=eeeeeeeeER.    .    .   paddw       xmm0, xmmword ptr [rdx]
    [4,2]     .    .  D=========eER    .    .   movdqa      xmmword ptr [rcx], xmm0
    [4,3]     .    .  DeeeeeeeeE--R    .    .   movdqa      xmm0, xmmword ptr [r8 + 16]
    [4,4]     .    .   D=eeeeeeeeER    .    .   paddw       xmm0, xmmword ptr [rdx + 16]
    [4,5]     .    .   D=========eER   .    .   movdqa      xmmword ptr [rcx + 16], xmm0
    [4,6]     .    .   DeE---------R   .    .   ret
    [5,0]     .    .    DeeeeeeeeE-R   .    .   movdqa      xmm0, xmmword ptr [r8]
    [5,1]     .    .    D=eeeeeeeeER   .    .   paddw       xmm0, xmmword ptr [rdx]
    [5,2]     .    .    D=========eER  .    .   movdqa      xmmword ptr [rcx], xmm0
    [5,3]     .    .    DeeeeeeeeE--R  .    .   movdqa      xmm0, xmmword ptr [r8 + 16]
    [5,4]     .    .    .D=eeeeeeeeER  .    .   paddw       xmm0, xmmword ptr [rdx + 16]
    [5,5]     .    .    .D=========eER .    .   movdqa      xmmword ptr [rcx + 16], xmm0
    [5,6]     .    .    .DeE---------R .    .   ret
    [6,0]     .    .    . DeeeeeeeeE-R .    .   movdqa      xmm0, xmmword ptr [r8]
    [6,1]     .    .    . D=eeeeeeeeER .    .   paddw       xmm0, xmmword ptr [rdx]
    [6,2]     .    .    . D=========eER.    .   movdqa      xmmword ptr [rcx], xmm0
    [6,3]     .    .    . DeeeeeeeeE--R.    .   movdqa      xmm0, xmmword ptr [r8 + 16]
    [6,4]     .    .    .  D=eeeeeeeeER.    .   paddw       xmm0, xmmword ptr [rdx + 16]
    [6,5]     .    .    .  D=========eER    .   movdqa      xmmword ptr [rcx + 16], xmm0
    [6,6]     .    .    .  DeE---------R    .   ret
    [7,0]     .    .    .   DeeeeeeeeE-R    .   movdqa      xmm0, xmmword ptr [r8]
    [7,1]     .    .    .   D=eeeeeeeeER    .   paddw       xmm0, xmmword ptr [rdx]
    [7,2]     .    .    .   D=========eER   .   movdqa      xmmword ptr [rcx], xmm0
    [7,3]     .    .    .   DeeeeeeeeE--R   .   movdqa      xmm0, xmmword ptr [r8 + 16]
    [7,4]     .    .    .    D=eeeeeeeeER   .   paddw       xmm0, xmmword ptr [rdx + 16]
    [7,5]     .    .    .    D=========eER  .   movdqa      xmmword ptr [rcx + 16], xmm0
    [7,6]     .    .    .    DeE---------R  .   ret
    [8,0]     .    .    .    .DeeeeeeeeE-R  .   movdqa      xmm0, xmmword ptr [r8]
    [8,1]     .    .    .    .D=eeeeeeeeER  .   paddw       xmm0, xmmword ptr [rdx]
    [8,2]     .    .    .    .D=========eER .   movdqa      xmmword ptr [rcx], xmm0
    [8,3]     .    .    .    .DeeeeeeeeE--R .   movdqa      xmm0, xmmword ptr [r8 + 16]
    [8,4]     .    .    .    . DeeeeeeeeE-R .   paddw       xmm0, xmmword ptr [rdx + 16]
    [8,5]     .    .    .    . D========eER .   movdqa      xmmword ptr [rcx + 16], xmm0
    [8,6]     .    .    .    . DeE--------R .   ret
    [9,0]     .    .    .    .  DeeeeeeeeER .   movdqa      xmm0, xmmword ptr [r8]
    [9,1]     .    .    .    .  D=eeeeeeeeER.   paddw       xmm0, xmmword ptr [rdx]
    [9,2]     .    .    .    .  D=========eER   movdqa      xmmword ptr [rcx], xmm0
    [9,3]     .    .    .    .  DeeeeeeeeE--R   movdqa      xmm0, xmmword ptr [r8 + 16]
    [9,4]     .    .    .    .   DeeeeeeeeE-R   paddw       xmm0, xmmword ptr [rdx + 16]
    [9,5]     .    .    .    .   D========eER   movdqa      xmmword ptr [rcx + 16], xmm0
    [9,6]     .    .    .    .   DeE--------R   ret
    
    
    Average Wait times (based on the timeline view):
    [0]: Executions
    [1]: Average time spent waiting in a scheduler's queue
    [2]: Average time spent waiting in a scheduler's queue while ready
    [3]: Average time elapsed from WB until retire stage
    
          [0]    [1]    [2]    [3]
    0.     10    1.0    1.0    0.4       movdqa     xmm0, xmmword ptr [r8]
    1.     10    2.0    0.0    0.0       paddw      xmm0, xmmword ptr [rdx]
    2.     10    10.0   0.0    0.0       movdqa     xmmword ptr [rcx], xmm0
    3.     10    1.0    1.0    2.0       movdqa     xmm0, xmmword ptr [r8 + 16]
    4.     10    1.4    0.4    0.6       paddw      xmm0, xmmword ptr [rdx + 16]
    5.     10    9.4    0.0    0.0       movdqa     xmmword ptr [rcx + 16], xmm0
    6.     10    1.0    1.0    8.4       ret
           10    3.7    0.5    1.6       <total>
    warning: found a return instruction in the input assembly sequence.
    note: program counter updates are ignored.

At this point, I assume there's either something I'm missing or some kind of quirk on my AMD Ryzen 5 3600 CPU that causes AVX2 instructions to be slower than I'm expecting.

Any thoughts/advice/insights are appreciated.

Have you properly enabled the AVX2 target feature for your code? You need to do it either with RUSTFLAGS="-C target-feature=+avx2" or by using #[target_feature(enable = "avx2")]. Without it, the compiler will not inline intrinsics, as can be seen here.
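
For reference, a minimal x86_64-only sketch of the #[target_feature] approach combined with a runtime check (the function names here are made up for illustration, not taken from the crate):

    use core::arch::x86_64::{
        __m256i, _mm256_add_epi16, _mm256_loadu_si256, _mm256_storeu_si256,
    };

    /// Compiled with AVX2 enabled, so the compiler is free to use (and inline)
    /// 256-bit instructions in this function's body.
    #[target_feature(enable = "avx2")]
    unsafe fn add_u16x16_avx2(a: &[u16; 16], b: &[u16; 16], out: &mut [u16; 16]) {
        // Unaligned loads/stores keep the sketch self-contained.
        let va = _mm256_loadu_si256(a.as_ptr().cast::<__m256i>());
        let vb = _mm256_loadu_si256(b.as_ptr().cast::<__m256i>());
        _mm256_storeu_si256(out.as_mut_ptr().cast::<__m256i>(), _mm256_add_epi16(va, vb));
    }

    fn add_u16x16(a: &[u16; 16], b: &[u16; 16], out: &mut [u16; 16]) {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: we just checked that AVX2 is available at runtime.
            unsafe { add_u16x16_avx2(a, b, out) }
        } else {
            for i in 0..16 {
                // `paddw` wraps on overflow, so the fallback wraps too.
                out[i] = a[i].wrapping_add(b[i]);
            }
        }
    }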

I am not sure about this exact CPU model, but some earlier AMD CPUs had "fake" AVX2 support that was emulated using 128-bit ALUs.

What happens if you compile the scalar one with -C target-cpu=znver3? What does LLVM think is best for that processor?
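For example, reusing the cargo-show-asm invocation from earlier in the thread:

    RUSTFLAGS="-C target-cpu=znver3" cargo asm --bench=simd_sum scalar_vertical_add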

The output from llvm-mca doesn't line up with your benchmarks: it seems to suggest the scalar code is about 15% faster, not 80%. There's something weird going on in the fourth iteration of the AVX2 loop. What does the benchmark code look like? Are the source and destination in cache? llvm-mca also has trouble with ret. Can you benchmark a loop that doesn't call this operation as a function, but instead inlines it? It's kind of misleading not to inline it; it's not doing anything besides two loads, an add, and a store. Usually you want to do more work once your data is in a register, and whatever logic is controlling your loop will have an impact when the loop body is this short.
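
As a rough illustration of benchmarking the operation inlined into the loop body rather than behind an #[inline(never)] call (a generic criterion sketch, not the crate's actual bench code; shown with the scalar body for brevity):

    use criterion::{criterion_group, criterion_main, Criterion};
    use std::hint::black_box;

    fn bench_vertical_add(c: &mut Criterion) {
        let a = [1u16; 16];
        let b = [2u16; 16];
        c.bench_function("inlined vertical add", |bencher| {
            bencher.iter(|| {
                // black_box keeps the inputs opaque so the computation isn't
                // constant-folded away, without forcing a function call.
                let (a, b) = (black_box(a), black_box(b));
                let mut out = [0u16; 16];
                for i in 0..16 {
                    out[i] = a[i].wrapping_add(b[i]);
                }
                black_box(out)
            })
        });
    }

    criterion_group!(benches, bench_vertical_add);
    criterion_main!(benches);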

Minor correction: I think it's znver2 for this particular processor.

Oh, you're right. My brain read "5600", which would be znver3, but for a 3600 it's znver2.

Yeah, I'm using the latter in the function that's called by the benchmark:

    /// "vertically" Add two SIMD values to eachother using AVX2 instructions.
    ///
    /// "vertical" means each lane of the resulting SIMD value contains the sum of the coresponding
    /// lanes of `a` and `b`.
    ///
    /// # Safety
    /// The caller must ensure that AVX2 CPU features are supported, otherwise calling this function will
    /// execute unsupoorted instructions (which is immediate undefined behaviour).
    #[cfg(any(feature = "std", target_feature = "avx2"))]
    #[target_feature(enable = "avx2")]
    pub unsafe fn avx2_vertical_add(a: Self, b: Self) -> Self {
        // Check that the number of lanes is good (this is a compile-time check triggered by seeing this const).
        Self::_MENTION_ME_TO_ASSERT_LANES_MATCH_SIZE;

        #[cfg(target_arch = "x86")]
        use core::arch::x86::*;
        #[cfg(target_arch = "x86_64")]
        use core::arch::x86_64::*;

        let result = match size_of::<S>() {
            1 => _mm256_add_epi8(a.inner.avx, b.inner.avx),
            2 => _mm256_add_epi16(a.inner.avx, b.inner.avx),
            4 => _mm256_add_epi32(a.inner.avx, b.inner.avx),
            8 => _mm256_add_epi64(a.inner.avx, b.inner.avx),
            _ => crate::unreachable_uncheched_on_release(),
        };

        Self::from_intrinsic(result)
    }

Lots of good advice -- I appreciate you all so much!

Whoever mentioned inlining had it right -- the benchmark driver function was not inlining the call to avx2_vertical_add, and once I got that working, the SIMD version performed about identically to the scalar version (which the compiler had reduced to two SSE instructions anyway).
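
For anyone hitting the same thing, a minimal sketch of one way to allow that inlining is to compile the wrapper itself with the AVX2 feature, so the #[target_feature] callee can be inlined into it (simplified, not necessarily the crate's exact code):

    // Because the wrapper is compiled with the same target feature as the
    // callee, the compiler may inline `avx2_vertical_add`, leaving just the
    // loads, a vpaddw, and a store behind the (still opaque) call.
    #[inline(never)] // keep the call boundary for the benchmark itself
    #[target_feature(enable = "avx2")]
    pub unsafe fn simd_vertical_add(a: u16x16, b: u16x16) -> u16x16 {
        // SAFETY: the caller must have verified AVX2 support.
        unsafe { u16x16::avx2_vertical_add(a, b) }
    }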

The scalar version is still very slightly faster: 1.2 ns vs 1.6 ns at the moment -- not sure what exactly is causing that, but it's close enough that I might just ignore it.

    Vertical Sum/SIMD (AVX2) Vertical Add/Simd256Integer { phantom: PhantomData<[u16; 16]>, inner: Simd2... #5
                            time:   [1.5915 ns 1.6238 ns 1.6583 ns]
                            change: [-77.435% -77.014% -76.591%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high mild
    Vertical Sum/Scalar Vertical Add/Simd256Integer { phantom: PhantomData<[u16; 16]>, inner: Simd256Int... #5
                            time:   [1.2675 ns 1.2706 ns 1.2739 ns]
                            change: [-18.192% -17.056% -16.009%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 11 outliers among 100 measurements (11.00%)
      5 (5.00%) high mild
      6 (6.00%) high severe
