I've also posted this on Reddit; this is a copy of that post.
Hi all!
I'm working on an open-source project, x86-simd, and am seeing some very confusing benchmark results. I have two functions that each add 16 integers to 16 other integers lane by lane, one using an AVX2 intrinsic and the other a plain loop, and the loop version appears to be faster. The relevant functions are as follows:
#[inline(never)]
pub fn simd_vertical_add(a: u16x16, b: u16x16) -> u16x16 {
    // SAFETY: Assume that we have AVX2 -- it should be checked outside this benchmark function.
    unsafe { u16x16::avx2_vertical_add(a, b) }
}

#[inline(never)]
pub fn scalar_vertical_add(a: u16x16, b: u16x16) -> u16x16 {
    let mut result = [0; 16];
    for i in 0..16 {
        result[i] = a.as_array_ref()[i] + b.as_array_ref()[i];
    }
    u16x16::from_array(result)
}
(these functions and the rest of the crate are on github here)
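For anyone reading without the crate open, here is a standalone sketch of roughly what a wrapper like `avx2_vertical_add` can boil down to. It is not the crate's code: it works on plain `[u16; 16]` arrays instead of `u16x16`, and the names `avx2_add`, `scalar_add`, and `vertical_add` are made up for illustration. The scalar fallback uses `wrapping_add` so that it matches the wrap-around behavior of `vpaddw`.

```rust
// Hypothetical stand-in for the crate's AVX2 path; operates on plain arrays.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn avx2_add(a: &[u16; 16], b: &[u16; 16]) -> [u16; 16] {
    use std::arch::x86_64::{
        __m256i, _mm256_add_epi16, _mm256_loadu_si256, _mm256_storeu_si256,
    };
    let mut out = [0u16; 16];
    unsafe {
        // Unaligned loads/stores: plain arrays are not guaranteed 32-byte aligned.
        let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
        // Lane-wise 16-bit add (compiles to vpaddw); wraps on overflow.
        let sum = _mm256_add_epi16(va, vb);
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, sum);
    }
    out
}

// Scalar reference with the same wrap-around semantics as vpaddw.
fn scalar_add(a: &[u16; 16], b: &[u16; 16]) -> [u16; 16] {
    let mut out = [0u16; 16];
    for i in 0..16 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

// Picks the AVX2 path only when the CPU reports support at runtime.
fn vertical_add(a: &[u16; 16], b: &[u16; 16]) -> [u16; 16] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just checked at runtime.
            return unsafe { avx2_add(a, b) };
        }
    }
    scalar_add(a, b)
}

fn main() {
    let a = [u16::MAX; 16];
    let b = [1u16; 16];
    println!("{:?}", vertical_add(&a, &b));
}
```

The runtime check lives in `vertical_add` rather than in the AVX2 function, which mirrors the "checked outside this benchmark function" comment in the real code.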
When I benchmark these functions with the criterion crate, the first takes about 7 ns per iteration, whereas the second takes only 1.5 ns. This was very confusing to me. Using cargo-show-asm, I found that the compiler optimizes the loop down to SSE2 SIMD instructions, which did not surprise me. What did surprise me is that they seem to be faster. I'll add the llvm-mca output below, which I'm using to try to reason about why the SSE2 version is faster than the AVX2 version:
simd version:
cargo asm --mca -M -timeline --bench=simd_sum avx2_vertical_add
Compiling x86-simd v0.2.0 (C:\Users\****\Documents\Projects\x86-simd)
Finished `release` profile [optimized] target(s) in 11.07s
Iterations: 100
Instructions: 500
Total Cycles: 249
Total uOps: 600
Dispatch Width: 4
uOps Per Cycle: 2.41
IPC: 2.01
Block RThroughput: 1.5
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      8     0.33    *                   vmovdqa ymm0, ymmword ptr [r8]
 1      8     0.33    *                   vpaddw ymm0, ymm0, ymmword ptr [rdx]
 1      1     0.33           *            vmovdqa ymmword ptr [rcx], ymm0
 1      1     0.25                  U     vzeroupper
 2      1     0.50                  U     ret
Resources:
[0] - Zn2AGU0
[1] - Zn2AGU1
[2] - Zn2AGU2
[3] - Zn2ALU0
[4] - Zn2ALU1
[5] - Zn2ALU2
[6] - Zn2ALU3
[7] - Zn2Divider
[8] - Zn2FPU0
[9] - Zn2FPU1
[10] - Zn2FPU2
[11] - Zn2FPU3
[12] - Zn2Multiplier
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]
1.00 1.00 1.00 0.50 0.49 0.50 0.51 - 0.33 0.33 - 0.34 -
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] Instructions:
0.21 0.06 0.73 - - - - - - - - - - vmovdqa ymm0, ymmword ptr [r8]
0.56 0.39 0.05 - - - - - 0.33 0.33 - 0.34 - vpaddw ymm0, ymm0, ymmword ptr [rdx]
0.23 0.55 0.22 - - - - - - - - - - vmovdqa ymmword ptr [rcx], ymm0
- - - - 0.49 0.50 0.01 - - - - - - vzeroupper
- - - 0.50 - - 0.50 - - - - - - ret
Timeline view:
0123456789 01
Index 0123456789 0123456789
[0,0] DeeeeeeeeER . . . .. vmovdqa ymm0, ymmword ptr [r8]
[0,1] D=eeeeeeeeER . . . .. vpaddw ymm0, ymm0, ymmword ptr [rdx]
[0,2] D=========eER . . . .. vmovdqa ymmword ptr [rcx], ymm0
[0,3] DeE---------R . . . .. vzeroupper
[0,4] .DeE--------R . . . .. ret
[1,0] .DeeeeeeeeE-R . . . .. vmovdqa ymm0, ymmword ptr [r8]
[1,1] .D=eeeeeeeeER . . . .. vpaddw ymm0, ymm0, ymmword ptr [rdx]
[1,2] . D========eER . . . .. vmovdqa ymmword ptr [rcx], ymm0
[1,3] . DeE--------R . . . .. vzeroupper
[1,4] . DeE--------R . . . .. ret
[2,0] . DeeeeeeeeER . . . .. vmovdqa ymm0, ymmword ptr [r8]
[2,1] . D=eeeeeeeeER. . . .. vpaddw ymm0, ymm0, ymmword ptr [rdx]
[2,2] . D=========eER . . .. vmovdqa ymmword ptr [rcx], ymm0
[2,3] . DeE---------R . . .. vzeroupper
[2,4] . DeE--------R . . .. ret
[3,0] . DeeeeeeeeE-R . . .. vmovdqa ymm0, ymmword ptr [r8]
[3,1] . D=eeeeeeeeER . . .. vpaddw ymm0, ymm0, ymmword ptr [rdx]
[3,2] . D========eER . . .. vmovdqa ymmword ptr [rcx], ymm0
[3,3] . DeE--------R . . .. vzeroupper
[3,4] . DeE--------R . . .. ret
[4,0] . .DeeeeeeeeER . . .. vmovdqa ymm0, ymmword ptr [r8]
[4,1] . .D=eeeeeeeeER . . .. vpaddw ymm0, ymm0, ymmword ptr [rdx]
[4,2] . .D=========eER . . .. vmovdqa ymmword ptr [rcx], ymm0
[4,3] . . . DeE---R . . .. vzeroupper
[4,4] . . . DeE---R . . .. ret
[5,0] . . . DeeeeeeeeER . .. vmovdqa ymm0, ymmword ptr [r8]
[5,1] . . . DeeeeeeeeER . .. vpaddw ymm0, ymm0, ymmword ptr [rdx]
[5,2] . . . D========eER. .. vmovdqa ymmword ptr [rcx], ymm0
[5,3] . . . DeE--------R. .. vzeroupper
[5,4] . . . DeE-------R. .. ret
[6,0] . . . DeeeeeeeeER. .. vmovdqa ymm0, ymmword ptr [r8]
[6,1] . . . D=eeeeeeeeER .. vpaddw ymm0, ymm0, ymmword ptr [rdx]
[6,2] . . . D========eER .. vmovdqa ymmword ptr [rcx], ymm0
[6,3] . . . DeE--------R .. vzeroupper
[6,4] . . . DeE--------R .. ret
[7,0] . . . .DeeeeeeeeER .. vmovdqa ymm0, ymmword ptr [r8]
[7,1] . . . .D=eeeeeeeeER .. vpaddw ymm0, ymm0, ymmword ptr [rdx]
[7,2] . . . .D=========eER .. vmovdqa ymmword ptr [rcx], ymm0
[7,3] . . . .DeE---------R .. vzeroupper
[7,4] . . . . DeE--------R .. ret
[8,0] . . . . DeeeeeeeeE-R .. vmovdqa ymm0, ymmword ptr [r8]
[8,1] . . . . D=eeeeeeeeER .. vpaddw ymm0, ymm0, ymmword ptr [rdx]
[8,2] . . . . D========eER.. vmovdqa ymmword ptr [rcx], ymm0
[8,3] . . . . DeE--------R.. vzeroupper
[8,4] . . . . DeE--------R.. ret
[9,0] . . . . DeeeeeeeeER.. vmovdqa ymm0, ymmword ptr [r8]
[9,1] . . . . D=eeeeeeeeER. vpaddw ymm0, ymm0, ymmword ptr [rdx]
[9,2] . . . . D=========eER vmovdqa ymmword ptr [rcx], ymm0
[9,3] . . . . . DeE----R vzeroupper
[9,4] . . . . . DeE----R ret
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
[0] [1] [2] [3]
0. 10 1.0 1.0 0.3 vmovdqa ymm0, ymmword ptr [r8]
1. 10 1.9 0.0 0.0 vpaddw ymm0, ymm0, ymmword ptr [rdx]
2. 10 9.5 0.0 0.0 vmovdqa ymmword ptr [rcx], ymm0
3. 10 1.0 1.0 7.4 vzeroupper
4. 10 1.0 1.0 7.0 ret
10 2.9 0.6 2.9 <total>
warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.
scalar loop version:
Finished `release` profile [optimized] target(s) in 0.15s
Iterations: 100
Instructions: 700
Total Cycles: 211
Total uOps: 800
Dispatch Width: 4
uOps Per Cycle: 3.79
IPC: 3.32
Block RThroughput: 2.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      8     0.33    *                   movdqa xmm0, xmmword ptr [r8]
 1      8     0.33    *                   paddw xmm0, xmmword ptr [rdx]
 1      1     0.33           *            movdqa xmmword ptr [rcx], xmm0
 1      8     0.33    *                   movdqa xmm0, xmmword ptr [r8 + 16]
 1      8     0.33    *                   paddw xmm0, xmmword ptr [rdx + 16]
 1      1     0.33           *            movdqa xmmword ptr [rcx + 16], xmm0
 2      1     0.50                  U     ret
Resources:
[0] - Zn2AGU0
[1] - Zn2AGU1
[2] - Zn2AGU2
[3] - Zn2ALU0
[4] - Zn2ALU1
[5] - Zn2ALU2
[6] - Zn2ALU3
[7] - Zn2Divider
[8] - Zn2FPU0
[9] - Zn2FPU1
[10] - Zn2FPU2
[11] - Zn2FPU3
[12] - Zn2Multiplier
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]
2.00 2.00 2.00 0.50 - - 0.50 - 0.66 0.67 - 0.67 -
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] Instructions:
0.01 0.97 0.02 - - - - - - - - - - movdqa xmm0, xmmword ptr [r8]
0.50 0.49 0.01 - - - - - 0.33 0.33 - 0.34 - paddw xmm0, xmmword ptr [rdx]
0.01 0.02 0.97 - - - - - - - - - - movdqa xmmword ptr [rcx], xmm0
0.97 0.02 0.01 - - - - - - - - - - movdqa xmm0, xmmword ptr [r8 + 16]
0.49 0.01 0.50 - - - - - 0.33 0.34 - 0.33 - paddw xmm0, xmmword ptr [rdx + 16]
0.02 0.49 0.49 - - - - - - - - - - movdqa xmmword ptr [rcx + 16], xmm0
- - - 0.50 - - 0.50 - - - - - - ret
Timeline view:
0123456789 0
Index 0123456789 0123456789
[0,0] DeeeeeeeeER . . . . movdqa xmm0, xmmword ptr [r8]
[0,1] D=eeeeeeeeER . . . . paddw xmm0, xmmword ptr [rdx]
[0,2] D=========eER . . . . movdqa xmmword ptr [rcx], xmm0
[0,3] DeeeeeeeeE--R . . . . movdqa xmm0, xmmword ptr [r8 + 16]
[0,4] .DeeeeeeeeE-R . . . . paddw xmm0, xmmword ptr [rdx + 16]
[0,5] .D========eER . . . . movdqa xmmword ptr [rcx + 16], xmm0
[0,6] .DeE--------R . . . . ret
[1,0] . DeeeeeeeeER . . . . movdqa xmm0, xmmword ptr [r8]
[1,1] . D=eeeeeeeeER . . . . paddw xmm0, xmmword ptr [rdx]
[1,2] . D=========eER. . . . movdqa xmmword ptr [rcx], xmm0
[1,3] . DeeeeeeeeE--R. . . . movdqa xmm0, xmmword ptr [r8 + 16]
[1,4] . DeeeeeeeeE-R. . . . paddw xmm0, xmmword ptr [rdx + 16]
[1,5] . D========eER. . . . movdqa xmmword ptr [rcx + 16], xmm0
[1,6] . DeE--------R. . . . ret
[2,0] . DeeeeeeeeER. . . . movdqa xmm0, xmmword ptr [r8]
[2,1] . D=eeeeeeeeER . . . paddw xmm0, xmmword ptr [rdx]
[2,2] . D=========eER . . . movdqa xmmword ptr [rcx], xmm0
[2,3] . DeeeeeeeeE--R . . . movdqa xmm0, xmmword ptr [r8 + 16]
[2,4] . DeeeeeeeeE-R . . . paddw xmm0, xmmword ptr [rdx + 16]
[2,5] . D========eER . . . movdqa xmmword ptr [rcx + 16], xmm0
[2,6] . DeE--------R . . . ret
[3,0] . .DeeeeeeeeER . . . movdqa xmm0, xmmword ptr [r8]
[3,1] . .D=eeeeeeeeER . . . paddw xmm0, xmmword ptr [rdx]
[3,2] . .D=========eER . . . movdqa xmmword ptr [rcx], xmm0
[3,3] . .DeeeeeeeeE--R . . . movdqa xmm0, xmmword ptr [r8 + 16]
[3,4] . . DeeeeeeeeE-R . . . paddw xmm0, xmmword ptr [rdx + 16]
[3,5] . . D========eER . . . movdqa xmmword ptr [rcx + 16], xmm0
[3,6] . . DeE--------R . . . ret
[4,0] . . DeeeeeeeeER . . . movdqa xmm0, xmmword ptr [r8]
[4,1] . . D=eeeeeeeeER. . . paddw xmm0, xmmword ptr [rdx]
[4,2] . . D=========eER . . movdqa xmmword ptr [rcx], xmm0
[4,3] . . DeeeeeeeeE--R . . movdqa xmm0, xmmword ptr [r8 + 16]
[4,4] . . D=eeeeeeeeER . . paddw xmm0, xmmword ptr [rdx + 16]
[4,5] . . D=========eER . . movdqa xmmword ptr [rcx + 16], xmm0
[4,6] . . DeE---------R . . ret
[5,0] . . DeeeeeeeeE-R . . movdqa xmm0, xmmword ptr [r8]
[5,1] . . D=eeeeeeeeER . . paddw xmm0, xmmword ptr [rdx]
[5,2] . . D=========eER . . movdqa xmmword ptr [rcx], xmm0
[5,3] . . DeeeeeeeeE--R . . movdqa xmm0, xmmword ptr [r8 + 16]
[5,4] . . .D=eeeeeeeeER . . paddw xmm0, xmmword ptr [rdx + 16]
[5,5] . . .D=========eER . . movdqa xmmword ptr [rcx + 16], xmm0
[5,6] . . .DeE---------R . . ret
[6,0] . . . DeeeeeeeeE-R . . movdqa xmm0, xmmword ptr [r8]
[6,1] . . . D=eeeeeeeeER . . paddw xmm0, xmmword ptr [rdx]
[6,2] . . . D=========eER. . movdqa xmmword ptr [rcx], xmm0
[6,3] . . . DeeeeeeeeE--R. . movdqa xmm0, xmmword ptr [r8 + 16]
[6,4] . . . D=eeeeeeeeER. . paddw xmm0, xmmword ptr [rdx + 16]
[6,5] . . . D=========eER . movdqa xmmword ptr [rcx + 16], xmm0
[6,6] . . . DeE---------R . ret
[7,0] . . . DeeeeeeeeE-R . movdqa xmm0, xmmword ptr [r8]
[7,1] . . . D=eeeeeeeeER . paddw xmm0, xmmword ptr [rdx]
[7,2] . . . D=========eER . movdqa xmmword ptr [rcx], xmm0
[7,3] . . . DeeeeeeeeE--R . movdqa xmm0, xmmword ptr [r8 + 16]
[7,4] . . . D=eeeeeeeeER . paddw xmm0, xmmword ptr [rdx + 16]
[7,5] . . . D=========eER . movdqa xmmword ptr [rcx + 16], xmm0
[7,6] . . . DeE---------R . ret
[8,0] . . . .DeeeeeeeeE-R . movdqa xmm0, xmmword ptr [r8]
[8,1] . . . .D=eeeeeeeeER . paddw xmm0, xmmword ptr [rdx]
[8,2] . . . .D=========eER . movdqa xmmword ptr [rcx], xmm0
[8,3] . . . .DeeeeeeeeE--R . movdqa xmm0, xmmword ptr [r8 + 16]
[8,4] . . . . DeeeeeeeeE-R . paddw xmm0, xmmword ptr [rdx + 16]
[8,5] . . . . D========eER . movdqa xmmword ptr [rcx + 16], xmm0
[8,6] . . . . DeE--------R . ret
[9,0] . . . . DeeeeeeeeER . movdqa xmm0, xmmword ptr [r8]
[9,1] . . . . D=eeeeeeeeER. paddw xmm0, xmmword ptr [rdx]
[9,2] . . . . D=========eER movdqa xmmword ptr [rcx], xmm0
[9,3] . . . . DeeeeeeeeE--R movdqa xmm0, xmmword ptr [r8 + 16]
[9,4] . . . . DeeeeeeeeE-R paddw xmm0, xmmword ptr [rdx + 16]
[9,5] . . . . D========eER movdqa xmmword ptr [rcx + 16], xmm0
[9,6] . . . . DeE--------R ret
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
[0] [1] [2] [3]
0. 10 1.0 1.0 0.4 movdqa xmm0, xmmword ptr [r8]
1. 10 2.0 0.0 0.0 paddw xmm0, xmmword ptr [rdx]
2. 10 10.0 0.0 0.0 movdqa xmmword ptr [rcx], xmm0
3. 10 1.0 1.0 2.0 movdqa xmm0, xmmword ptr [r8 + 16]
4. 10 1.4 0.4 0.6 paddw xmm0, xmmword ptr [rdx + 16]
5. 10 9.4 0.0 0.0 movdqa xmmword ptr [rcx + 16], xmm0
6. 10 1.0 1.0 8.4 ret
10 3.7 0.5 1.6 <total>
warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.
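To relate the two reports to the measured times, the summary fields can be recomputed from the raw counters and converted to time. The sketch below does that. The struct and function names are made up for illustration, and the 3.6 GHz clock is an assumption (the Ryzen 5 3600's nominal base clock), since I don't know the actual clock during the benchmark.

```rust
/// Raw counters copied from the top of an llvm-mca report.
struct McaSummary {
    name: &'static str,
    iterations: u32,
    instructions: u32,
    total_cycles: u32,
    total_uops: u32,
}

impl McaSummary {
    /// Instructions per cycle, as llvm-mca reports in its "IPC" field.
    fn ipc(&self) -> f64 {
        f64::from(self.instructions) / f64::from(self.total_cycles)
    }

    /// Micro-ops per cycle, as in the "uOps Per Cycle" field.
    fn uops_per_cycle(&self) -> f64 {
        f64::from(self.total_uops) / f64::from(self.total_cycles)
    }

    /// Simulated cycles for one pass over the block.
    fn cycles_per_iteration(&self) -> f64 {
        f64::from(self.total_cycles) / f64::from(self.iterations)
    }

    /// Converts simulated cycles to nanoseconds at a given clock (GHz = cycles per ns).
    fn ns_per_iteration(&self, clock_ghz: f64) -> f64 {
        self.cycles_per_iteration() / clock_ghz
    }
}

// Values taken from the two reports above.
const AVX2: McaSummary = McaSummary {
    name: "simd (AVX2)",
    iterations: 100,
    instructions: 500,
    total_cycles: 249,
    total_uops: 600,
};
const SSE2: McaSummary = McaSummary {
    name: "scalar loop (SSE2)",
    iterations: 100,
    instructions: 700,
    total_cycles: 211,
    total_uops: 800,
};

// ASSUMPTION: nominal base clock; the real clock during the run is unknown.
const ASSUMED_CLOCK_GHZ: f64 = 3.6;

fn main() {
    for s in [&AVX2, &SSE2] {
        println!(
            "{}: IPC {:.2}, uOps/cycle {:.2}, {:.2} cycles/iter, ~{:.2} ns/iter",
            s.name,
            s.ipc(),
            s.uops_per_cycle(),
            s.cycles_per_iteration(),
            s.ns_per_iteration(ASSUMED_CLOCK_GHZ),
        );
    }
}
```

Under that clock assumption, both blocks come out below 1 ns per iteration in llvm-mca's model (2.49 vs 2.11 cycles), so the model by itself does not reproduce the 7 ns vs 1.5 ns gap I measure. llvm-mca simulates the instruction sequence repeated in isolation and, as its note says, ignores program counter updates.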
At this point, I assume there's either something I'm missing or some quirk of my AMD Ryzen 5 3600 CPU that makes AVX2 instructions slower than I'm expecting.
Any thoughts/advice/insights are appreciated.
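In case it helps anyone reproduce this without pulling in criterion, here is a minimal std-only timing sketch. It is not the benchmark I actually ran: it uses a plain `[u16; 16]` stand-in for `u16x16`, `ns_per_call` is a hypothetical helper, and a simple wall-clock loop like this is much noisier than criterion's statistics.

```rust
use std::hint::black_box;
use std::time::Instant;

// Stand-in for the crate's scalar function, on plain arrays.
#[inline(never)]
fn scalar_vertical_add(a: [u16; 16], b: [u16; 16]) -> [u16; 16] {
    let mut result = [0u16; 16];
    for i in 0..16 {
        result[i] = a[i].wrapping_add(b[i]);
    }
    result
}

// Hypothetical helper: average wall-clock nanoseconds per call over `iters` calls.
fn ns_per_call(iters: u32, f: impl Fn([u16; 16], [u16; 16]) -> [u16; 16]) -> f64 {
    let a = [1u16; 16];
    let b = [2u16; 16];
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the compiler from constant-folding the inputs
        // or discarding the unused result.
        black_box(f(black_box(a), black_box(b)));
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

fn main() {
    let ns = ns_per_call(10_000_000, scalar_vertical_add);
    println!("scalar_vertical_add: {ns:.2} ns/call");
}
```

Build with `--release`; the numbers from a loop like this include the call and argument-passing overhead, which is also true of the criterion measurement.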