Problem
I've written some SIMD intrinsics code targeting x86-64: SSE4.1, AVX, and AVX2. While reading the assembly output, I came across a troubling result: when I made the mistake of enabling only AVX rather than AVX2, the compiler still accepted the AVX2 intrinsics but generated very strange code sequences for them.
This is problematic because code like this can go unnoticed unless the actual assembly is inspected with every compiler upgrade. Is there any way to disable or protect against this code generation?
Example
Note that `no_avx2` below still compiles, but to a much slower version of `yes_avx2`. One would expect `_mm256_and_si256` to compile to little more than a single `vandps` instruction.
Code:
```rust
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
pub unsafe fn yes_avx2(a: __m256i, b: __m256i) -> __m256i {
    _mm256_and_si256(a, b)
}

#[target_feature(enable = "avx")]
pub unsafe fn no_avx2(a: __m256i, b: __m256i) -> __m256i {
    _mm256_and_si256(a, b)
}
```
Assembly (Compiler Explorer with `-C opt-level=3`):
```asm
core::core_arch::x86::avx2::_mm256_and_si256::hda082db2f6c5855a:
        vmovaps (%rdx), %ymm0
        vandps  (%rsi), %ymm0, %ymm0
        vmovaps %ymm0, (%rdi)
        vzeroupper
        retq

example::no_avx2::h5e8ad84ba7048fcb:
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %rbx
        andq    $-32, %rsp
        subq    $96, %rsp
        movq    %rdi, %rbx
        vmovaps (%rsi), %ymm0
        vmovaps %ymm0, (%rsp)
        vmovaps (%rdx), %ymm0
        vmovaps %ymm0, 32(%rsp)
        movq    %rsp, %rsi
        leaq    32(%rsp), %rdx
        vzeroupper
        callq   core::core_arch::x86::avx2::_mm256_and_si256::hda082db2f6c5855a
        movq    %rbx, %rax
        leaq    -8(%rbp), %rsp
        popq    %rbx
        popq    %rbp
        retq

example::yes_avx2::ha14dc1d4af094302:
        movq    %rdi, %rax
        vmovaps (%rdx), %ymm0
        vandps  (%rsi), %ymm0, %ymm0
        vmovaps %ymm0, (%rdi)
        vzeroupper
        retq
```
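For context, the pattern I'd expect to be the safe way to use these intrinsics is runtime dispatch with `is_x86_feature_detected!`, keeping every intrinsic call inside the `#[target_feature]` function so nothing crosses a feature boundary. A minimal sketch of that pattern (the names `and256`, `and_avx2`, and `and_scalar` are illustrative, not from the code above; x86-64 only):

```rust
use std::arch::x86_64::*;

// AVX2 path: loads, the AND, and the store all live inside the
// `#[target_feature]` function, so no intrinsic is called from a
// context without the feature enabled.
#[target_feature(enable = "avx2")]
unsafe fn and_avx2(a: &[u64; 4], b: &[u64; 4]) -> [u64; 4] {
    let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
    let mut out = [0u64; 4];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_and_si256(va, vb));
    out
}

// Scalar fallback for CPUs without AVX2.
fn and_scalar(a: &[u64; 4], b: &[u64; 4]) -> [u64; 4] {
    [a[0] & b[0], a[1] & b[1], a[2] & b[2], a[3] & b[3]]
}

// Public entry point: pick the implementation at runtime.
pub fn and256(a: [u64; 4], b: [u64; 4]) -> [u64; 4] {
    if is_x86_feature_detected!("avx2") {
        // Sound: the check above guarantees the CPU supports AVX2.
        unsafe { and_avx2(&a, &b) }
    } else {
        and_scalar(&a, &b)
    }
}

fn main() {
    let r = and256([0xFF00_FF00; 4], [0x0F0F_0F0F; 4]);
    assert_eq!(r, [0x0F00_0F00; 4]);
}
```

This still doesn't catch the mismatch at compile time, though; the question is whether there is any way to make the `no_avx2`-style mistake a hard error rather than silent slow code.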