Problem
I've written some SIMD intrinsics code targeting x86-64: SSE4.1, AVX, and AVX2. While reading the assembly output, I came across a troubling result: when I made the mistake of enabling only AVX rather than AVX2, the compiler still accepted the AVX2 intrinsics but generated very strange code sequences for them.
This is problematic because code like this can go unnoticed unless the actual assembly is inspected with every compiler upgrade. Is there any way to disable or protect against this code generation?
Example
Note that `no_avx2` below still compiles, but to a much slower version of `yes_avx2`. One would expect `_mm256_and_si256` to compile to little more than a single `vandps` instruction.
Code:
```rust
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
pub unsafe fn yes_avx2(a: __m256i, b: __m256i) -> __m256i {
    _mm256_and_si256(a, b)
}

#[target_feature(enable = "avx")]
pub unsafe fn no_avx2(a: __m256i, b: __m256i) -> __m256i {
    _mm256_and_si256(a, b)
}
```
Assembly (Compiler Explorer with `-C opt-level=3`):
```asm
core::core_arch::x86::avx2::_mm256_and_si256::hda082db2f6c5855a:
        vmovaps (%rdx), %ymm0
        vandps  (%rsi), %ymm0, %ymm0
        vmovaps %ymm0, (%rdi)
        vzeroupper
        retq

example::no_avx2::h5e8ad84ba7048fcb:
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %rbx
        andq    $-32, %rsp
        subq    $96, %rsp
        movq    %rdi, %rbx
        vmovaps (%rsi), %ymm0
        vmovaps %ymm0, (%rsp)
        vmovaps (%rdx), %ymm0
        vmovaps %ymm0, 32(%rsp)
        movq    %rsp, %rsi
        leaq    32(%rsp), %rdx
        vzeroupper
        callq   core::core_arch::x86::avx2::_mm256_and_si256::hda082db2f6c5855a
        movq    %rbx, %rax
        leaq    -8(%rbp), %rsp
        popq    %rbx
        popq    %rbp
        retq

example::yes_avx2::ha14dc1d4af094302:
        movq    %rdi, %rax
        vmovaps (%rdx), %ymm0
        vandps  (%rsi), %ymm0, %ymm0
        vmovaps %ymm0, (%rdi)
        vzeroupper
        retq
```
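For context, the pattern I'd expect to be the safe way to use these intrinsics is runtime dispatch with `is_x86_feature_detected!`, keeping every intrinsic call inside the `#[target_feature]` function so nothing crosses a feature boundary. A minimal sketch of that pattern (the names `and256`, `and_avx2`, and `and_scalar` are illustrative, not from the code above; x86-64 only):

```rust
use std::arch::x86_64::*;

// AVX2 path: loads, the AND, and the store all live inside the
// `#[target_feature]` function, so no intrinsic is called from a
// context without the feature enabled.
#[target_feature(enable = "avx2")]
unsafe fn and_avx2(a: &[u64; 4], b: &[u64; 4]) -> [u64; 4] {
    let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
    let mut out = [0u64; 4];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_and_si256(va, vb));
    out
}

// Scalar fallback for CPUs without AVX2.
fn and_scalar(a: &[u64; 4], b: &[u64; 4]) -> [u64; 4] {
    [a[0] & b[0], a[1] & b[1], a[2] & b[2], a[3] & b[3]]
}

// Public entry point: pick the implementation at runtime.
pub fn and256(a: [u64; 4], b: [u64; 4]) -> [u64; 4] {
    if is_x86_feature_detected!("avx2") {
        // Sound: the check above guarantees the CPU supports AVX2.
        unsafe { and_avx2(&a, &b) }
    } else {
        and_scalar(&a, &b)
    }
}

fn main() {
    let r = and256([0xFF00_FF00; 4], [0x0F0F_0F0F; 4]);
    assert_eq!(r, [0x0F00_0F00; 4]);
}
```

This still doesn't catch the mismatch at compile time, though; the question is whether there is any way to make the `no_avx2`-style mistake a hard error rather than silent slow code.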