How to get good code-gen with core::arch intrinsics?

what's the point of having intrinsics at all then?
with precise, i also mean: more precise than core::simd, because imo that's unusable for high performance code.
you know, the kind of precise that you'd expect from a reliable compiler. no nasty surprises in the disassembly.

They are much safer to work with and don't block optimizations. The compiler can't peek into inline assembly, so it can't do things like constant propagation through it, and it has to pessimize other optimizations because it can't make assumptions about what the inline assembly does. The compiler is also much better at scheduling instructions than end users most of the time, especially across multiple CPUs.


yes and that's why i don't use inline assembly.
because i do want constant folding, common sub-expression elimination, etc. never mind abstraction across multiple platforms.

core::simd and even core::arch have just been full of performance pitfalls, which really is quite frustrating.
like generating a call for unsupported intrinsics (which is easily 5-10x slower) instead of raising a compile error. to me that seems like a really weird decision, which probably wasn't thought about a lot.
you don't use intrinsics, unless you care about performance & control, so maybe the compiler should prioritize those things when it comes to intrinsics.

we don't really disagree here. i want to use intrinsics, because then i can have the compiler do more of the heavy lifting. but the rustc team and i have a very different understanding of what "intrinsic" means, apparently.

core::arch is a 1-to-1 copy of the equivalent C vendor intrinsics by design. Internally we just forward to the equivalent LLVM intrinsics that clang also uses for implementing these C vendor intrinsics. Some of which are emulated using the available instructions if the optimal instruction is not available due to a target feature not being enabled. We don't have any control over this.


like i said, this is terrible for performance reliability. and imo definitely misses the point of using intrinsics in the first place.
but i digress. let's focus on the factual matter:


1

clang is much saner and raises an error when using unsupported intrinsics: Compiler Explorer
using _mm256_add_ps on a non-avx platform raises: "error: always_inline function '_mm256_add_ps' requires target feature 'avx', but would be inlined into function 'foo' that is compiled without support for 'avx'"


2

this is not true: Rust Playground
a) showing "llvm-ir" in release, you can clearly see that rustc generates an emulation function for _mm256_add_ps for the avx fn.

; core::core_arch::x86::avx::_mm256_add_ps
; Function Attrs: inlinehint mustprogress nofree norecurse nosync nounwind nonlazybind uwtable willreturn
define internal fastcc void @_ZN4core9core_arch3x863avx13_mm256_add_ps17hc15a62e1c651e524E(<8 x float>* noalias nocapture noundef writeonly dereferenceable(32) %0, <8 x float>* noalias nocapture noundef readonly dereferenceable(32) %a, <8 x float>* noalias nocapture noundef readonly dereferenceable(32) %b) unnamed_addr #0 {
start:
  %_3 = load <8 x float>, <8 x float>* %a, align 32
  %_4 = load <8 x float>, <8 x float>* %b, align 32
  %1 = fadd <8 x float> %_3, %_4
  store <8 x float> %1, <8 x float>* %0, align 32
  ret void
}

b) furthermore, the f32x2 function shows that rustc doesn't map directly to intrinsics. _mm_add_ps gets mapped to fadd <2 x float>. note the 2 x - _mm_add_ps operates on four floats! this only happens because i've added enough stuff for the compiler to understand that this is only a two wide addition. if i write it slightly differently, it generates a 4 wide add (fadd <4 x float>), but still no _mm_add_ps.

c) but rustc can map directly to llvm intrinsics: _mm_min_ps in sane maps to @llvm.x86.sse.min.ps.
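A minimal sketch for reproducing points b) and c), assuming an ordinary x86_64 target where SSE is part of the baseline. The helper names `add4` and `min4` are made up for illustration and are not from the playground link. Per the observations above, the add would be expected to show up as a generic `fadd` in the LLVM IR while the min goes through `@llvm.x86.sse.min.ps`, but the exact IR may differ between rustc and stdarch versions.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{_mm_add_ps, _mm_loadu_ps, _mm_min_ps, _mm_storeu_ps};

/// Lane-wise add of four floats (hypothetical helper).
#[cfg(target_arch = "x86_64")]
pub fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    // SAFETY: SSE is assumed to be in the target baseline, and the unaligned
    // load/store intrinsics only touch the 16 bytes of each array.
    unsafe {
        let sum = _mm_add_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
        _mm_storeu_ps(out.as_mut_ptr(), sum);
    }
    out
}

/// Lane-wise minimum of four floats (hypothetical helper).
#[cfg(target_arch = "x86_64")]
pub fn min4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    // SAFETY: same reasoning as in `add4`.
    unsafe {
        let min = _mm_min_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
        _mm_storeu_ps(out.as_mut_ptr(), min);
    }
    out
}

// Scalar stand-ins so the sketch still compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    core::array::from_fn(|i| a[i] + b[i])
}

#[cfg(not(target_arch = "x86_64"))]
pub fn min4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    core::array::from_fn(|i| if a[i] < b[i] { a[i] } else { b[i] })
}
```

Comparing the LLVM IR for `add4` and `min4` in release mode should make the difference between the two lowering paths visible.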


to me this suggests that you do very much have control over this.

actually, debug mode in the playground is even more insightful: it looks like every "intrinsic" gets a function.
those functions seem to determine whether llvm sse intrinsics or llvm vector intrinsics are used.
and the _mm_add_ps wrapper is actually 4 wide. -- but doesn't use sse's add_ps!
so rustc's release mode inlines those "functions" and optimizes them.

unless "llvm-ir" doesn't mean rustc's output, but rather means "llvm's output after the last optimization pass".
this seems unlikely though. i'd expect rustc to do vectorization at mir level, because it has way more knowledge of the program than llvm. but i haven't looked at the compiler's source code yet.

no, rustc doesn't appear to do optimization. or at least that's not what's causing these "things".
here are some things i've found:

  • the functions in the llvm-ir come from the definitions in stdarch. each "intrinsic" is actually a function, which wraps an llvm intrinsic.
  • the functions are marked as #[inline], not #[inline(always)], which explains why the emulation functions don't get inlined.
  • _mm_add_ps has the odd behavior of mapping to 2 x f32, because the wrapper links to the simd_add platform intrinsic, which presumably lowers to llvm's addition somehow.

and the interesting part:

  • the "normal" intrinsics map directly to llvm.x86.sse.*

so it appears to be "true" that you don't have control over the emulation. but i find that very hard to believe.
i'd rather expect this to be an optional feature by llvm, which has been enabled to get soft floats on non-fpu platforms, popcnt on non-popcnt platforms, etc.

rustc does very little optimization at the MIR level.

Yes, when you ask rustc to output LLVM ir, you're getting the ir after LLVM has done its passes on it. There isn't currently a way to ask the playground to emit the LLVM ir that rustc passes to LLVM directly. (I think there's a flag to rustc to dump LLVM ir that does do this, though.)


Additionally, IIRC using Rust's intrinsics when they aren't available on the target CPU is just UB. Calling a function with a different set of target features enabled is just UB.


ok, here's the summary:

  • the f32x2 problem:
    • _mm_undefined_pd seems like a safe solution for x86_64.
    • aarch64 apparently has two wide vectors.
    • wasm will have to live with the MaybeUninit UB solution, i guess.
    • i don't target anything else.
  • the intrinsics problem:
    • it's a bit sad that using unsupported intrinsics is "just UB", because it's very easy to forget to specify the target arch. maybe at some point, rustc can have error messages for this problem like clang.

just looked more closely at those "emulation functions".
turns out, they're not actually emulating the instruction.
they are just the wrapper function from core::arch::x86_64, which doesn't get inlined when you don't specify the target. - which is fine, because it's UB, but still kinda weird.

Remember that rustc can only emit what LLVM actually defines. There's no _mm256_add_ps in the LLVM Language Reference Manual (LLVM 16.0.0git documentation), so rustc can't emit that directly in the IR.

This is an extremely common pattern. For example, x.is_power_of_two() on a NonZero type shows up in LLVM IR as x.count_ones() < 2, but then gets emitted using BLSR -- not POPCNT.
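A small sketch of the two forms mentioned above, for anyone who wants to compare the IR and the assembly themselves. The function names are hypothetical. Which instruction the backend finally picks depends on the target and the LLVM version, so no particular disassembly is claimed here.

```rust
use std::num::NonZeroU32;

/// The library method on a NonZero type.
pub fn is_pow2(x: NonZeroU32) -> bool {
    x.is_power_of_two()
}

/// The population-count formulation described above:
/// a non-zero value is a power of two iff it has fewer than two set bits.
pub fn is_pow2_popcount(x: NonZeroU32) -> bool {
    x.get().count_ones() < 2
}
```

Both functions return the same result for every non-zero input, so any difference in the generated code comes from how LLVM chooses to lower them.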

If LLVM assembly isn't turning out how you want, file an issue on GitHub.


@scottmcm @bjorn3

ok, i figured out what's going on with the intrinsics:

  • the "emulation function" is in fact just the wrapper from core::arch::x86::sse3:
#[inline]
#[target_feature(enable = "sse3")]
pub unsafe fn _mm_hadd_ps(a: __m128, b: __m128) -> __m128 {
    haddps(a, b)
}
  • as you can see, it is #[inline], not #[inline(always)], as you might expect. that's because #[target_feature] fns can't be inline always (because that would make the callers #[target_feature] too, which is probably unsound).
  • that means when you call the wrapper from a non-target-feature function, llvm can't inline it, because the caller doesn't have the target feature!
  • so the observed behavior is the correct behavior, but not the desired behavior.
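The usual workaround that follows from the points above is to put `#[target_feature]` on the calling function too, so the caller and the wrapper agree on features and LLVM is allowed to inline. A minimal sketch, assuming x86_64 and std. `hsum_sse3` and `hsum` are hypothetical names, and the runtime check is only there so the safe entry point never runs SSE3 code on a CPU without it.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{_mm_cvtss_f32, _mm_hadd_ps, _mm_loadu_ps};

/// Horizontal sum of four floats. This function enables `sse3` itself,
/// so the `_mm_hadd_ps` wrapper can be inlined into it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse3")]
unsafe fn hsum_sse3(v: &[f32; 4]) -> f32 {
    // SAFETY: the caller guarantees sse3 is available; the load reads
    // exactly the 16 bytes of `v`.
    unsafe {
        let x = _mm_loadu_ps(v.as_ptr());
        let pairs = _mm_hadd_ps(x, x); // [a+b, c+d, a+b, c+d]
        let total = _mm_hadd_ps(pairs, pairs); // every lane holds a+b+c+d
        _mm_cvtss_f32(total)
    }
}

/// Safe entry point: uses the SSE3 kernel when the CPU has it,
/// otherwise falls back to a scalar sum.
pub fn hsum(v: &[f32; 4]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse3") {
            // SAFETY: sse3 was detected at run time just above.
            return unsafe { hsum_sse3(v) };
        }
    }
    v.iter().sum()
}
```

Calling `_mm_hadd_ps` directly from `hsum` instead would reproduce the non-inlined wrapper call described in the list above.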

i tried to fix that, but wasn't able to. i thought, maybe making it #[inline(always)] will trigger a compile error for invalid callers.
but since that's incompatible with target-feature, i had to turn target-feature into #[cfg(target_feature = "sse3")]
and then the problem was that _mm_hadd_ps wasn't visible anymore, because the core::arch module (my test module actually) wasn't compiled with the target feature.

so i'd say this is a flaw in the language.
my suggestion would be to allow #[inline(always)] on #[target_feature] fns, but require all target features to also be enabled in the caller.
you just move the "unsafe to call" back a level into the user functions (or require #[cfg] or crate level target features) -- just like clang does it!

  • this would raise compile time errors when using intrinsics in incompatible callers.
  • having to put #[target_feature] on your own fns makes it more likely that you'll think about the correctness of calling the fn.
  • you can use #[cfg] on user fns and get completely safe use of (simd) intrinsics. -- basically removing the need for the safe_arch crate.
  • you can still call "statically incompatible" (non-inline-always) functions at runtime, based on dynamic feature detection.

should i post a discussion on the rust internals forum? How to make core::arch simd intrinsics safe: - language design - Rust Internals

That makes runtime target feature detection impossible.

this example demonstrates how static AND dynamic dispatch work with the "clang solution"
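The linked example is not reproduced here. As a rough sketch of how static and dynamic dispatch are commonly combined in Rust as it exists today (this is not the proposed `#[inline(always)]` semantics), assuming x86_64 and std, with hypothetical names `add8`, `add8_avx` and `add8_scalar`:

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{_mm256_add_ps, _mm256_loadu_ps, _mm256_storeu_ps};

/// AVX kernel: the whole body is compiled with `avx` enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add8_avx(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    // SAFETY: the caller guarantees avx; loads and the store stay within
    // the 32 bytes of each array.
    unsafe {
        let sum = _mm256_add_ps(_mm256_loadu_ps(a.as_ptr()), _mm256_loadu_ps(b.as_ptr()));
        _mm256_storeu_ps(out.as_mut_ptr(), sum);
    }
    out
}

/// Portable fallback.
fn add8_scalar(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    core::array::from_fn(|i| a[i] + b[i])
}

#[allow(unreachable_code)]
pub fn add8(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    // Static dispatch: avx was enabled for the whole compilation
    // (for example via -C target-feature=+avx), so no runtime check is needed.
    #[cfg(all(target_arch = "x86_64", target_feature = "avx"))]
    {
        // SAFETY: avx is statically enabled for this build.
        return unsafe { add8_avx(a, b) };
    }
    // Dynamic dispatch: ask the CPU at run time.
    #[cfg(all(target_arch = "x86_64", not(target_feature = "avx")))]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: avx was detected at run time just above.
            return unsafe { add8_avx(a, b) };
        }
    }
    add8_scalar(a, b)
}
```

As far as I know, `is_x86_feature_detected!` already folds to `true` when the feature is statically enabled, so the explicit static branch is mostly there to make the two cases visible.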
