I was playing around with computing some CRC checksums using the x86 CRC32 instruction, and I noticed something that surprised me. The compiler does not seem to inline calls to the _mm_crc32_u8 intrinsic.
This strikes me as odd, as these are things that tend to be in hot loops, so it seems like inlining would be ideal.
What’s the reason for this behavior?
the documentation of the intrinsic says it is only available, and only safe to call, with the target feature sse4.2, which is not enabled for the default x86_64 target.
in short, you have two options:
annotate the function that calls the intrinsic with a #[target_feature(enable = "sse4.2")] attribute, e.g.:
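For illustration, here is a minimal sketch of what such an annotation might look like for the CRC case. The names `crc32c_update_sse42` and `crc32c_hw` are made up for this example, and the annotated function is declared `unsafe fn` on the assumption that the code should also build on older toolchains. The whole byte loop lives inside the annotated function, so the intrinsic and its caller share the same target features and the backend should be free to inline it.

```rust
// Hypothetical helper: the hot loop lives inside a function that has
// sse4.2 enabled, so the intrinsic has the same features as its caller.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "sse4.2")]
unsafe fn crc32c_update_sse42(mut crc: u32, data: &[u8]) -> u32 {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::_mm_crc32_u8;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::_mm_crc32_u8;

    for &byte in data {
        crc = _mm_crc32_u8(crc, byte);
    }
    crc
}

/// Hypothetical safe entry point: returns `None` when SSE4.2 is not
/// available (or when not building for x86 at all).
pub fn crc32c_hw(data: &[u8]) -> Option<u32> {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("sse4.2") {
            // SAFETY: the runtime check above confirmed SSE4.2 support.
            let crc = unsafe { crc32c_update_sse42(!0, data) };
            // CRC-32C convention: start from all ones, invert at the end.
            return Some(!crc);
        }
    }
    None
}
```

The safe wrapper is only there so the sketch can be called without `unsafe` at the call site; the part relevant to inlining is the attribute on the inner function.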
that's how the #[target_feature(enable = "xxx")] attribute works. it is a per-function attribute which affects which features are enabled for the backend during codegen. however, calling such a function requires unsafe if the caller does not have the same target features enabled, and will result in UB at runtime if the cpu doesn't support the required feature.
note, it is different from a conditional compilation guard, which looks like #[cfg(target_feature = "yyy")]. I explained the difference in this previous post:
That's not the issue though. In our case there is the unsafe marker and the CPU does support the instruction. The problem is that the call isn't inlined and that the documentation seems wrong:
Available on (x86 or x86-64) and target feature sse4.2 and x86-64 only.
It says the function is not "available" which suggests the compiler will prevent you from calling it, and yet it doesn't. Perhaps the documentation should be changed from "available on [...] only" to something like "the function may cause UB unless [...] and will not get inlined unless [...]".
your cpu does support that, but you didn't tell rustc what your cpu is.
out of the box, rustc is configured conservatively [1] with the default target cpu for maximum compatibility.
if users want their program to be compiled for a more specific cpu than the default one, it is up to the users to set up the correct compiler flags (or to annotate the source code with proper attributes).
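To make the distinction concrete, here is a small sketch that reports both what rustc was told at compile time and what the cpu reports at runtime. The function name `sse42_status` is hypothetical. On a default x86_64 build running on a recent cpu, one would expect the first value to be false and the second to be true.

```rust
/// Hypothetical helper: returns (enabled at compile time, detected at runtime)
/// for the sse4.2 target feature.
pub fn sse42_status() -> (bool, bool) {
    // True only if rustc was told the target has SSE4.2, for example via
    // `-C target-feature=+sse4.2` or a suitable `-C target-cpu=...`.
    let compile_time = cfg!(target_feature = "sse4.2");

    // Asks the cpu itself; this macro only exists on x86 targets.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    let run_time = is_x86_feature_detected!("sse4.2");
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    let run_time = false;

    (compile_time, run_time)
}
```

Only the compile-time value influences what the backend is allowed to assume when generating code for ordinary, unannotated functions.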
the fact the call is not inlined is a consequence of the caller not having the required target feature enabled. I may be wrong, so don't quote me on this, but I think this is probably a limitation imposed by the LLVM codegen backend, not by rustc.
I agree the wording of the documentation can be improved. or at least, it should include a link to the documentation about the #[target_feature] attribute to give users more context.
[1] personally, I would say it's overly conservative, but it is what it is for now
Did a quick test. This is indeed what’s going on. If you go to the link I provided in the original post, and add -C target-cpu=znver5 (which supports all AVX extensions), then everything gets inlined.
It's odd that the compiler does not error out when code uses unsupported features, and instead just does not inline. What's the reason for that?
I would say historic reasons. Intrinsics support in the language is quite important for various areas and many wanted to see it ASAP, which has resulted in stabilization of a somewhat un-Rusty feature. I would love to see something like this, but even the relatively modest target features v1.1 proposal took 7+ years to implement and stabilize, so I don't have high hopes for significant improvements in this area in the near future.
Yep, LLVM is unable to inline functions with different target features.
you can detect target features either at compile time or at runtime, and there are legitimate uses for both. let's take a hypothetical example.
suppose we have some algorithm foo, which can take advantage of special cpu instructions if available, but we also want our software to work on hardware that lacks the special features, so we also have a slower version as a fallback.
there are two ways we can deal with such a use case:
select which version to call at compile time using conditional compilation:
#[cfg(target_feature = "xxx")]
fn foo() {
    // probably use intrinsics or inline assembly
}

#[cfg(not(target_feature = "xxx"))]
fn foo() {
    // no hardware acceleration, software emulated
}

fn main() {
    // note we don't need `unsafe`
    foo();
}
pros:
smaller binary, no runtime overhead;
cons:
need to produce multiple variants of binary files, and may crash at runtime if distributed to incompatible systems.
include both versions in the program and select the suitable one to call at runtime:
#[target_feature(enable = "xxx")]
fn foo_accelerated() {
    // probably use intrinsics or inline assembly
}

fn foo_fallback() {
    // no hardware acceleration, software emulated
}

fn main() {
    // `is_x86_feature_detected!` is a macro, so it needs the `!`
    if is_x86_feature_detected!("xxx") {
        // even if the function itself is not `unsafe fn`, still need `unsafe` to call,
        // because it's UB if the feature is actually unavailable
        // SAFETY: guarded with runtime feature detection based on `CPUID`
        unsafe { foo_accelerated() };
    } else {
        foo_fallback();
    }
}
pros:
a single binary can be distributed to, and is compatible with, multiple different cpu models
cons:
binary size is inevitably larger, (in theory) may have some runtime overhead
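As a concrete instance of the runtime-selection approach, here is a sketch applied to the CRC-32C checksum that the CRC32 instruction computes. All function names are hypothetical, the accelerated function is written as `unsafe fn` on the assumption that older toolchains should be supported, and the fallback is a plain bit-by-bit loop chosen for brevity rather than speed.

```rust
/// Portable bitwise CRC-32C (Castagnoli) update, used as the fallback.
fn crc32c_update_fallback(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // 0x82F6_3B78 is the bit-reflected Castagnoli polynomial.
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    crc
}

/// Hardware version: the loop sits inside the annotated function, so the
/// intrinsic is called from code with the same target features.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "sse4.2")]
unsafe fn crc32c_update_accelerated(mut crc: u32, data: &[u8]) -> u32 {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::_mm_crc32_u8;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::_mm_crc32_u8;

    for &byte in data {
        crc = _mm_crc32_u8(crc, byte);
    }
    crc
}

/// Picks the accelerated version when the cpu supports it, else the fallback.
pub fn crc32c(data: &[u8]) -> u32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("sse4.2") {
            // SAFETY: guarded by the runtime feature check above.
            let crc = unsafe { crc32c_update_accelerated(!0, data) };
            return !crc;
        }
    }
    // CRC-32C convention: start from all ones, invert at the end.
    !crc32c_update_fallback(!0, data)
}
```

The feature check happens once per call to `crc32c`, not once per byte, which keeps any detection overhead out of the hot loop.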
as for the reason why the call to a #[target_feature()] annotated function is not inlined, it's a limitation of LLVM. I'm pretty sure there are technical reasons, but I don't know LLVM well enough to say more. from the perspective of rustc, despite the requirement of an unsafe block, it's no different from a regular function call: it just emits a regular call operation in LLVM IR.
if you want a compile time error for unavailable intrinsics, the standard library would need to be conditionally compiled (i.e. #[cfg(target_feature = "xxx")] instead of #[target_feature(enable = "xxx")] on the intrinsics). but the problem is, rust would then have to ship gazillions of prebuilt standard libraries for different target feature sets, which would lead to a combinatorial explosion (imagine libcore-sse, libcore-sse2, libcore-avx, libcore-avx2, libcore-sse,avx, libcore-sse2,avx, and so on and so on); or we would have to give up pre-built libraries and just compile the standard library from source every time.