So I have been playing around a bit with the SIMD intrinsics and seem to have found something weird.
Debug mode is fine, but whenever I run in release mode on my Windows machine, the program almost instantly segfaults.
I have boiled it down to
println!("Setzero started");
let z = _mm256_setzero_ps();
println!("setzero done: {:?}, cast started", f32x8_to_array(z));
let i = _mm256_castps_si256(z); //<-- crash here unless
println!("cast done:"); //<--- this line is removed
println!(" {:?}", i32x8_to_array(i));
If I just remove the line println!("cast done:"); then everything works fine.
This works perfectly on my Linux machine and on the playground.
Oh, thank you very much! I had completely forgotten about adding RUSTFLAGS='-C target-feature=+avx'. Guess I should add #[cfg(all(target_arch = "x86_64", target_feature = "avx"))] to the function signature then.
Just out of curiosity, why do intrinsics compile at all without sufficient features enabled? Why would it not make sense to have the intrinsics themselves behind #[cfg]s?
Because you generally want to use runtime feature detection instead of compile time feature detection. Take a look at the docs: std::arch - Rust
Runtime feature detection means you can compile and ship portable binaries that will take advantage of CPU specific features without needing to compile specifically for that CPU.
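To make the runtime detection idea concrete, here is a minimal sketch of the dispatch pattern described in the std::arch docs. The function names (add_f32x8, add_f32x8_avx, add_f32x8_scalar) are made up for illustration and are not from this thread. Note that only plain arrays cross the boundary between the AVX function and the rest of the program.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Adds two arrays of eight f32 lanes, using AVX only when the CPU
/// running the program reports support for it.
/// (Hypothetical helper, named for this example only.)
pub fn add_f32x8(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just confirmed at runtime.
            return unsafe { add_f32x8_avx(a, b) };
        }
    }
    // Portable fallback for CPUs (or architectures) without AVX.
    add_f32x8_scalar(a, b)
}

/// Compiled with AVX code generation regardless of RUSTFLAGS.
/// The caller must guarantee that the CPU supports AVX.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add_f32x8_avx(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
    let va = _mm256_loadu_ps(a.as_ptr());
    let vb = _mm256_loadu_ps(b.as_ptr());
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(out.as_mut_ptr(), _mm256_add_ps(va, vb));
    out
}

/// Plain scalar version that works everywhere.
fn add_f32x8_scalar(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}
```

Calling add_f32x8 from ordinary code needs no unsafe and no special compiler flags; the same binary takes the AVX path on CPUs that have it and the scalar path elsewhere.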
over all your functions with avx2 stuff. Doesn't that mean that you would be able to put _mm256_add_epi64 behind a #[target_feature(enable = "avx2")] and you would get slightly more compile-time guidance, because avx2 intrinsics would only be available in those contexts?
I am sorry if I am sounding stubborn and rude. I just wish to understand the thoughts behind the design decisions so I can make good ones myself in the future.
I believe that, considering the desire for runtime detection, it should be solved in a slightly different fashion. Unfortunately this approach will require some overhaul of the build process, so I don't think that it will be implemented any time soon.
Sorry, I don't think I understand your question. The goal is to be able to compile CPU specific instructions into binaries that may run on CPUs that don't support those instructions. Programs must use CPU feature detection at runtime to dispatch between them.
What we have right now is approximately the minimum set of features needed to achieve that goal.
#[target_feature(enable = "avx2")] is not invoking conditional compilation. It is a directive applied to a function that tells the compiler to emit code with the given target feature enabled. It is then up to the caller to ensure that the target feature is enabled at runtime before calling that function. In particular, _mm256_add_epi64 is already behind #[target_feature(enable = "avx2")].
Then how am I able to compile my test with intrinsics at first without any target_feature enabled? If I understand correctly, items with #[cfg(target_feature = "...")] are only compiled if that feature is indeed enabled. Wouldn't that make the corresponding intrinsics unavailable when compiled without their features? Thus I would have expected some sort of error
error[E0425]: cannot find function `_mm256_castps_si256` in this scope
 --> foo\bar.rs:y:x
  |
y |     _mm256_castps_si256();
  |     ^^^^^^^^^^^^^^^^^^^ not found in this scope
when compiling my initial example without RUSTFLAGS='-C target-feature=+avx'
Hmm, I don't think this example should crash. I agree you ideally want avx enabled in the caller for optimal use of the intrinsic, but it shouldn't crash provided your CPU supports it. The docs don't seem to say the caller needs avx enabled.
#[cfg(target_feature = "avx2")] uses conditional compilation and compiles the tagged code only when you've told the compiler to compile the code with the avx2 target feature enabled (or with a specific CPU that is known to support it).
#[target_feature(enable = "avx2")] is not conditional compilation, but is instead a directive for telling the compiler to compile a function with a specific target feature enabled, regardless of any compile time settings.
The intrinsics are tagged with the latter, not the former, so they are always available to call, regardless of your specific CPU and regardless of compile time settings.
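A small sketch of the compile-time side of that distinction, with made-up function names (avx_enabled_at_compile_time, avx_detected_at_runtime). The first answer is baked in when the crate is built; the second is asked of the CPU the binary is actually running on.

```rust
/// Compile-time answer: true only if the crate was built with AVX
/// enabled, e.g. via RUSTFLAGS='-C target-feature=+avx'.
/// (Hypothetical helper, named for this example only.)
pub fn avx_enabled_at_compile_time() -> bool {
    cfg!(target_feature = "avx")
}

/// Runtime answer on x86/x86_64: asks the CPU the program runs on.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn avx_detected_at_runtime() -> bool {
    is_x86_feature_detected!("avx")
}

/// Conditional compilation in action: on other architectures the
/// function above does not exist at all, and this one is compiled instead.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn avx_detected_at_runtime() -> bool {
    false
}
```

Built without any RUSTFLAGS on a typical x86_64 target, the first function returns false even on a CPU that has AVX, which is why #[cfg(target_feature = "...")] alone is a poor fit for portable binaries.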
Oh ok thanks, that makes sense. But then, as @parched pointed out, shouldn't my first test work when run on an AVX-enabled CPU? Even though it might be very bad practice not to do the runtime check?
and it actually seems to have fixed the segfault. Although I am not sure why, because f32 and i32 happen to have the same bit pattern for zero, if I am not mistaken.
No, it's not clear to me that that is true. Without any compile time target features enabled and without any use of #[target_feature], your f function is not compiled for AVX2. However, it has AVX vectors in its stack frame and is calling other functions that require an AVX ABI. I'm not sure you have any guarantees here; it smells like UB to me.
Oh ok. Sorry for taking so long to get it. So even having an AVX vector and passing it around in my non-#[target_feature] function is UB. However, the insides of the intrinsics are fine because they have #[target_feature] on them.
Thank you all so much for taking the time to answer all my annoying questions and to help me with this!
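For reference, one way the original test could be restructured along these lines. This is only a sketch: zero_bits and zero_bits_avx are hypothetical names, and the stores into arrays stand in for the f32x8_to_array / i32x8_to_array helpers from the first post, whose definitions were not shown. All __m256 and __m256i values stay inside the #[target_feature] function, and only plain arrays are returned from it.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// All AVX vector values live and die inside this function, which is
/// compiled with AVX enabled. The caller must ensure AVX is available.
/// (Hypothetical helper, named for this example only.)
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn zero_bits_avx() -> ([f32; 8], [i32; 8]) {
    let z = _mm256_setzero_ps();
    // Start from non-zero contents so the stores are observable.
    let mut floats = [1.0f32; 8];
    _mm256_storeu_ps(floats.as_mut_ptr(), z);

    // Reinterpret the same 256 bits as eight 32-bit integers.
    let i = _mm256_castps_si256(z);
    let mut ints = [1i32; 8];
    _mm256_storeu_si256(ints.as_mut_ptr() as *mut __m256i, i);

    (floats, ints)
}

/// Safe wrapper: returns None when AVX is not available at runtime.
pub fn zero_bits() -> Option<([f32; 8], [i32; 8])> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just confirmed at runtime.
            return Some(unsafe { zero_bits_avx() });
        }
    }
    None
}
```

The println! calls from the original test can then work on the returned arrays in ordinary code, without any AVX value crossing into a function that was not compiled for AVX.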
Note: even with the f32 -> i32 fix mentioned earlier and no more segfaults, the function sometimes prints garbage values. So it is clearly UB.
@newpavlov I read your proposal. I cannot say that I have enough knowledge to understand all of it, but if I understand it right, it essentially makes it harder to do these kinds of bad things and easier to do the right thing? Seems nice!