Random segfaults using simd intrinsics

Hello!

So I have been playing around a bit with the simd intrinsics and seem to have found something weird.

Debug mode is fine but whenever I run in release mode on my windows machine, the program almost instantly segfaults.

I have boiled it down to

println!("Setzero started");
let z = _mm256_setzero_ps();
println!("setzero done: {:?}, cast started", f32x8_to_array(z));
let i = _mm256_castps_si256(z);                     //<-- crash here unless
println!("cast done:");                             //<--- this line is removed
println!("   {:?}", i32x8_to_array(i));

if i just remove the line println!("cast done:"); then everything works fine.
This works perfectly on my linux machine and on playground

$ rustc --version
rustc 1.29.1 (b801ae664 2018-09-20)

$ rustc --version
rustc 1.31.0-nightly (77af31408 2018-10-11)

Windows machine with an Intel i5 3427U CPU which is supposed to support AVX

Is this most likely something wrong with my setup or could it be a compiler bug or is it just me doing something terribly wrong promming-wise?

Edit: oh sorry, forgot to mention the "crash" is a segfault

There was an issue I remember and it helped to add

#[cfg(target_arch = "x86_64")]

to the top of the function. Give it a try

Thank you but it does not seem to make any difference.

I put it on top on f() like

#[cfg(target_arch = "x86_64")]
fn f(){

still crashes unless println!("cast done:"); is removed

You're going to want to give this section of the docs a pretty thorough read: std::arch - Rust

In particular, I don't see any use of #[target_feature(enable = "avx2")], nor have you told us whether you're compiling with target features enabled.

you i32x8_to_array function is wrong btw (not sure if that is the problem?!)

unsafe fn i32x8_to_array(x: __m256i) -> [f32; 8] 

should be

unsafe fn i32x8_to_array(x: __m256i) -> [i32; 8]

(i32 instead of f32)

Oh thank you very much! I had completely forgot about adding RUSTFLAGS='-C target-feature=+avx' . Guess I should add #[cfg(all(target_arch = "x86_64", target_feature = "avx"))] to the function signature then.

Oh nice catch! The dangers of copy paste :stuck_out_tongue: .

Just out of curiosity, why does intrisics compile at all without sufficient features enabled? Why would it not make sense to have the intrinsics them selves behind #[cfg] s?

1 Like

Because you generally want to use runtime feature detection instead of compile time feature detection. Take a look at the docs: std::arch - Rust

Runtime feature detection means you can compile and ship portable binaries that will take advantage of CPU specific features without needing to compile specifically for that CPU.

1 Like

Correct me if I am wrong but if you, like in the example, put

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]

over all your functions with avx2 stuff. Does not that mean that you would be able to put _mm256_add_epi64 behind a #[target_feature(enable = "avx2")] and you would get slightly more compiletime guidance because avx2 intrinsics would only be available in those contexts?

I am sorry if I am sounding stubborn and rude. Just wish to understand the thoughts behind the design desicions so I can make good ones myself in the future :slight_smile:

I believe that considering desire for runtime detection it should be solved in a slightly different fashion, unfortunately this approach will require some overhaul of the building process, so I don't think that it will be implemented any time soon.

Sorry, I don't think I understand your question. The goal is to be able to compile CPU specific instructions into binaries that may run on CPUs that don't support those instructions. Programs must use CPU feature detection at runtime to dispatch between them.

What we have right now is approximately the minimum set of features needed to achieve that goal.

#[target_feature(enable = "avx2")] is not invoking conditional compilation. It is a directive applied to a function that tells the compiler to emit code with the given target feature enabled. It is then up to the caller to ensure that the target feature is enabled at runtime before calling that function. In particular, _mm256_add_epi64 is already behind #[target_feature(enable = "avx2")].

Then how am I able to compile my test with intrinsics at first without any target_feature enabled? If I understand correctly, items with #[cfg(target_feature = "...")] are only compiled if that feature is indeed enabled. Wouldn't that make the corresponding intrinsics unavailable when compiled without their features? Thus I would have expected some sort of error

error[E0425]: cannot find function `_mm256_castps_si256` in this scope
  --> foo\bar.rs:y:x
   |
y  |     _mm256_castps_si256();
   |     ^^^^^^^^^^^^^^^^^^^ not found in this scope

when compiling my initial example without RUSTFLAGS='-C target-feature=+avx'

I hope this clarifies my question :slight_smile: .

Hmm, I don't think this example should crash. I agree you ideally want avx enabled in the caller for optimal use of the intrinsic, but it shouldn't crash provided your cpu supports it. The docs dont seem to say the caller needs avx enabled.

#[cfg(target_feature = "avx2")] uses conditional compilation and compiles the tagged code only when you've told the compiler to compile the code with the avx2 target feature enabled (or with a specific CPU that is known to support it).

#[target_feature(enable = "avx2")] is not conditional compilation, but is instead a directive for telling the compiler to compile a function with a specific target feature enabled, regardless of any compile time settings.

The intrinsics are tagged with the latter, not the former, so they are always available to call, regardless of your specific CPU and regardless of compile time settings.

Oh ok thanks, that makes sense. But then as @parched pointed out, shouldn't my first test work when run an AVX enabled CPU? Even though it might be very bad practice not to do the runtime check?

I finally fixed the type error

and it actually seems to have fixed to segfault. Although I am not sure why beacause f32 and i32 happens to have the same bit pattern for zero if I am not mistaken.

No, it's not clear to me that that is true. Without any compile time target features enabled and without any use of #[target_feature], your f function is not compiled for AVX2. However, it has AVX vectors in its stack frame and is calling other functions that require an AVX ABI. I'm not sure you have any guarantees here; it smells like UB to me.

Oh ok. Sorry for taking so long to get it. So even having an AVX vector and passing it arround in my non #[target_feature] is UB. However the insides of the intrinsics are fine bacause they have #[target_feature] on them.

Thank you all so much for taking the time answer all my annoying questions and to help me with this!:slight_smile:

Note: even with the f32 -> i32 fix mentioned earlier and no more segfaults, the function sometimes prints garbage values. So it is clearly UB.

1 Like

@newpavlov I read your proposal, I can not say that have enough knowledge to understand all of it. But if I understand it right it essentially makes it harder to do this kind of bad things? And easier to do the right thing? Seems nice!