Why does this (unsafe) code die with SIGILL?

I was trying to write some unsafe code to access out-of-bounds memory. Here's the code:

 fn main() {
     let a: Option<[char; 1024]> = None;
     let b = unsafe {a.unwrap_unchecked()};

     println!("{:?}", b);
 }

The idea was to run this on Rust Playground so I could see what's going on behind the scenes. That didn't work, so I figured they had some kind of antivirus running and I tried running it on my own machine (Linux, if that matters). I got the error: fish: Job 1, 'cargo run' terminated by signal SIGILL (Illegal instruction). Is there no way to do this, even with unsafe Rust?

Well, since None.unwrap_unchecked() is insta-UB, the compiler is free to replace it with anything, including an explicit ud2 instruction. On the playground in release mode it does literally that: if you select "Show Assembly", you'll see that playground::main consists of nothing but a single ud2.

14 Likes

The compiler inlined the function call into this:

fn main() {
    let a: Option<[char; 1024]> = None;
    let b = match a {
        Some(value) => value,
        None => unsafe { std::hint::unreachable_unchecked() },
    };

    println!("{:?}", b);
}

Then since it always takes the None branch, that optimizes into this:

fn main() {
    unsafe { std::hint::unreachable_unchecked() }
}
9 Likes
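
For contrast, here is a minimal sketch of a case where unwrap_unchecked is sound, because the code itself guarantees that the None branch can never be taken. The helper name first_char is hypothetical and only there for illustration; it is not from the thread.

```rust
/// Returns the first character of `s`.
/// Hypothetical helper: the name and signature are illustrative only.
fn first_char(s: &str) -> char {
    // Checked up front, so the invariant relied on below really holds.
    assert!(!s.is_empty(), "first_char requires a non-empty string");
    // SAFETY: a non-empty &str contains at least one char,
    // so `next()` returns `Some` here and the `None` branch is never reached.
    unsafe { s.chars().next().unwrap_unchecked() }
}

fn main() {
    println!("{:?}", first_char("hello"));
}
```

Here the optimizer may still delete the None branch, but that is harmless because the branch really is unreachable. In the original program the "unreachable" branch is the only one ever taken, which is why nothing sensible is left.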

You can think of UB as the runtime counterpart of a syntax error. What happens if you compile some Rust code that contains a syntax error and run it? The question doesn't make sense. If the program ever hits UB at runtime, it's not a valid Rust program, and it doesn't make sense to expect reasonable behavior from it.

Would there be any way to prevent the compiler from optimizing that out in order to see what the compiler would do in this case?

It's not a "would". This is what the compiler does. Since it's UB, it's meaningless anyway, so you shouldn't expect it to work, and you shouldn't expect any particular machine code to be generated, either.

Perhaps you wanted to see what the compiler stores in that memory? You can do that like this:

unsafe fn print_bytes<T: ?Sized>(val: &T) {
    let len = std::mem::size_of_val(val);
    println!("{} bytes", len);
    let slice = std::slice::from_raw_parts(val as *const T as *const u8, len);
    println!("{:?}", slice);
}

fn main() {
    let none: Option<[char; 4]> = None;
    unsafe {
        print_bytes(&none);
    }
}

For me it prints this:

16 bytes
[0, 0, 17, 0, 253, 127, 0, 0, 0, 208, 3, 60, 253, 127, 0, 0]

Here the compiler is making use of the fact that not all bit patterns are valid for the char type, because a char must be a Unicode scalar value. For example, the value 0x110000 is invalid:

let test = '\u{110000}';
error: invalid unicode character escape
 --> src/main.rs:2:17
  |
2 |     let test = '\u{110000}';
  |                 ^^^^^^^^^^ invalid escape
  |
  = help: unicode escape must be at most 10FFFF

The value 0x110000 corresponds to the little-endian bytes [0, 0, 17, 0]. Using this fact, the compiler stores a None as an array where the first char is 0x110000, and allows any value for the remaining three chars.
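
The arithmetic behind that can be checked in safe code. This is only a sketch: le_bytes is a hypothetical helper, and the Option size it prints is what current compilers tend to choose, not something the language promises for this type.

```rust
use std::mem::size_of;

/// Little-endian bytes of a u32 (hypothetical helper, for illustration only).
fn le_bytes(value: u32) -> [u8; 4] {
    value.to_le_bytes()
}

fn main() {
    // 0x10FFFF is the largest Unicode scalar value; 0x110000 is one past it.
    assert_eq!(char::from_u32(0x10FFFF), Some('\u{10FFFF}'));
    assert!(char::from_u32(0x110000).is_none());

    // The invalid value, laid out the way a little-endian machine stores it.
    println!("{:?}", le_bytes(0x110000)); // prints [0, 0, 17, 0]

    // If the two sizes match, the compiler found room for None inside the
    // array itself (a "niche") instead of adding a separate tag.
    println!("[char; 4]:         {} bytes", size_of::<[char; 4]>());
    println!("Option<[char; 4]>: {} bytes", size_of::<Option<[char; 4]>>());
}
```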

5 Likes

Note, however, that this last code is UB too, since padding bytes (i.e. anything other than the discriminant, in this case) are uninitialized, and reading uninitialized bytes is UB.
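
If the goal is just to look at a niche without touching uninitialized bytes, one option is a type whose Option layout is documented. As far as I understand the standard library docs, Option<NonZeroU32> is guaranteed to have the same size as u32, with None stored as zero, so every byte is initialized. A minimal sketch under that assumption (option_bits is a hypothetical name):

```rust
use std::num::NonZeroU32;

/// Returns the raw bits of an Option<NonZeroU32>.
/// Hypothetical helper, for illustration only.
fn option_bits(value: Option<NonZeroU32>) -> u32 {
    // SAFETY (assumption stated above): Option<NonZeroU32> is documented to
    // have the same size and layout as u32, with None represented as 0,
    // so there are no padding or uninitialized bytes to read.
    unsafe { std::mem::transmute::<Option<NonZeroU32>, u32>(value) }
}

fn main() {
    println!("None    -> {}", option_bits(None));
    println!("Some(7) -> {}", option_bits(NonZeroU32::new(7)));
}
```

This does not carry over to Option<[char; 4]>, whose layout is an implementation detail and whose trailing bytes are not initialized for None.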

2 Likes

I like to think of undefined behavior as being like division by zero in math: Once it happens, anything can follow, including madness such as proving that 1 = 2. Asking what would happen if you could divide by zero without madness ensuing doesn't make sense, as the madness is what division by zero does. Same with undefined behavior.

2 Likes

No. All optimisations assume, unconditionally, that your program never triggers UB. Even the ones which happen with -O0.

It's your responsibility to ensure that your program never triggers UB. Sometimes you may write code that is too clever for the compiler to recognize as UB (like @alice did), but you should always remember that you are playing with fire when you do that.

It may be an OK way to investigate how the compiler works on some examples, but keep in mind that tomorrow the compiler may become more clever and figure out that you are triggering UB. That's why @alice said "for me" it does this: tomorrow, for you, it may do something else entirely.

10 Likes

It's not useful to think of optimization as a separate process from compilation, these days. Translation, dead code elimination, equivalent sequence substitution, reordering, and other processes all proceed hand in hand. It's better to think of the different -O flags as different translation strategies, rather than as increasing or decreasing "optimal" code generation.

So long as the compiler produces a program that is equivalent to the input, in the target language, it has done its job and the result is "what the compiler would do," even if it's not what a rote statement by statement translation would be. Your input program triggers undefined behaviour. The entire compile process is designed with the assumption that that doesn't happen, and it's the one situation where the compiler is excused from generating a program with equivalent semantics (since there aren't any). There isn't really a way to inspect what the compiler will do with undefined behaviour under any assumption that requires the UB to have definite results, since UB, very explicitly, does not have definite results.

The specific UB your program contains happens to be detectable at compile time, so in principle the compiler could reject it. That's a quality-of-implementation issue rather than a correctness issue, though, and not every kind of UB can be detected that way.

4 Likes

One way is to explicitly block inlining, to hide the insta-UB from the optimizer:

#[inline(never)]
fn q(a: Option<[char; 1024]>) {
    let b = unsafe { a.unwrap_unchecked() };
    println!("{:?}", b);
}

fn main() {
    let a: Option<[char; 1024]> = None;
    q(a);
}
Standard Error

   Compiling playground v0.0.1 (/playground)
    Finished release [optimized] target(s) in 1.63s
     Running `target/release/playground`
thread 'main' panicked at 'index out of bounds: the len is 32 but the index is 32', library/core/src/unicode/unicode_data.rs:75:40
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Standard Output

['\u{110000}', '\0', '\u{310}', '\0', '\u{310}', '\0', '\u{8}', '\0', '\u{3}', '\u{4}', '\u{1be6a0}', '\0', '\u{1be6a0}', '\0', '\u{1be6a0}', '\0', '\u{1c}', '\0', '\u{1c}', '\0', '\u{10}', '\0', '\u{1}', '\u{4}', '\0', '\0', '\0', '\0', '\0', '\0', '𡓨', '\0', '𡓨', '\0', 'က', '\0', '\u{1}', '\u{5}', '𢀀', '\0', '𢀀', '\0', '𢀀', '\0', '\u{177624}', '\0', '\u{177624}', '\0', 'က', '\0', '\u{1}', '\u{4}', '\u{19a000}', '\0', '\u{19a000}', '\0', '\u{19a000}', '\0', '\u{4d2c4}', '\0', '\u{4d2c4}', '\0', 'က', '\0', '\u{1}', '\u{6}', '\u{1e7788}', '\0', '\u{1e8788}', '\0', '\u{1e8788}', '\0', '倘', '\0', '軘', '\0', 'က', '\0', '\u{2}', '\u{6}', '\u{1eab80}', '\0', '\u{1ebb80}', '\0', '\u{1ebb80}', '\0', 'Ǡ', '\0', 'Ǡ', '\0', '\u{8}', '\0', '\u{4}', '\u{4}', '\u{350}', '\0', '\u{350}', '\0', '\u{350}', '\0', ' ', '\0', ' ', '\0', '\u{8}', '\0', '\u{4}', '\u{4}', 'Ͱ', '\0', 'Ͱ', '\0', 'Ͱ', '\0', 'D', '\0', 'D', '\0', '\u{4}', '\0', '\u{7}', '\u{4}', '\u{1e7788}', '\0', '\u{1e8788}', '\0', '\u{1e8788}', '\0', '\u{10}', '\0', '\u{90}', '\0', '\u{8}', '\0', '\u{6474e553}', '\u{4}', '\u{350}', '\0', '\u{350}', '\0', '\u{350}', '\0', ' ', '\0', ' ', '\0', '\u{8}', '\0', '\u{6474e550}', '\u{4}', '\u{1be6bc}', '\0', '\u{1be6bc}', '\0', '\u{1be6bc}', '\0', '廔', '\0', '廔', '\0', '\u{4}', '\0', '\u{6474e551}', '\u{6}', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\u{10}', '\0', '\u{6474e552}', '\u{4}', '\u{1e7788}', '\0', '\u{1e8788}', '\0', '\u{1e8788}', '\0', '㡸', '\0', '㡸', '\0', '\u{1}', '\0', '\u{b12d7120}', '翽', '\u{eb528e97}', '罼', '\u{1e7788}', '\0', '\u{1e8788}', '\0', '\u{eb4d7660}', '罼', '\u{10}', '\u{3}', '
1 Like

Even with #[inline(never)], LLVM can propagate argument values into the function if all calls use the same value. This would unmask the UB again. If you change the type to Option<[char; 1]>, this is exactly what LLVM does: Compiler Explorer. The saving grace in the case of Option<[char; 1024]> is that it is passed by reference in the Rust ABI, and LLVM doesn't know that it is OK to change the address of a. If the Rust ABI were changed to pass this type by value, you would again get a ud2 instruction. Such a change would be completely acceptable.

1 Like