Just how unsafe/undefined is reading arbitrary data?

I was wondering just how unsafe the following code really is:

fn get_raw_bytes<T: Sized>(value: &T) -> &[u8] {
    let length = std::mem::size_of::<T>();
    let bytes: &[u8] = unsafe {
        std::slice::from_raw_parts(value as *const _ as *const u8, length)
    };
    // println!("dump: {:02x?}", bytes);
    bytes
}

Of course, its purpose is to let me dump random blobs of data so I can figure out what's going on (typically aimed at C FFI stuff). From a C programmer's perspective, there's obviously nothing that can go wrong (I hear a voice from Half-Life here), but I'm aware of a lot of discussion of the evils of reading from uninitialised data leading to "undefined behaviour".

So my question: just how dangerous can this code be? Just what possibly can go wrong?

  1. When you invoke undefined behaviour, anything can go wrong. It is that simple.
  2. This is always UB if T has any padding bytes.
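To make "padding bytes" concrete, here is a minimal sketch with a hypothetical `Padded` type (not from the original post). It assumes a typical target where `u32` is 4-byte aligned, so `repr(C)` layout leaves a gap after the first field:

```rust
use std::mem::size_of;

// Hypothetical type for illustration; `repr(C)` fixes the field order and layout.
#[repr(C)]
pub struct Padded {
    pub tag: u8, // 1 byte at offset 0
    // 3 padding bytes sit here so that `value` is 4-byte aligned
    // (assumption: `u32` has alignment 4 on the target).
    pub value: u32, // 4 bytes at offset 4
}

/// Number of bytes in `Padded` that belong to no field.
pub fn padding_bytes() -> usize {
    size_of::<Padded>() - (size_of::<u8>() + size_of::<u32>())
}
```

Calling `get_raw_bytes` on a `&Padded` would produce a `&[u8]` that covers those gap bytes, which is the case point 2 is about.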

I now wonder, similar to how criterion has a black_box function where the compiler can't optimize away stuff, could we make a black_box function which doesn't let the compiler inspect the function, but we're actually just returning a pointer to bytes?

After all, if the UB-ness of reading padding bytes is the fact that the compiler knows they're uninitialized, then couldn't we just turn around and tell the compiler that this is a completely random *const u8 that we managed to pull out of thin air and that we just so happen to read std::mem::size_of::<T>() bytes from it? After all, if I tell the compiler that we must trust that the bytes under the address are initialized, like one would assume the values under any array are, then we cannot invoke UB.

The compiler is allowed to assume that the data is initialized. It doesn't have to prove it.

Just to be clear: reading uninitialized memory is also undefined behavior in C.

I'm aware that the compiler isn't my opponent, out to slay the programmer, but it can still make assumptions and optimizations based on whether the value is initialized or not (the semi-famous std::mem::uninitialized::<bool>()/println example). But if the compiler cannot figure it out, and assumes that it is all initialized, then how can it produce unintended code if all we're giving it is a byte array?

Similarly, if I have the following code:

let my_array = [1u8, 2, 3, unsafe { std::mem::uninitialized::<u8>() }, 5, 6];
let my_array_2 = black_box(my_array); // The compiler doesn't know that
                                      // my_array_2 contains uninit.
                                      // Passing my_array to it could
                                      // already be UB, though.
if my_array_2[3] > 128 {
    println!("High");
} else {
    println!("Low");
}

What I'm supposing is that if we have a way to not let the compiler know where we got the data from, a literal black box for the compiler, then how can we produce invalid code?
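For reference, the standard library has since gained `std::hint::black_box`. Its documentation describes it as a best-effort hint that programs cannot rely on for correctness beyond it behaving as the identity function, so it does not change what the language considers initialized. A minimal sketch with a fully initialized input, using a hypothetical `classify` helper:

```rust
use std::hint::black_box;

/// Hypothetical helper: classifies a byte the way the snippet above does.
pub fn classify(byte: u8) -> &'static str {
    // `black_box` returns its argument unchanged; it only discourages the
    // optimizer from reasoning about the value, on a best-effort basis.
    if black_box(byte) > 128 {
        "High"
    } else {
        "Low"
    }
}
```

Because `black_box` is just an identity function as far as the language rules go, an uninitialized byte passed through it is still uninitialized.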

And even if it is UB to read uninitialized padding bytes, if I only use that function in debug mode, where no optimizations are performed, couldn't I be relatively sure that my program will not do anything unintended?

I.e., would the following make the OP's sample safer?

#[cfg(debug_assertions)]
fn get_raw_bytes<T>(value: &T) -> &[u8] {

There are still some optimizations in debug mode. You can't literally turn all optimizations off. And it is just as much undefined behaviour in debug mode, even if it's a bit more difficult to get it to cause miscompilation.

If I understand you correctly, I think you're hoping for some way to access the semantics of the freeze instruction in LLVM. This is kind of like a black box because it stops LLVM from propagating undefined (uninitialized) values, effectively turning a "maybe undefined" value into a "defined but arbitrary" value.

Unfortunately we can't do this in Rust yet (but maybe someday?)

The entry on padding in the Unsafe Code Guidelines is relevant:

Padding can be thought of as [Pad; N] for some hypothetical type Pad (of size 1) with the following properties:

  • Pad is valid for any byte, i.e., it has the same validity invariant as MaybeUninit<u8>.
  • Copying Pad ignores the source byte, and writes any value to the target byte. Or, equivalently (in terms of Abstract Machine behavior), copying Pad marks the target byte as uninitialized.
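Since the quote gives `Pad` the same validity invariant as `MaybeUninit<u8>`, one way to look at a value's storage without claiming the bytes are initialized is to view it as a slice of `MaybeUninit<u8>`. This is only a sketch: `as_maybe_uninit_bytes` is a hypothetical helper, and it assumes `T` contains no interior mutability.

```rust
use std::mem::{size_of, MaybeUninit};

/// Views the storage of `value` as possibly-uninitialized bytes.
/// Hypothetical helper, not an API from the standard library.
pub fn as_maybe_uninit_bytes<T>(value: &T) -> &[MaybeUninit<u8>] {
    // Assumption: `T` has no interior mutability (no `Cell`, `Mutex`, ...).
    // `MaybeUninit<u8>` is valid for any byte, including padding, so creating
    // this slice does not assert that the bytes are initialized.
    unsafe {
        std::slice::from_raw_parts(
            (value as *const T).cast::<MaybeUninit<u8>>(),
            size_of::<T>(),
        )
    }
}
```

Calling `assume_init` on an element is still only sound for bytes known to be initialized, so this does not by itself make printing padding well-defined.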

So according to the limited spec we have, this is exactly the same as reading from any other uninitialized memory. Some people have tried to figure out ways to make this sort of operation available without UB; see for example:

Imagine an optimization pass that decides it is faster to allocate these values in registers, rather than the stack. And it doesn't allocate any register for my_array[3] because it's not been initialized. What is the code following that supposed to do? Now my_array_2[3] literally refers to no memory location.

It'll probably just replace it with true or false on a whim. For example, this prints neither Big nor Small:

fn main() {
    let a: u32 = unsafe {
        std::mem::uninitialized()
    };
    
    if a < 150 {
        println!("Small");
    }
    if a > 100 {
        println!("Big");
    }
}

In fact it might even say, "they promised not to read from this uninitialized value, so that means the code will never reach this comparison, so I can just optimize it out".
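For contrast, `std::mem::uninitialized` has since been deprecated in favour of `MaybeUninit`. Below is a sketch of a well-defined variant of the example, using a hypothetical `small_and_big` function in which the slot is written before it is read. For any initialized `u32`, at least one of the two comparisons must be true.

```rust
use std::mem::MaybeUninit;

/// Hypothetical rewrite of the example: the slot is written before it is read.
/// Returns (`a < 150`, `a > 100`) instead of printing.
pub fn small_and_big(initial: u32) -> (bool, bool) {
    let mut slot = MaybeUninit::<u32>::uninit();
    slot.write(initial);
    // Sound only because `write` ran on the line above.
    let a = unsafe { slot.assume_init() };
    (a < 150, a > 100)
}
```

A result of `(false, false)`, the equivalent of printing neither line, is impossible here; it only shows up once the read happens without the write.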

Funnily enough, if you replace <, > with <=, >=, it prints both.

But my intent here is to point out that 'uninitialized memory' need not act like (or be) memory at all, so reading it is fundamentally not a meaningful operation.

That's essentially the way that LLVM treats it. If you code something that gets through the rustc frontend yet violates LLVM's input constraints (e.g., by lying to the compiler in an unsafe block), LLVM is free to do anything at all with it, and do so differently each place in the program where the item is referenced, and each time any of your code, invoked library code, or the compiler changes.

I'm a bit troubled here by the notion that merely reading uninitialised memory as bytes leads directly to undefined behaviour.

Well, I did some googling, and came up with this link: EXP33-C. Do not read uninitialized memory, which I hope is a good reference. On this page they say:

If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate.

and following the link we find

indeterminate value [ISO/IEC 9899:2011]
Either an unspecified value or a trap representation.

unspecified value [ISO/IEC 9899:2011]
A valid value of the relevant type where the C Standard imposes no requirements on which value is chosen in any instance. An unspecified value cannot be a trap representation.

So an unspecified value is a "valid value" (our relevant type is u8), and this isn't "undefined behaviour" yet. However, undefined behaviour seems to occur as soon as we look at the data:

The value of an object with automatic storage duration is used while it is indeterminate (6.2.4, 6.7.9, 6.8).

Damn.

This seems excessively severe to me, but yes, I can see how that arises. We've already established that an "unspecified value" is not a stable value, so it can produce unpredictable results in code that uses it. I can imagine that in principle the following code:

let s = format!("{:02x?}", get_raw_bytes(&padded_object));

could end up with unpredictable (and therefore potentially non UTF-8) characters in s ... or even worse behaviour.

I am aware that this topic has been discussed at huge length, particularly over on Internals... It does seem to me though that Rust is making a deliberate choice here (obviously driven by the existing LLVM back end) which is really making a rod for our own back. If nothing else, couldn't we have some kind of optimisation barrier available, something to say: "treat this code as already initialised"?

I'm sorry for naively rehearsing old stuff ... but I am on Users :)

That's blocked on LLVM as far as I can tell; we need the freeze operation.

Another possibility here is to use volatile reads and writes on the item, which tell LLVM that another unobservable process (e.g., external hardware) may be reading and/or writing the item. That inhibits virtually all optimizations with respect to the item, so expect your code performance to suck massively.
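As a rough sketch of what volatile access looks like in Rust, the hypothetical `volatile_bytes` helper below copies out the bytes of a `u32` with `std::ptr::read_volatile`. It deliberately uses a type with no padding, so every byte it reads is initialized; it does not establish that volatile reads make padding well-defined.

```rust
use std::ptr;

/// Hypothetical helper: copies the bytes of a `u32` out with volatile reads.
pub fn volatile_bytes(value: &u32) -> [u8; 4] {
    let base = (value as *const u32).cast::<u8>();
    let mut out = [0u8; 4];
    for (i, byte) in out.iter_mut().enumerate() {
        // Each access goes through `read_volatile`, so the compiler must not
        // elide or merge it. The pointer stays in bounds because `i < 4`,
        // which is the size of a `u32`.
        *byte = unsafe { ptr::read_volatile(base.add(i)) };
    }
    out
}
```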

You may believe that a u8 has only 256 possible values, but LLVM and Miri treat it as having 257 possible values, with the extra value having the meaning unassigned / unspecified / unreachable. That extra state permits LLVM to assume that any referencing code is unreachable, along with any code that always executes with it, so LLVM can ignore or optimize arbitrarily any basic block in which it is referenced. That perhaps-unintuitive behavior is the price that is paid to have a very highly optimizing compiler.
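The "257 values" idea can be written down as a toy model. The `AbstractByte` type and `read_as_u8` function here are hypothetical names for illustration only, not how LLVM or Miri are implemented, and `None` merely stands in for "undefined behaviour":

```rust
/// Toy model of one byte as the abstract machine sees it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AbstractByte {
    /// The "257th" state: no value has been assigned.
    Uninit,
    /// One of the 256 ordinary values.
    Init(u8),
}

/// Models a typed read at `u8`: `None` stands in for undefined behaviour.
pub fn read_as_u8(byte: AbstractByte) -> Option<u8> {
    match byte {
        AbstractByte::Init(v) => Some(v),
        AbstractByte::Uninit => None,
    }
}
```

In this model there is no `u8` that the `Uninit` case could sensibly return, which is one way to see why the read itself is the problem rather than what is done with the result.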
