What counts as undefined behavior

This is exactly what @H2CO3 said. I'm asking if there is any good reason for this. Maybe we could just change the spec if there is nothing in rustc that makes it impossible?

The playground link is broken, btw. But this is a great answer! Still, how is this possible? A byte only has 256 possible values.

EDIT:

Oh... I see. Rustc determines this at compile time and reduces the entire thing directly to false. But that just seems artificial to me. And I could imagine there are situations where the compiler can't figure this out at compile time (like when doing a sys-call or any foreign function call).

This is exactly what we are trying to tell you. The abstract machine is not the same as the hardware. There is a special "uninitialized" value for each type, with basically a license for the compiler to interpret it however it pleases.

To clarify, it is not actually the case that a real, 8-bit integer number happened to be neither less, equal, nor greater than 120. Instead, it is the case that your program contained an invalid operation, and thus the compiler emitted nonsense. There is probably no comparison or logical disjunction in the generated assembly at all.

The same assertion could never fire if you don't write soundness bugs and always initialize your values, because in that case, the compiler wouldn't optimize away your code and replace it with garbage that appears to be mathematically impossible.
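
Here's a minimal sketch of the kind of program we're talking about (a reconstruction; the exact playground snippet may differ):

use std::mem::MaybeUninit;

fn main() {
    // UB: reads an uninitialized u8.
    let x: u8 = unsafe { MaybeUninit::uninit().assume_init() };
    // With optimizations on, the compiler may fold each comparison to
    // false, so this assert can fail even though that looks
    // mathematically impossible for a real 8-bit integer.
    assert!(x < 120 || x == 120 || x > 120);
}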

8 Likes

Thanks! I fixed the playground link.

The 257 values only exist while the compiler is optimizing your code, and this allows for additional optimizations. Obviously the "257th value" does not exist in the compiled artifact. For example, things like a "write x to y" compile down to a no-op if x is the uninitialized value. The fact that there's no way to check whether memory has been initialized or not makes this possible.


If you are interested in this topic, you may want to look at the MaybeUninit<T> type which always has the same size and alignment as T, but has all possible values in its domain, including uninitialized memory. Thus, values of this type can exist even if they're uninitialized. It can also store things such as a 2 in a MaybeUninit<bool>.
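
A quick sketch of those properties (nothing in this snippet is UB):

use std::mem::{align_of, size_of, MaybeUninit};

fn main() {
    // Same size and alignment as the wrapped type:
    assert_eq!(size_of::<MaybeUninit<u64>>(), size_of::<u64>());
    assert_eq!(align_of::<MaybeUninit<u64>>(), align_of::<u64>());

    // An uninitialized value may exist behind MaybeUninit:
    let x = MaybeUninit::<i32>::uninit();
    let _ = x; // fine, as long as we never call assume_init on it

    // And an initialized one can be extracted again:
    let y = MaybeUninit::new(42_i32);
    assert_eq!(unsafe { y.assume_init() }, 42);
}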

In fact, by copying around such values, memory that has been initialized can be deinitialized by writing a MaybeUninit::uninit() to it. For example, consider the &mut UninitSlice type from the bytes crate, which is like a &mut [MaybeUninit<u8>], but which exists only to forbid the de-initialization of any regions of the memory that have already been initialized. This allows code to initialize part of the region, pass out the reference to unknown code, and then trust that the initialized region is still initialized. If it passed out a &mut [MaybeUninit<u8>], then this would be wrong, as the unknown code could have written a MaybeUninit::uninit() to the region.
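
A sketch of the hazard (the callee name here is made up for illustration):

use std::mem::MaybeUninit;

// A callee that receives &mut [MaybeUninit<u8>] may legally
// de-initialize bytes the caller had already initialized.
fn untrusted(buf: &mut [MaybeUninit<u8>]) {
    buf[0] = MaybeUninit::uninit(); // byte 0 is now uninitialized again
}

fn main() {
    let mut buf = [MaybeUninit::new(1_u8), MaybeUninit::new(2_u8)];
    untrusted(&mut buf);
    // Calling assume_init on buf[0] here would be UB, even though it
    // was initialized before the call. UninitSlice forbids exactly this.
}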

7 Likes

Right, I see. But the compiler produces nonsense simply because rustc says this should be UB. I could imagine writing rustc in such a way that it treats this as nondeterministic behavior instead of UB. And this trick is only possible if rustc can figure this out at compile time; there might be many situations where this is not possible (like with foreign function calls, sys-calls, etc.). And in those cases where Rust can figure out UB at compile time, wouldn't it be a much better idea to print an error during compilation, instead of silently removing the code? This still seems like a somewhat arbitrary decision to force this to be UB, where it really should just have been up to the programmer to keep track of their preconditions and invariants in their own code.

Safe Rust code makes it impossible (or rather, removes the need for such a case). You are using a different, more advanced language (which should be avoided where possible) when writing an unsafe block.

Optimization.

Well, the problem is that there's a reason that rustc (and other LLVM-based compilers) consider it UB. You can find the original motivation for adding undef to LLVM here, where it is used to eliminate various types of dead code that the compiler inserted itself.

There are situations in which the compiler generates some code at one stage, which sometimes turns out to be dead, relying on a future compilation pass to optimize it out. The optimizer doesn't keep track of enough information to tell the difference between "this is code I put in myself in a previous optimization pass" and "this came from the original source code", so it cannot emit such errors in the vast majority of circumstances.

8 Likes

I understand eliminating dead code is a good reason for adding undef, but what's the reason for making uninitialized values undef too? Wouldn't we be fine with compiling

int test() {
  int Y;
  return Y;
}

to

int %test() {
  %Y = alloca int
  %YV = load int* %Y
  ret int %YV
}

Especially in unsafe code. I see a good reason why I would want such behavior: it would be more compatible with the Hoare logic axioms of array and variable assignment. If I have a function f(x: i32) -> i32 that has no preconditions on x and some interesting postconditions, then I should be able to apply even an uninitialized x to f(x), because it satisfies the preconditions vacuously. A real-life example of such a function is a sys-call that populates a buffer from user input (well, it has a precondition on the buffer length, but not on its content). It would also allow for writing unit tests that assert postconditions of functions that have no preconditions. We could efficiently deal with functions that take large arrays as input but have no preconditions on those arrays as well. It would overall be more elegant and compatible with Hoare-logic-style axiomatizations (like in CompCert). It seems to me like this decision to make uninitialized reads UB was more of an arbitrary engineering choice than a conscious decision dictated by some formal system of thinking.

Well, by defining its signature to take i32, you have implicitly included a precondition that x cannot be uninitialized. On the other hand, if you changed its signature to accept MaybeUninit<i32> to allow x to be uninitialized, it would suddenly be UB for f to do anything that depends on its value, as you don't know whether it is initialized or not.
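
To make the two signatures concrete (a sketch; the bodies are invented for illustration):

use std::mem::MaybeUninit;

// Taking i32 implicitly demands an initialized argument:
fn takes_init(x: i32) -> i32 {
    x + 1 // always fine: x is known to be a real i32
}

// Taking MaybeUninit<i32> permits an uninitialized argument, but then
// the body must not depend on the value:
fn takes_maybe_uninit(x: MaybeUninit<i32>) -> i32 {
    // `unsafe { x.assume_init() }` would be UB whenever the caller
    // passed MaybeUninit::uninit(), so we cannot look at x at all.
    let _ = x;
    0
}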

Making uninitialized reads UB is a conscious engineering choice that ultimately originates from it being a useful definition for optimizations. That said, I assure you that a lot of thought has been put into how these things can be formalized.

5 Likes

Isn't this C code? This isn't particularly relevant to what is undefined in Rust. In Rust, analogous code will not compile.

Actually, I find it quite relevant. The rules for uninitialized memory originate from C. Rust has merely inherited them.

To clarify, the snippet comes from a link I posted.

The discussion centers around the idea that @aleksander-mendoza considers it unintuitive that this C code has undefined behavior. And rightly so. But Rust will prevent you from writing such code.

If you rewrite this in Rust to actually read uninitialized memory, e.g.:

use std::mem::MaybeUninit;

pub fn test() -> i32 {
    let Y = MaybeUninit::<i32>::uninit();
    unsafe { Y.assume_init() }
}

Then it becomes very clear that something is wrong with this, given that you had to write "assume_init" on an uninitialized variable, and "unsafe".

@tczajka The link that the snippet comes from has examples where uninit lets us optimize out dead code. It's not about code that's unsound. In fact, C code is more relaxed around these rules, so it really translates more closely into this:

use std::mem::MaybeUninit;

pub fn test() -> MaybeUninit<i32> {
    let Y = MaybeUninit::<i32>::uninit();
    Y
}

with the .assume_init() call happening on the first use of Y in the caller, meaning that the C code is sound if you never use the return value. The point is then that before adding uninit to C, Y would be initialized to zero instead, so the function would end with a mov EAX, 0 to zero the return value. After adding uninit to C, this mov instruction could be optimized out instead, since "mov EAX, uninit" is a no-op.
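
The caller side of that translation looks roughly like this (sketch):

use std::mem::MaybeUninit;

fn test() -> MaybeUninit<i32> {
    MaybeUninit::uninit()
}

fn main() {
    let y = test(); // sound: nothing has been assumed initialized
    let _ = y;      // still sound: we never look at the value
    // let n = unsafe { y.assume_init() }; // this would be the UB step
}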

1 Like

What you're looking for is called the freeze operation, which takes an uninitialized value and concretizes it into an arbitrary-but-consistent bitstring. Rust doesn't currently expose this operation, but it's very possible that it will in the future.
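
Roughly what that operation might look like if it were exposed (hypothetical; the name and signature below are invented, not a real API):

use std::mem::MaybeUninit;

// Hypothetical: a real freeze would pin every uninitialized byte to some
// arbitrary but fixed value, after which reading the bytes is defined,
// even though their contents are still garbage.
fn freeze<T>(_x: MaybeUninit<T>) -> MaybeUninit<T> {
    unimplemented!() // not implementable in today's stable Rust
}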

These PDF slides may help clarify why undefined values exist in LLVM's Abstract Machine.

It's important to understand that when you write Rust code, you are not programming against your target CPU — you are programming against the Rust Abstract Machine, and all of the compiler's analysis and optimization of the code is based on the code's semantics on the Rust Abstract Machine. Without a formal specification of the Rust Abstract Machine, it is informally defined by how rustc lowers Rust code to MIR and then to LLVM IR, which does have a (currently self-contradictory, but at least present) somewhat defined model. As a matter of practicality, Rust inherits most of its memory model directly from LLVM.

sadness and unknowns

At the current moment, I'm not convinced that this is the case. Or well, to be clear: you can absolutely write 0x02_u8 to a MaybeUninit<bool> and then do a typed copy of that MaybeUninit<bool>. What I'm unsure of is whether reading that value from the new location is guaranteed to read 0x02.
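
Concretely, the scenario in question (a sketch):

use std::mem::MaybeUninit;

fn main() {
    let mut m = MaybeUninit::<bool>::uninit();
    // Writing 0x02 is fine: MaybeUninit<bool> admits any byte value.
    unsafe { m.as_mut_ptr().cast::<u8>().write(2) };
    let copy = m; // a typed copy at type MaybeUninit<bool>
    // Whether this read is guaranteed to observe 0x02 is exactly the
    // open question.
    let b = unsafe { copy.as_ptr().cast::<u8>().read() };
    println!("{b}");
}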

What this comes down to is that MaybeUninit is #[repr(transparent)] and has the layout and ABI of the wrapped type. This means that with an ABI that, for example, passes (u64, u32) separately in registers, the padding bytes are not preserved on a typed copy; thus MaybeUninit<(u64, u32)> on such an ABI would not preserve the padding bytes.
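
For instance (a sketch of the concern, not a demonstration of actual codegen):

use std::mem::MaybeUninit;

// On an ABI that passes (u64, u32) in two separate registers, the
// trailing padding bytes of the tuple may not survive this call, even
// at type MaybeUninit<(u64, u32)>.
fn pass_through(x: MaybeUninit<(u64, u32)>) -> MaybeUninit<(u64, u32)> {
    x // a typed copy; padding may flush to uninit
}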

This also isn't an issue we can simply say is a bug in the current definition of the standard library; we absolutely do want to inherit the ABI for things like MaybeUninit<#[repr(simd)] f32x4> to be passed directly in SIMD registers rather than general-purpose ones.

bool is probably safe — we say that bool happens to match C's _Bool on all current platforms, but make no promises about other platforms, so we can say that bool has the ABI of C's uint8_t rather than C's _Bool on the platforms where the ABI of _Bool mandates that it only be 0x00 or 0x01.

Basically, I'd be careful with MaybeUninit. As much as we would like #[repr(Rust)] union to be a simple untyped bag of bytes, MaybeUninit is by necessity at least a bit more complicated. I'd currently ideally only count on bytes which are potentially valid for the wrapped type being preserved and treat any other bytes as flushing to uninit on typed copy.

8 Likes

Ok, so uninitialized values are UB, apparently.

Then what about Vec::with_capacity and File::read?

Formally, those allocate memory and do not use the MaybeUninit type. Hence they produce values that are not inhabitants of their type. Isn't half of the stdlib implementation wrong then? Shouldn't RawVec use MaybeUninit? Shouldn't IoSliceMut use MaybeUninit? Let's be truly rigorous here. How do you formalize this to be UB

use std::mem::MaybeUninit;

pub fn test() -> MaybeUninit<i32> {
    let Y = MaybeUninit::<i32>::uninit();
    Y
}

while at the same time, Vec::with_capacity and File::read are apparently not UB?

The difference is that they aren't "producing" the value.

It's perfectly fine to have a pointer to whatever. When you have a raw pointer, it doesn't matter what's behind that pointer, or if that pointer is even valid in the first place.

What matters is when you actually read a value which isn't valid. At the Abstract Machine level, the UB happens when you do a "typed copy" of the value. (At least in the current proposed model, anyway. There's an open discussion on "invalid not used again" values.)
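
For example (sketch):

fn main() {
    // Merely holding a dangling or garbage raw pointer is fine:
    let p = 0xdead_beef_usize as *const u32;
    println!("{p:?}"); // no UB: we never read through p
    // UB would only occur at the typed copy, i.e. `unsafe { *p }`.
}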

(References are in a weird soft middle ground where it's not exactly clear one way or the other whether they care about the validity of the pointee. For now, presume that references to invalid values are themselves invalid — they're certainly unsound to expose, since reading through them is safe code and yet UB, as it produces the invalid value.)

Also, that example (returning the MaybeUninit<i32>) isn't UB. (It can't be — it's safe code.) The UB occurs when you do .assume_init() to turn the uninit into an i32.

3 Likes

Raw pointers are allowed to point at anything, including memory that doesn't contain a valid value for the pointer type. It is not until you try to read from the memory that such a requirement comes into play. The vector type is ok because it never reads from its allocation until after it has written something. (It keeps track of how much has been initialized using its length field.)
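
A minimal sketch of that pattern, using a fixed-size array in place of a real allocation:

use std::mem::MaybeUninit;

fn main() {
    // Uninitialized storage plus a counter tracking how much of it
    // has been initialized so far.
    let mut storage: [MaybeUninit<u32>; 4] = [MaybeUninit::uninit(); 4];
    let mut len = 0;

    // Write before reading, like Vec does with its length field.
    storage[len] = MaybeUninit::new(42);
    len += 1;

    // Only ever read from the initialized prefix.
    for slot in &storage[..len] {
        println!("{}", unsafe { slot.assume_init() }); // sound: written above
    }
}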

This is not UB.

(There's no unsafe, so it could not possibly be UB.)

6 Likes

Note that this discussion is on the border between topics suited for each of Rust’s two main fora:

  • If you are mostly interested in clarification of what is allowed today, you’re in the right place.
  • If you are mostly interested in advocating a change to the rules, the internals forum would be a better place for the discussion.
2 Likes

So in that case, what is the type of the data stored behind the pointer?

let p: *mut u32 = allocate_stuff();

Is the type T or MaybeUninit<T>? If I do

let i: u32 = unsafe { *p };

what is the type of *? Is it a function *mut T -> T or *mut T -> MaybeUninit<T> or *mut MaybeUninit<T> -> T? Shouldn't we always write

let i: u32 = unsafe { MaybeUninit::assume_init(*p) };

From what I understand, formally, allocation gives me a pointer *mut T, but behind this pointer is actually data of type MaybeUninit<T> initialized with the "invalid value" (the 257th byte value, as you said).
I suspect that dereferencing a pointer automatically and implicitly calls assume_init and converts MaybeUninit<T> to T, with the precondition that the value is not invalid (not the 257th byte value). So let's say I violate this precondition and dereference the 257th value anyway. The 257th value is only a compile-time abstraction; it cannot live at runtime. So if Rust can figure out at compile time that an invalid value is being dereferenced, it will be free to remove my code. But when this cannot be decided at compile time, rustc instead has to resort to simply assuming (as a compiler invariant) that I do not dereference an invalid value. Then it is up to me to ensure this at runtime, right? But the 257th byte value does not exist at runtime, therefore whatever I do, even if I dereference uninitialized memory, I will not dereference an invalid value. Therefore this entire invariant, that "the user cannot dereference the 257th byte value", is a tautology (at runtime). And if Rust does not artificially enforce UB upon me (by removing my code at compile time), but instead goes ahead with my "illegal" code anyway, then could my "illegal" operation result in UB at runtime? If I do

let t = unsafe { *p };

the behavior is actually well defined. It must read whatever value is behind the pointer, because it couldn't decide this to be UB at compile time (and didn't break my code). This dereference can't crash at runtime; there is no 257th byte value at runtime. And even if Rust wanted to insert some checks at runtime (like it does for out-of-bounds access to slices), there really is nothing it could possibly check (unless it scans the entire RAM and compares it somehow). So this entire invariant is not only a tautology at runtime, but deciding whether memory has been initialized or not is also impossible in general.

Is there anything I got wrong about this?

As far as I see, rustc has this pesky 257th byte value for no good reason at all, and it sometimes leads rustc to shoot itself in the foot, even if I write perfectly good and correct code. This whole "invalid value" thing is just there to religiously protect users from uninitialized data as if it were some devil, while in reality uninitialized data is nothing more than a variable that has a trivial precondition (one that is always true). Working with uninitialized data is as much of a wrong thing as introducing any other bug in the code. If you violate preconditions, then it is not because "uninitialized data is bad" but because you violated your preconditions and have a bug in your code. Unless Rust incorporates some SMT solver and a formal specification language, this can't be helped. And notice that all MaybeUninit does is work like a predicate that checks whether memory was written to, but it does not tell me what has been written there. If my proof of correctness does not care what has been written to a variable, then working with uninitialized data is perfectly valid. Any proof of correctness that has "x must be initialized" as a precondition will still hold even without that precondition. After all, what does "initialized" even mean? This word has no formal meaning. "x was written to" is a useless predicate, just like an Any type would be useless in Haskell.

I'd say MaybeUninit is a useless language feature that shouldn't be there. It does more harm than good. Actually, reading uninitialized data is not undefined behavior. It is defined but nondeterministic behavior (like reading input from a file/console/etc.). Rust just artificially makes it into UB. If my reasoning above is correct, then maybe we should really start a thread like this on the internals forum, like @2e71828 suggests. We can't get rid of MaybeUninit for backwards compatibility, but we can change the spec so that there is no unnecessary UB. After all, Rust claims to be better than C because it's better at avoiding UB, right? (Producing uninitialized values should still be impossible in safe code, of course, but if a user believes their preconditions allow using uninitialized data, Rust should not blow up their unsafe code.)

Yes. Dereferencing a pointer is a binding promise from the developer to the compiler. The developer has to ensure that they are accessing valid data, not undefined data.

Most CPU ISAs don't have UB (most, not all!). Thus yes, at runtime you wouldn't get the 257th value (if you are not using Miri).

This would make all your data structures effectively volatile and make most optimizations impossible. The whole point of UB is to make the programmer obey certain rules and thus make optimizations possible.

Yes, but since that's a task which the programmer must do, and not the compiler, it's the programmer who has to be able to determine that. It's the duty of the programmer not to write code which cannot be reasoned about.

Perfectly good and correct code never reads the 257th value. That's the definition of “perfectly good and correct code” as far as the compiler is concerned.

It's there to remove dead, useless code. Good code cannot read the 257th value, thus by definition all code that does is bad, dead, and can be removed.

Kinda, only any UB is “artificial” in that sense. The compiled binary is predictable: you can run it a billion times and, if the environment is unchanged, it will produce the same result a billion times.

UB is just something which shouldn't even happen in a correct and valid program.

What do you plan to accomplish by doing that?

Safe Rust, yes. Unsafe Rust, no. It even has some UB which C doesn't have.

Why? No, really, why? Why does this specific UB bother you so much that you want to declare jihad on it and try to turn it into non-UB? What's the end goal?

Consider the following code (not Rust, but I hope it's clear enough):

#include <stdio.h>

void set(int x) {
    int a;   /* a lands in some stack slot */
    a = x;
}

int add(int y) {
    int a;   /* uninitialized; may happen to reuse set()'s stack slot */
    return a + y;
}

int main() {
    int sum;
    set(2);
    sum = add(3);
    printf("%d\n", sum);
}

If I know how the stack works, how memory works, how assembler works… I may expect that it should work and return 5. And it actually does return 5 with many old compilers! It even returns 5 on modern compilers if you disable optimizations!

Now, can you explain why it's not Ok for me to expect 5 from this program, yet it's Ok for you to expect some predictable output from a program which reads uninitialized memory?

They both violate the exact same rule (reading uninitialized memory is undefined behavior), and they both expect that everything will work because the developer knows the real world better than the compiler.

And if this program is actually supposed to work in Rust, then how can the compiler do any optimizations at all?

3 Likes