What counts as undefined behavior

I've been taking a closer look at
https://doc.rust-lang.org/reference/behavior-considered-undefined.html
It is rather informal and misleading. It seems to me that I have noticed a logical inconsistency in what is defined as UB, and we should probably fix that page (or make it less ambiguous).

First of all, let us make clear the distinction between undefined and non-deterministic behavior. If I have a function f defined on a domain A and attempt to apply f to some b∉A, then the output f(b) is undefined. If f is not a function but a relation, then the output f(a) is non-deterministic (several outputs may be consistent with the relation).
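
To make that concrete, here is a toy sketch (sqrt_nat and sqrt_rel are my own made-up illustrations): a partial function modeled with Option, and a relation that admits several outputs:

fn sqrt_nat(x: i32) -> Option<i32> {
    // Defined only on perfect squares: outside that domain the "output"
    // is undefined, modeled here as None.
    (0..=x).find(|r| r * r == x)
}

fn sqrt_rel(x: i32) -> Vec<i32> {
    // As a relation, both roots are related to x: the output is
    // non-deterministic in the sense that several values are consistent.
    match sqrt_nat(x) {
        Some(0) => vec![0],
        Some(r) => vec![r, -r],
        None => vec![],
    }
}

fn main() {
    assert_eq!(sqrt_nat(9), Some(3)); // in the domain: defined
    assert_eq!(sqrt_nat(7), None);    // outside the domain: undefined
    assert_eq!(sqrt_rel(9), vec![3, -3]); // two consistent outputs
}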

EDIT2:
Ok, so uninitialized values are apparently UB.

Then what about
src/alloc/raw_vec.rs.html#131
src/libc/unix/mod.rs.html#583

Formally, those things allocate memory without using the MaybeUninit type. Hence they produce values that are not inhabitants of their type, and are thus UB.

Isn't half of the stdlib implementation wrong, then? Shouldn't RawVec<T> use MaybeUninit<T>? Shouldn't IoSliceMut use MaybeUninit<u8>? Let's be truly rigorous here. How do you formalize that this is UB,

use std::mem::MaybeUninit;

pub fn test() -> i32 {
    let y = MaybeUninit::<i32>::uninit();
    // claims the uninitialized i32 is initialized
    unsafe { y.assume_init() }
}

while at the same time Vec::with_capacity and File::read are apparently not UB? So again, I'm modifying my question a little.
END OF EDIT 2

(OLDER) EDIT 1:

I got many great explanations, and I want to modify my question a little. So let me make the following statement:

Producing values that are not inhabitants of their type leads to UB, whereas reading uninitialized memory (such that all possible bit patterns are inhabitants of the type) is non-deterministic but not UB.

Is there anybody who could disprove this statement? And let's also say that rustc doesn't artificially force it to be UB, like @H2CO3 said.

END OF (OLDER) EDIT 1

Second, what is memory? Let us use the formal model of memory from https://compcert.org/doc/html/compcert.common.Memory.html. In short, a memory state m is defined as a set of memory blocks b. Each block has a specific length and can be treated like an array *mut T holding data of a specific type. A pointer is a tuple (b, ofs) where ofs is an integer offset into b. Pointers can access memory with the load function:

load chunk m b ofs - performs a read in memory state m, at address b and offset ofs. It returns the value of the memory chunk at that address

The important definition of valid access is

valid_access m chunk b ofs p holds if a memory access of the given chunk is possible in m at address b, ofs with current permissions p. This means:

  • The range of bytes accessed all have current permission p.
  • The offset ofs is aligned.

Hence load is a partial function whose domain is given by range_perm m b ofs (essentially, the valid byte range of the block b). Naturally, undefined behavior arises when we apply load to an ofs that falls outside this domain.
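
For illustration, here is a toy Rust rendering of that model (my own simplification, not CompCert's actual Coq definitions; permissions and alignment are elided):

struct Block {
    bytes: Vec<u8>, // the block's contents; its length bounds the valid offsets
}

struct Memory {
    blocks: Vec<Block>, // memory state m: a collection of blocks indexed by b
}

// `load` is a partial function: it is defined only when (b, ofs) lies in
// the valid range of the block. Outside that domain we return None,
// standing in for "undefined".
fn load(m: &Memory, b: usize, ofs: usize) -> Option<u8> {
    m.blocks.get(b)?.bytes.get(ofs).copied()
}

fn main() {
    let m = Memory { blocks: vec![Block { bytes: vec![1, 2, 3] }] };
    assert_eq!(load(&m, 0, 1), Some(2)); // valid access: defined
    assert_eq!(load(&m, 0, 9), None);    // out of range: undefined
}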

I dislike the wording in this paragraph of the Rust reference:

Producing an invalid value, even in private fields and locals. "Producing" a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation. The following values are invalid (at their respective type):

Ok, so what does "invalid value" mean? As we are in a strongly typed language, I presume that a is an invalid value of type A if a is some pattern of bits that could not be constructed using the defined constructors of A. In other words, such an a is not an inhabitant of type A, and applying it to any function that accepts A is undefined. The wording "Producing an invalid value" is unfortunate, because just producing (writing) the value, without using it (not even calling drop), cannot lead to undefined behavior. This is just a minor detail, though. The next issue is much more significant.

Examples like "A null fn pointer" or "A discriminant in an enum not included in the type definition" make perfect sense. The example

  • An integer (i*/u*), floating point value (f*), or raw pointer obtained from uninitialized memory, or uninitialized memory in a str.

Makes no sense according to our definition of "invalid value", because every bit pattern (of the right length) is a valid integer, float or string that could in principle be constructed. Those bit patterns are indeed inhabitants of their respective types and can be applied to any function that accepts that type.
Moreover, what does "uninitialized memory" formally mean? If I do

let arr = unsafe { Box::<[u32]>::new_uninit_slice(n).assume_init() };
println!("{:?}", arr);

is this undefined behavior? What if I do

let mut arr = unsafe { Box::<[u32]>::new_uninit_slice(n).assume_init() };
read_from_some_file(&mut arr);
println!("{:?}", arr);

The first one is clearly non-deterministic, the second one deterministic, but the behavior in both cases is actually well defined. Some people are confused and think that memory becomes "valid" only after you write to it. Their line of thinking is as follows.

There is a logical predicate INIT(x), and when a new memory location a: *mut A is allocated with a = alloc(), it starts with INIT(a) == false. The function write(x, y) writes value y to memory x; it is the only function that does not have INIT(x) as a precondition, but it has INIT(x) as a postcondition. The function read(x) has INIT(x) as a precondition. With such a system of axioms it is impossible to read a memory location before writing to it first. People seem to believe that Rust imposes INIT(x) as a precondition on all functions (except write), or even that INIT(x) is a compiler invariant (whatever that might mean, since clearly INIT(x) cannot be satisfied at the moment x is allocated).
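
Spelled out as Rust types, the axiom system looks roughly like this (a minimal sketch with made-up alloc/write/read; this is essentially the typestate that std::mem::MaybeUninit provides):

use std::mem::MaybeUninit;

struct Uninit<T>(MaybeUninit<T>); // INIT(x) == false
struct Init<T>(T);                // INIT(x) == true

fn alloc<T>() -> Uninit<T> {
    // Fresh memory starts with INIT == false.
    Uninit(MaybeUninit::uninit())
}

// `write` is the only operation without INIT as a precondition; the
// change of type witnesses INIT as its postcondition.
fn write<T>(x: Uninit<T>, y: T) -> Init<T> {
    let mut m = x.0;
    m.write(y);
    // Safety: we just wrote to it.
    Init(unsafe { m.assume_init() })
}

// `read` demands INIT as a precondition: it only accepts Init<T>.
fn read<T: Copy>(x: &Init<T>) -> T {
    x.0
}

fn main() {
    let a = alloc::<u32>();
    // read(&a); // does not type-check: INIT(a) was never established
    let a = write(a, 42);
    assert_eq!(read(&a), 42);
}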

Anybody who does not see why this belief is wrong and misleading should look at the following two problems.

(1) First, what good is INIT(x)? Such a predicate is useless from a formal point of view, because it does not imply anything useful. All it tells you is that some value was written to x, but it doesn't say what that value is. You can't use it to prove any interesting properties, and it makes no difference whether you include it as a precondition on your function or not.

EDIT:

So MaybeUninit is literally the predicate INIT. The point still stands: it's quite pointless in proofs. Everything would be much simpler if we got rid of the 257th byte value in the Rust Abstract Machine. It's hard to see a formal justification for its existence.
END OF EDIT

(2) Second, it is impossible for the compiler to decide. Look at the example with read_from_some_file above. Suppose it is a call into UNIX that hands control to the system, and the system populates arr. How could the compiler know whether the system actually wrote anything? This can't be decided at compile time (or even at runtime). So the compiler must resort to assuming INIT(arr) as an invariant. And if I then read(arr), the behavior is very well defined: the language can't crash, panic or do anything unexpected. It must return the value stored in arr, even if a malicious bug in read_from_some_file left arr "uninitialized". Therefore this line of thinking makes no sense.
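
(For reference, the way such a call is usually written keeps the buffer as MaybeUninit until the callee reports what it actually wrote. A minimal sketch, with read_from_some_file as a hypothetical stand-in for the raw call:)

use std::mem::MaybeUninit;

// Hypothetical stand-in for a raw syscall: fills the buffer and reports
// how many bytes it wrote (here it just zeroes everything).
unsafe fn read_from_some_file(buf: *mut u8, len: usize) -> usize {
    std::ptr::write_bytes(buf, 0, len);
    len
}

fn main() {
    let mut buf: Vec<MaybeUninit<u8>> = vec![MaybeUninit::uninit(); 4096];
    let n = unsafe { read_from_some_file(buf.as_mut_ptr().cast(), buf.len()) };
    // Only the first n bytes are now known to be initialized.
    let initialized: &[u8] =
        unsafe { std::slice::from_raw_parts(buf.as_ptr().cast(), n) };
    assert_eq!(initialized.len(), 4096);
}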

And finally I have one last issue with Rust Reference. We should not confuse violating data type invariants with undefined behavior. Take the example

struct Rational {
    numerator: u32,
    denominator: u32,
    // invariant: denominator != 0
}

Of course denominator cannot be 0. I could avoid it with

fn new_rational(n: u32, d: u32) -> Rational {
    assert_ne!(d, 0);
    Rational { numerator: n, denominator: d }
}

but there are lots of carefree programmers who do not bother making such assertions. Lots of people introduce bugs by not checking their invariants and producing "invalid values", even without unsafe code. So I would say the whole section

Producing an invalid value, even in private fields and locals. "Producing" a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation. The following values are invalid (at their respective type):

should be removed. Violating data structure invariants is not undefined behavior; it leads to bugs, but not UB. For example, having a null fn pointer is not UB. Dereferencing invalid memory blocks is UB, and a null pointer just happens to be an invalid memory block. But merely having a null fn pointer and working with it is not UB. The same goes for all the other examples.

This notion has nothing to do with bits. A primitive integer, say u64, has no bit patterns without a valid, unambiguous meaning: every possible bit pattern of 64 bits can be interpreted as a u64. However, the following still produces an invalid value and is UB, because you lied to the compiler about the place being initialized:

let x: u64 = unsafe { std::mem::MaybeUninit::uninit().assume_init() };

How are you so sure? This is in fact wrong; merely creating such a value is instantly UB, whether or not you read or drop it.

Indeed – which means that your definition of "invalid value" is wrong.

Rest assured, it is absolutely, positively, UB to have a null-pointer backed fn or reference.

6 Likes

Your model of undefined behavior is coherent, but is not the one Rust uses, because it is less powerful. For example:

We should not confuse violating data type invariants with undefined behavior.

By defining an invariant known to the language, such as that bool always contains the byte 0x00 or the byte 0x01, we grant the compiler permission to generate machine code that works with that bool in whatever way it likes. For example, it might sometimes generate “jump if not zero” instructions, or it might sometimes use the bool value as an index into an array (which will itself likely cause memory corruption if out of range). The reason that a bool having the memory value 0x02 is UB is so that safe code will always behave in ways that are consistent with some bool being there, even if the value is nonsense.

If the compiler were required to treat a non-0/1 bool as defined behavior, then it would have to always generate instructions which give it some consistent interpretation that does not crash. If it is not UB to have a bool with value 2, then the compilation of if some_bool { foo(); } else { bar(); } is required to leave some optimization opportunities on the table. That's probably trivial for bool, but in general this is very important for automatically generating high-performance code.
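
To make the array-indexing point concrete, here is a hypothetical sketch (my own illustration, not code the compiler literally emits) of the latitude that invariant grants:

fn select(flag: bool) -> u32 {
    // The optimizer is entitled to compile this as TABLE[flag as usize],
    // because the validity invariant says flag's byte is 0 or 1.
    const TABLE: [u32; 2] = [10, 20];
    TABLE[flag as usize]
}

fn main() {
    assert_eq!(select(false), 10);
    assert_eq!(select(true), 20);
    // If a bool with byte value 2 could exist, the table lookup the
    // compiler is entitled to emit would read out of bounds:
    // let bad: bool = unsafe { std::mem::transmute(2u8) }; // instant UB
}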

The issue with uninitialized bytes / u64s is similar, just subtler — uninitialized bytes aren't magically bad physical memory, they're memory that the compiler is allowed to mess with if that's convenient for it. Thus, reading them may produce inconsistent values (because some write thought it was fine to reuse that memory) — but optimization depends on being able to reorder reads in all sorts of ways.

In examples like the file-reading syscall, it's indeed hard to imagine how anything surprising could happen. But those are the edge cases of defining UB, where it looks silly, not the typical cases, which are hidden within the optimizer's analysis and transformation of all the code you ever write.

14 Likes

obligatory link whenever someone says something like this

13 Likes

Thanks for the explanation! Most of it makes sense, though there is one thing that is still not clear to me.

This is not a problem. If I allocate and use memory without initializing it, it has some garbage value, and I do not impose any preconditions on that value in my code. So even if the compiler rearranges the reads/writes, it will only change one garbage value into another. If I had no precondition on it, it will not be violated. And like you mention, if the compiler can willy-nilly write to something populated by a syscall, that leaves serious potential for UB, though I suppose the compiler could be smart enough to figure out that a mutable reference was passed to a foreign function and always treat it as if it was modified.

So if I allocate uninitialized integers, or something else whose every bit pattern is guaranteed to be an inhabitant of its type, then it should not be UB. The only UB is if we produce a value that is not an inhabitant of its type.

Can you elaborate on that?

Could you elaborate, or give an example of when this could lead to UB?

Ok, producing values that are not inhabitants of their type should be UB, as @kpreid pointed out with his bool example.

Indeed, the problem is not just what the compiler is allowed to do with the memory location. The really insidious aspect of UB is that the compiler is simply allowed to assume that there are no execution paths that encounter UB, and if the programmer breaks that promise, the semantics of the entire program become unrestricted by the language specification. This includes anything that happens before the UB part is actually executed! Just as an example: if the compiler can prove that one branch of a conditional jump always eventually leads to UB, for instance to the existence of a null reference or an i32 variable whose value is uninitialized, no matter how many zillions of instructions it takes, it is perfectly allowed to simply optimize out the entire branch and any and all resulting dead code.

To the compiler, UB is basically: "Okay, this code doesn't look dead to me, but I can prove it has UB, and I know that that can't happen, so I'll take your word for it and assume it is, in fact, dead code."

8 Likes

Here is where your errors start: there is no such thing as a precise universal model of memory; there are just specific models used in specific applications. CompCert defines a model of memory which is fit for its purposes and is intended to be compatible with the C memory model. That is not Rust's memory model! Rust's memory model isn't currently fully specified and is subject to change, but some features are certain. In particular, this

doesn't apply. Memory in Rust is explicitly untyped, just a sequence of bytes (though a byte isn't what one may naively expect). Type-punning via pointer casts or unions is explicitly allowed. On the other hand, pointer provenance is complicated, and uninitialized memory also works differently than in C or C++.

Rust doesn't have the concept of constructors. Values are constructed using the corresponding literals (e.g. Foo { bar: 5, baz: Mux::Var }). That means that for structs, the valid bit patterns correspond to a Cartesian product of the valid bit patterns of all fields (mapped to the actual memory according to the layout of the struct, which is by default unspecified), while for enums the valid bit patterns are constructed from the disjoint sum of the bit patterns of the variants, which are again mapped to memory in some unspecified way, with unspecified overlapping of different variants.
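
For illustration, a sketch of that product/sum structure (the pattern counts in the comments assume the layouts rustc happens to pick today; the default layout is unspecified):

#[allow(dead_code)]
struct Pair { a: bool, b: bool } // valid patterns: the product {0,1} x {0,1}

#[allow(dead_code)]
enum Tri { A, B, C } // valid patterns: the disjoint sum of three variants

fn main() {
    // 4 valid patterns out of 2^16 raw patterns, assuming a 2-byte layout:
    println!("Pair: {} bytes", std::mem::size_of::<Pair>());
    // 3 valid patterns out of 2^8 raw patterns, assuming a 1-byte layout:
    println!("Tri: {} bytes", std::mem::size_of::<Tri>());
}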

It's literally how UB works in Rust. Yes, simply producing an invalid value is UB, because you make an implicit assertion to the compiler that some piece of data has some specific properties. For example, &T must always be a live well-aligned not-null pointer to T. If you have p: *const T and you do let s: &T = &*p;, then you are asserting to the compiler that p in fact satisfies the validity invariants of a reference. For example, the pointer will be marked dereferenceable in LLVM, which allows the compiler to insert loads of that memory as it sees fit. If the pointer is dangling, you have immediate UB.
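
A minimal sketch of that assertion (the UB line is kept commented out, since merely creating the reference is the UB):

fn main() {
    let p: *const i32 = std::ptr::null();
    // Writing &*p asserts that p is a valid, aligned, non-null pointer
    // to an i32; LLVM may then mark it dereferenceable and speculate
    // loads through it.
    // let s: &i32 = unsafe { &*p }; // immediate UB if uncommented
    let _ = p;
}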

See the links above. I can give you more links, if you want.

I don't discuss in detail the reasoning that you provided next, because it stems from a fundamental misunderstanding of both Rust's memory model and its compilation model.

There are invariants imposed at the library level, and those imposed at the language level. Violating a library's logical invariants leads to library-level undefined behaviour (it can do whatever random thing), but will not cause language-level undefined behaviour, so your Rational example above is indeed not UB to create (though it may cause language-level UB if you, for example, perform an unchecked division by 0 after omitting the required checks). Language-level invariants may not be violated, period. Any violation, no matter how seemingly benign, or even if it's in dead code, means immediate UB. It may or may not lead to a miscompilation, but your program is broken. A null fn pointer is language-level UB, because the language says so. The compiler will compile and optimize your code under the assumption that fn pointers are non-null, and violating that assumption can lead to arbitrary results.
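
To illustrate the two levels with the Rational type from earlier (one caveat: in safe Rust, u32 division by zero is a guaranteed panic rather than UB, so reaching language-level UB here would take something like an unchecked division in unsafe code):

struct Rational {
    numerator: u32,
    denominator: u32, // library-level invariant: denominator != 0
}

fn main() {
    // Violates the library invariant: a buggy value, but not language UB.
    let r = Rational { numerator: 1, denominator: 0 };
    // Even using it in safe code fails in a *defined* way:
    // let q = r.numerator / r.denominator; // guaranteed panic, not UB
    let _ = r;
}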

3 Likes

This is merely a question of definitions and assumptions. The language designers assert that e.g. producing an uninitialized u64 is undefined; therefore, the compiler will optimize as-if there were never such an uninitialized value; therefore, since this assumption is violated, the generated code may be incorrect.

If you are asking for "proof" in the form of an actual program and actual machine code that happens to crash due to these forms of UB, then you are asking the wrong question. The question of undefined behavior is language-definitional; again, it's not about bits. It is not the case that "if you are currently unable to give constructive proof of UB, then it's not actually UB".


I started to make up an elaborate example about an uninitialized value, but expanding on @jdahlstrom's observation is actually much easier. Consider the following code:

let nuke_being_launched = Nukes::launch();
let x: u64 = unsafe { MaybeUninit::uninit().assume_init() }; // UB
nuke_being_launched.cancel();

This is supposed to launch the nukes and then immediately cancel them. However, the creation of the uninitialized u64 is UB, so the compiler can assume it never happens, and in particular that anything after it doesn't happen, either. Therefore, it might feel free to remove the entire cancellation mechanism and just allow launching the nukes anyway.

2 Likes

I'm afraid that in all but a few trivial cases the problem is undecidable. But in those cases where the compiler can prove it, that's fine. That's why we should very carefully analyze what exactly should count as UB.

Producing a null reference would be equivalent to producing an invalid value that is not an inhabitant of its type. I agree with @kpreid and @H2CO3 on this. So what about the cases where all values are guaranteed to be inhabitants of their type, e.g. having an uninitialized integer or string? Can you show why this could lead to UB?

Uninitialized values are an abstract concept: even if all bit patterns are legitimate values, the concept has real consequences for optimization. For example, we would like this to avoid writing memory at all:

let array = [std::mem::MaybeUninit::<i32>::uninit(); 1000];

Semantically, if we froze the "value" of that uninit() call, then it would have to be copied to every element of that array initialization. But if we leave it truly uninitialized, then each element has whatever garbage value was already in that space on the stack, and array[0] may have different bits than array[1], etc.

5 Likes

This is exactly what the article that @RustyYato posted above is about. Read the "What is uninitialized memory?" section if nothing else.

3 Likes

No; UB is an operational concept. If execution passes through an operation which is UB, the program has UB. If execution does not, the program does not.

If dead code causing UB invalidated the program, then it would be impossible to use e.g. unreachable_unchecked at all.
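
For instance, this kind of use is sound exactly because the UB operation is never executed (last_digit is a made-up example):

fn last_digit(x: u32) -> u32 {
    let d = x % 10;
    if d > 9 {
        // This branch is dead: d is always in 0..=9. Declaring it
        // unreachable is sound because execution never gets here.
        unsafe { std::hint::unreachable_unchecked() }
    }
    d
}

fn main() {
    assert_eq!(last_digit(1234), 4);
}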

The insidious part is that UB time travels, the most common case being the compiler marking the UB branch as dead and removing it, so that when tracing the built machine executable it appears the UB branch wasn't taken. But on the Abstract Machine the UB branch was taken, so no behavior of the output machine binary is guaranteed.

Any optimizations which violate the property that unexecuted UB does not change the behavior of a defined program are unsound, plain and simple. From the model of the Abstract Machine, the only property of UB is that it doesn't happen at the responsibility of the author of the code.

One simple-to-understand reason is that, given an uninitialized byte x, it can be the case that both x == VAL and x != VAL hold. This is a nice example because it can happen on actual hardware; the canonical example is MADV_FREE causing reads of uninitialized data to return nondeterministic values.

Bytes on the Rust Abstract Machine are not just some value 0..=255. If you haven't already, please read the articles linked upthread.

5 Likes

Not at all. Or, rather, most real-world programs are those "trivial" cases. Proving properties of real-world programs is mostly a question of how much effort (and resources) the compiler, and compiler authors, are reasonably expected to expend, not of any theoretical impossibility result. After all, all real-world programs are merely finite state machines and perfectly decidable!

1 Like

Thank you all for the explanations!

There is, however, one last argument that comes to my mind. I was not aware that LLVM can write to uninitialized memory in order to reuse registers/memory, and that this can indeed lead to nondeterminism. However,

So let me state it this way:

Producing values that are not inhabitants of their type leads to UB, whereas reading uninitialized memory (such that all possible bit patterns are inhabitants of the type) is non-deterministic but not UB.

Is there anybody who could disprove this statement? And let's also say that rustc doesn't artificially force it to be UB, like @H2CO3 said.

Is there a solid reason for it still being UB?

This is completely wrong. If by finite state machines you mean finite state automata, then pushdown automata are not FSA. If you use the term in a broader sense that refers to all machines at all levels of the Chomsky hierarchy, then Turing machines are "not decidable" (whatever that might mean; set membership problems can be (un)decidable, and models of computation decide (non)equivalent classes of problems). The halting problem for Turing machines is recursively enumerable but not decidable. And most problems in formal methods and program verification are undecidable (finding formal proofs is recursively enumerable).

That is, of course, not true. UB is whatever the language defines as UB. For example, the C standard defines as UB "a nonempty source file does not end in a new-line character ..", "a character not in the basic source character set is encountered in a source file .." and "the program declares or defines a reserved identifier ..". It also includes such non-operational concepts as invalid linkage specifiers, invalid extern function prototypes, and even unknown pragmas (which GCC famously exploited for an easter egg).

unreachable_unchecked is operational UB as you define it, but e.g. producing an invalid value is not. The compiler is fully within its rights to optimize the following code to ret 0, even though the UB operation is statically known to never happen:

pub unsafe fn foo(x: u32, _y: u32) -> u32 {
    if false {
        // dead code, but it "produces" a value at type bool from a u32's bytes
        let _b: bool = *(&x as *const u32).cast();
    }
    if x < 2 { 1 } else { 0 }
}

It's just a matter of running the propagation of constraints from types before dead code elimination. It's unlikely to happen in this case for performance reasons (dead code elimination basically always runs first), but it can easily happen in more complicated cases.

That doesn't make the optimizers's job undecidable -- it would be undecidable if it had to produce the optimal code (in some sense), but nobody claims that it does or that it has to.

As an aside, I'm not convinced Rust is Turing-complete: there is no easy way to simulate an infinite tape in Rust, since Rust pointers can only address a finite number (usize::MAX + 1) of bytes. I'm more convinced that Python is (under the right abstraction) Turing-complete, but not so much for Rust.

That's not true. There is never UB in that code and the compiler can't optimize it to return 0. It has to return 1 if x < 2, and 0 if x >= 2.

4 Likes

If the contract between the programmer and rustc/LLVM states that reading an uninitialised variable is undefined behaviour, then the behaviour isn't "non-deterministic", it's not defined.

As an analogy, you could say that ∞/∞ is "non-deterministic" because sometimes it can be 1 and other times it can be 3.14159, or any other number, depending on how you calculate it. But that's not correct: as per the rules mathematics follows, the division is undefined, and it doesn't make sense to talk about the result at all.

5 Likes

The Rust abstract machine defines bytes to have 257 different values: one for each bit pattern, and another value for "uninitialized" (plus possibly more values for pointer provenance, but let us put that aside). Furthermore, the uninitialized value is defined not to be in the domain of integers, so such a value is not an inhabitant of the integer types.

To see that this is true UB, and not just non-deterministic, consider this example:

// Compile me with release mode!
use std::mem::MaybeUninit;

fn always_returns_true(x: u8) -> bool {
    x < 120 || x == 120 || x > 120
}

pub fn make_true() -> bool {
    // Just claim we initialized this even though we did not.
    let x: u8 = unsafe { MaybeUninit::uninit().assume_init() };
    always_returns_true(x)
}

fn main() {
    assert!(make_true());
}

Output:

thread 'main' panicked at 'assertion failed: make_true()', src/main.rs:14:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

playground, source

To show that this is not an issue specific to rustc, consider the same example in C++ also returning false. (The xor instruction creates the value "false".)

16 Likes