I've been taking closer look at
https://doc.rust-lang.org/reference/behavior-considered-undefined.html
It is rather informal and misleading. It seems to me as though I noticed logical inconsistency in what is defined as UB and we probably should fix that page (or make it less ambiguous).
First of all, let us make clear the distinction between undefined and non-deterministic behavior. If I have a function f
defined on domain A
and attempt to apply b∉A
to f
then the output f(b)
is undefined. If f
is not a function but a relation then the output f(a)
is non-deterministic.
EDIT2:
Ok, so uninitialized values are UB and apparently
Then what about
src/alloc/raw_vec.rs.html#131
src/libc/unix/mod.rs.html#583
Formally, those things allocate memory and do not use MaybeUninit type. Hence they produce values that are not inhabitants of their type and thus are UB
Isn't half of stdlib implementation wrong then? Shouldn't RawVec<T>
use MaybeIUninit<T>
? Should IoSliceMut
use MaybeUninit<u8>
? Let's be truly rigorous here. How do you formalize this to be UB
pub fn test() -> MaybeUninit<i32> {
let Y = MaybeUninit::<i32>::uninit();
unsafe { Y.assume_init() }
}
while at the same time, Vec::with_capacity and file::read are apparently not UB? So again, I'm modifying my question a little.
END OF EDIT 2
(OLDER) EDIT 1:
I got many great explanations and I want to modify my question a little. So let make the following statement
Producing values that are not inhabitants of their type leads to UB, whereas reading uninitialized memory (such that all possible bit patterns are inhabitants of the type) is non-deterministic but not UB.
Is there anybody who could disprove this statement? And let's also say that rustc
doesn't artificially force it to be UB, like @H2CO3 said.
END OF (OLDER) EDIT 1
Second, what is memory? Let us use the formal model of memory https://compcert.org/doc/html/compcert.common.Memory.html. In short, memory m
is defined as a set of memory blocks b
. Each block has a specific length and can be treated like an array *mut T
holding data of specific type. Pointer is a tuple (b,ofs)
where ofs
is an integer representing offset into b
. Pointers can access memory with load
function
load chunk m b ofs - performs a read in memory state m, at address b and offset ofs. It returns the value of the memory chunk at that address
The important definition of valid access is
valid_access m chunk b ofs p holds if a memory access of the given chunk is possible in m at address b, ofs with current permissions p. This means:
- The range of bytes accessed all have current permission p.
- The offset ofs is aligned.
Hence load
is a function defined on domain range_perm m b ofs
(this is essentially the length of the memory block m
). Naturally undefined behavior is when we apply ofs
to load
that exceeds the defined domain of this function.
I dislike the working in this paragraph Rust reference
Producing an invalid value, even in private fields and locals. "Producing" a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation. The following values are invalid (at their respective type):
Ok, so what does "invalid value" mean? As we are in strongly typed language, I presume that a
is and invalid value of type A
if a
is some pattern of bits that could not be constructed using the defined constructors of A
. In order words such a
is not an inhabitant of type A
and applying it to any function that accepts A
is undefined. The wording "Producing an invalid value" is unfortunate, because just producing (writing) the value, without using it (not even calling drop
), cannot lead to undefined behavior. This is just minor detail though. The next issue is much significant.
Examples like "A null fn pointer" or " A discriminant in an enum
not included in the type definition" make perfect sense. The example
- An integer (
i*
/u*
), floating point value (f*
), or raw pointer obtained from uninitialized memory, or uninitialized memory in astr
.
Makes no sense according to out definition of "invalid value" because every bit pattern (of the right length) if a valid integer, float or string that could in principle be constructed. Those bit patterns are indeed inhabitants of their respective types and can be applied to any function that accepts that type.
Moreover, what does "uninitialized memory" formally mean? If I do
let arr = unsafe{Box::<[u32]>::new_uninit_slice(n).assume_init()};
println!("{:?}",arr);
is this undefined behavior? What if I do
let arr = unsafe{Box::<[u32]>::new_uninit_slice(n).assume_init()};
read_from_some_file(&mut arr);
println!("{:?}",arr);
The first one is clearly non-deterministic, the second one deterministic but the behavior in both is actually well defined. Some people are confused and they think that memory becomes "valid" only after you write to it. They like of thinking is as follows.
There is a logical predicate INIT(x)
and when a new memory a:*mut A
is allocated a = alloc()
it starts with INIT(a)==false
. The function write(x,y)
writes value y
to memory x
and it is the only function that does not have INIT(x)
as its precondition but has INIT(x)
as postcondition. The function read(x)
has INIT(x)
as a precondition. With such system of axioms it is impossible to read a memory location before writing to it first. People seem to believe that rust imposes INIT(x)
as a precondition on all functions (except write
) or even that INIT(x)
is a compiler invariant (whatever that might mean, since clearly INIT(x)
cannot be satisfied at the moment of allocation of x
).
Anybody who does not see why this belief is wrong and misleading should look at the following two problems.
(1) First, what good is INIT(x)
for? Such a predicate is useless from formal point of view, because it does not imply anything useful. All it tells you is that some value was written to x
but doesn't say what that value is. You can't use it to prove any interesting properties and it makes no difference whether you include it as a precondition on your function or not.
EDIT:
So MaybeUninit
is literally the predicate INIT
. The point still stands. It's quite pointless in proofs. Everything would be much simpler if we got rid of 257th bit in Abstract Rust Machine. It's hard to see a formal justification for its existence.
END OF EDIT
(2) Second, it is impossible to compiler to decide. Look at the example with read_from_some_file
above. Suppose it is a call to UNIX that gives control to the system and the system populates arr
. How could compiler know that the system actually wrote anything? This can't be decided on compile-time (or even on runtime). So the compiler must then resort to assuming INIT(arr)
as an invariant. So if I then read(arr)
the behavior is very well defined. The language can't crash, panic or do anything unexpected. It must return the value stored in arr
, even if there was a malicious bug in read_from_some_file
that caused arr
to be "uninitialized". Therefore this line of thinking makes no sense.
And finally I have one last issue with Rust Reference. We should not confuse violating data type invariants with undefined behavior. Take the example
struct Rational{
numerator:u32,
denominator:u32,
// invariant: denominator!=0
}
Of course denominator cannot be 0
. I could avoid it with
fn new_rational(n:f32,d:f32)->Rational{
assert_ne!(d,0);
Rational{numerator:n, denominator:d}
}
but there are lots of carefree programmer who do not bother making such assertions. Lots of people introduce bugs by not checking their invariants and producing "invalid values" even without unsafe code. So I would say the whole section
Producing an invalid value, even in private fields and locals. "Producing" a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation. The following values are invalid (at their respective type):
should be removed. Violating data structure invariants is not undefined behavior. It leads to bugs but not UB. For example, having a null fn pointer is not UB. Dereferencing invalid memory blocks is UB and null pointer just happens to be invalid memory block. But having a null fn pointer and working with it is not UB. Same goes for all other examples.