Is it UB to read uninitialized integers in a Vector? Why?

Take this simple program:

fn main() {
    let length = 4;
    let mut vec = Vec::<u8>::with_capacity(length);
    unsafe {
        vec.set_len(length);
    }
    dbg!(vec);
}

The values of the vector are uninitialized but by unsafely setting the length of the vector we can get dbg! to read them anyway. Is this Undefined Behaviour?

From reading the MaybeUninit documentation I get the impression that it is:

Moreover, uninitialized memory is special in that the compiler knows that it does not have a fixed value. This makes it undefined behavior to have uninitialized data in a variable even if that variable has an integer type, which otherwise can hold any fixed bit pattern:

But maybe Vec has more guarantees? The Vec docs say:

There is one case which we will not break, however: using unsafe code to write to the excess capacity, and then increasing the length to match, is always valid.
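For reference, the pattern that sentence describes might look roughly like the sketch below. `fill_spare` is a hypothetical helper name, not something from the Vec docs; it writes every slot of the spare capacity first and only then raises the length.

```rust
/// Hypothetical helper: fill the spare capacity of `vec` with `value`,
/// then extend the length to cover the newly written elements.
fn fill_spare(vec: &mut Vec<u8>, value: u8) {
    // `spare_capacity_mut` exposes the excess capacity as `&mut [MaybeUninit<u8>]`.
    let spare = vec.spare_capacity_mut();
    let added = spare.len();
    for slot in spare.iter_mut() {
        slot.write(value);
    }
    // SAFETY: every element in old_len..old_len + added was written above,
    // and old_len + added equals the capacity, so it does not exceed it.
    unsafe {
        let new_len = vec.len() + added;
        vec.set_len(new_len);
    }
}
```

Calling it on a vector created with `Vec::with_capacity(4)` leaves a fully initialized vector whose length equals its capacity, so reading it afterwards is fine.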

Ok, but what if we don't actually write to the excess capacity? set_len says this:

The elements at old_len..new_len must be initialized.

So that seems pretty clear. They must be initialized without exception. But wait! The example given says you can use an FFI function call to initialize the values. So we can simply write:

fn main() {
    let length = 4;
    let mut vec = Vec::<u8>::with_capacity(length);
    unsafe {
        ffi_initializer(vec.as_mut_ptr(), length);
        vec.set_len(length);
    }
    dbg!(vec);
}

But the compiler may know nothing about ffi_initializer and what it might do. With dynamic linking the function may not even exist at the time I compile this. So how does the compiler know which elements are or are not initialized?

From the compiler's point of view, surely my first and last programs are the same? Or does it assume any FFI call may set the values? Have I got confused somewhere?

I understand that exposing uninitialized bytes is undesirable for a lot of reasons, but I'm not understanding if or how it's UB.


Yes, from the compiler's point of view, they are the same. But you used an unsafe block, which means you are responsible for the correctness of the code.

In this case it's the difference between validity and safety invariants. The compiler knows about validity invariants, and will exploit them for profit. Some examples are: references must be aligned, niches on types, and exclusive references are not aliased by unrelated pointers/references. Breaking a validity invariant is instant UB, and is always wrong.

Safety invariants are created and defined by libraries, and the compiler never directly exploits these. However, the libraries may exploit these for profit. For example: String may only contain valid UTF-8, the Send/Sync traits, and the first len elements of a Vec must be initialized.

Breaking safety invariants may not immediately invoke UB. In fact, if you are the author, you will likely be breaking the safety invariants internally, but as long as that is not exposed to a safe interface, it's fine. If you are not the author, and the type in question has strong guarantees, then you may be allowed to temporarily break the safety invariants.
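To make the distinction concrete, here is a small sketch of a library-defined safety invariant. `Cursor` and its methods are hypothetical names made up for illustration. Storing an out-of-bounds `pos` is not UB by itself (any `usize` is a valid value), but the safe `current` method relies on the invariant, so a later call could be UB.

```rust
/// Hypothetical type with a safety invariant: `pos < data.len()`.
pub struct Cursor {
    data: Vec<u8>,
    pos: usize,
}

impl Cursor {
    /// Rejects empty input so that the invariant can hold from the start.
    pub fn new(data: Vec<u8>) -> Option<Self> {
        if data.is_empty() {
            None
        } else {
            Some(Cursor { data, pos: 0 })
        }
    }

    /// Safe method: wraps around, so the invariant is preserved.
    pub fn advance(&mut self) {
        self.pos = (self.pos + 1) % self.data.len();
    }

    /// Safe method that exploits the invariant to skip the bounds check.
    pub fn current(&self) -> u8 {
        // SAFETY: all safe methods keep `pos < data.len()`.
        unsafe { *self.data.get_unchecked(self.pos) }
    }

    /// # Safety
    /// The caller must ensure `pos < data.len()`; otherwise a later call
    /// to `current` reads out of bounds.
    pub unsafe fn set_pos_unchecked(&mut self, pos: usize) {
        self.pos = pos;
    }
}
```

The compiler knows nothing about the `pos < data.len()` rule; only the library's own unsafe code depends on it, which is what makes it a safety invariant rather than a validity invariant.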

But this is not the case with Vec. Vec clearly states that you must initialize the first len elements before calling set_len. Failing to do so breaks Vec's safety guarantees and gives Vec permission to invoke UB.

Having said all this, the question: "are uninitialized integers fine?" is contentious, and unanswered. So it is best to treat the answer as no for the time being. Uninitialized integers are UB, unless otherwise specified.


I guess what I'm not understanding is how it could break Vec's safety guarantees in the first case but not in the FFI case. It has no possible way of knowing which case is being invoked so surely it has to assume the FFI case or otherwise it'd break existing code?

Cross-language LTO can remove the "black box" behavior of FFI.

But the underlying issue is that what the compiler sees and what the machine sees doesn't really matter for UB. What the hardware does is not what your program does.

In this case, MADV_FREE can be used to actually observe nondeterminism here. Here's a link on IRLO as well as an actual Facebook bug due to MADV_FREE and uninitialized memory.

The TL;DR is that having an uninitialized integer is UB because we have not defined the behavior of the abstract machine in the face of uninitialized memory typed as an integer. Just because the compiler doesn't (or can't) exploit this today does not mean it can't (or won't) do so tomorrow.


I see, so if I'm understanding this right, MADV_FREE means that reading an uninitialized byte can return zero if the page is not in memory or an arbitrary value if it is. So reading it twice could result in different values. The only way to make it deterministic is for something to write a value to that memory. It could be another application even.

Some other combination of hardware/OS/allocator/compiler/etc could potentially produce other nondeterministic behaviour.

So setting the length of a Vec is fine by itself. But anything that assumes Vec's data is immutable (e.g. shared references) could break in unanticipated ways, hence UB.

And the trick is you don't have to make that assumption; the compiler can make that assumption for you. (By, for example, loading once instead of multiple times, or multiple times instead of once.)
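As a rough illustration of that point (with `reads_agree` being a made-up name), consider a function that reads the same byte twice through a shared reference. For initialized memory this is always true, and an optimizer is permitted to rely on that, for example by loading once and reusing the value. No particular compiler output is claimed here.

```rust
/// Hypothetical helper: reads the same byte twice through a shared reference.
/// Because `&u8` promises the byte does not change while borrowed, the
/// compiler may merge the two loads into one, or fold the result to `true`.
fn reads_agree(byte: &u8) -> bool {
    let first = *byte;
    let second = *byte;
    first == second
}
```

If the byte behind the reference were uninitialized and the page behaved as described for MADV_FREE, the hardware could hand back two different values, while the optimized code would still act as if they were equal.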


The compiler doesn't know which elements are or are not initialized.

You, as many others, seem to assume that UB works like this:

  • compiler actively finds UB in your program
  • compiler goes "Ha! I'm gonna punish this nasty guy by inserting $BUG here:"

However, undefined behavior does not work like this. In reality, it works like this:

  • Compiler assumes you were a good guy
  • Compiler optimizes as if you were a good guy
  • Some of the optimizations are invalid if your code has UB, but the compiler isn't checking for that. Just like the equality A * B = B * A doesn't hold if the multiplication operator isn't commutative.

So you end up with buggy code, not because the compiler actively found out about you violating some specific assumption somewhere, but because there happens to be a mismatch between the assumptions it needed to make and the guarantees your code gave or broke.

So in this specific case, the compiler doesn't need to know which elements are initialized or uninitialized. If you read a value, it assumes that the value is initialized. If in reality it isn't, you have a bug and it may or may not show itself.


This prints nothing in release mode:

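// `std::mem::uninitialized` is deprecated in favor of `MaybeUninit`;
// the lint is silenced here only to demonstrate the UB.
#[allow(deprecated)]
fn main() {
    let a: u32 = unsafe {
        std::mem::uninitialized()
    };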

    if a < 150 {
        println!("small");
    }

    if a > 100 {
        println!("big");
    }
}


This is undefined behaviour in action. The reasoning of the compiler might be: Well I know a is uninitialized, so it is UB to reach the first if, therefore it never happens, therefore I can remove it.


For more info about uninit memory, and how to properly manipulate it in Rust, I suggest you have a look at the documentation of the uninit crate, and maybe even give it a try (disclaimer: I'm the author of that crate).


Thanks everyone, I think I get it now. Forgive me for not marking any single answer as the solution because I think you all gave me a piece of this puzzle.


This one's nuanced. undef is weird -- I strongly suggest reading the LLVM Language Reference Manual, as it talks about common misconceptions.

It's (for now at least) not UB to have an undef u8, but safe functions are allowed to assume they didn't get it, so passing it to anything safe -- such as the formatting in dbg! -- is contraindicated. For example, division is safe even though 1 / undef is UB.


As rustc_codegen_cranelift has been maturing nicely recently, I would personally be more hesitant to attempt to specify Rust UB in terms of LLVM these days. We don't know what the future holds...


To be perfectly clear: we haven't yet defined the behavior for having undef typed at u8. That makes it undefined behavior.

This is "weaker" UB, inasmuch as that is a thing, however, since LLVM IR, which Rust compiles to, does not immediately treat it as UB. Along with that, there's consideration of defining the behavior to be allowed (in part because, as currently implemented, it's operationally defined and used).

Also worth noting, though, is that whether there's allowed to be undefined data in the array behind a &mut [u8] is a distinct question from whether u8 is allowed to be undefined data. Similarly, we haven't yet defined the behavior, but might end up defining it.


Well, it's the kind of thing that official documentation suggested to do in certain situations, so exactly what category it's in is hard to tell.

I certainly agree that today one should be using MaybeUninit<u8> instead.
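A minimal sketch of what that could look like, assuming a fixed-size buffer of four bytes (`init_buffer` is a hypothetical helper name): the storage is typed as `MaybeUninit<u8>` while it may be uninitialized, and only becomes `u8` after every element has been written.

```rust
use std::mem::MaybeUninit;

/// Hypothetical helper: build a `[u8; 4]` from storage that starts out
/// uninitialized, without ever having an uninitialized `u8`.
fn init_buffer(fill: u8) -> [u8; 4] {
    // The uninitialized state is visible in the type.
    let mut buf = [MaybeUninit::<u8>::uninit(); 4];
    for slot in buf.iter_mut() {
        slot.write(fill);
    }
    // SAFETY: every element was written in the loop above.
    buf.map(|slot| unsafe { slot.assume_init() })
}
```

The same idea applies to a Vec: keep the elements as `MaybeUninit<u8>` (for example via `spare_capacity_mut`) until they have actually been written.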

Great dialogue.

Rust is giving everyone a chance to think differently about what quality and reliable code looks like... to the degree we can "hit the mark" every time... that would be great.

Nonetheless, I heard several rationales for allowing the use of uninitialized memory going something like: "as long as it is behind the scenes/under the hood". However, has history not taught us that what one day is "behind the scenes" is guaranteed to be less so over time? I believe this is what happened with SimCity; it was using freed memory "behind the scenes"... then came the next version of the Windows OS. Boom. The OS had to maintain use of two methods for allocating memory and a means to manage the dynamic switch to accommodate one of its most popular games. Ugly and just plain wrong-headed in the first place. I'm sure they had a good reason for thinking it was ok at the time... And thus the trap.

