Is this function UB?

Hi, I happened to read a function in a crate that seems UB, I want to be sure before opening an issue.
This is a slightly simplified version of the code:

unsafe fn insert<T>(vec: &mut Vec<T>, index: usize, v: T) {
    use std::ptr;

    if vec.len() <= index {
        let delta = index + 1 - vec.len();
        vec.reserve(delta);
        vec.set_len(index + 1);
    }
    ptr::write(vec.get_unchecked_mut(index), v);
}

To me there are two issues, set_len and get_unchecked_mut.

  • set_len's contract asks "The elements at old_len..new_len must be initialized." which isn't the case here
  • even though get_unchecked_mut is cast to *mut T it's still returning &mut T and should point to initialized memory

What do you think?

From https://doc.rust-lang.org/std/vec/struct.Vec.html#method.reserve

Reserves capacity for at least additional more elements to be inserted in the given Vec<T>. The collection 
may reserve more space to avoid frequent reallocations. After calling reserve, capacity will be greater 
than or equal to self.len() + additional. Does nothing if capacity is already sufficient.

So, the problem I see here is that the amount being reserved extra is only 1. What if the size of T is greater than 1? Then, yes, this would be UB it appears

Unless reserve automatically takes into account the size of T?

The issue is whether the vec in the undefined state, between set_len and write, can be observed or dropped. Here there's no panicking operation, early return, etc between the two, and since we have a &mut we know nobody else (i.e. another thread) can see what we're doing. So this looks safe to me in isolation.

It's hard to evaluate one unsafe function outside the context of its module. The operation this function is performing is sketchy, as it's overwriting an element without dropping it. But that might be used safely in the module.

Yes. It isn't in bytes.

2 Likes

Thanks, good to know

AFAIK it's not yet decided whether references have to be valid:

3 Likes

Yes on both accounts.

  • Unless you make it part of the contract of your function that index must not be greater than the current len (i.e. you can insert a new element at the end, but not 2 past the end), then the call to set_len is in clear violation of the contract of the function because the elements in old_len..index are uninitialized. (edit: actually, the contract is still violated even when index = old_len...)

    One could argue that, given that the implementation of set_len is just self.len = new_len, the call is fine so long as those elements are initialized before any unwinding can occur. To which I say: So what? The documented contract of the method is easily upheld by waiting until all elements are initialized before calling the method... so why make things difficult?

  • &mut T to uninitialized memory is a gray area best avoided, and a great deal of it is being eliminated from the standard library as we speak. Maybe Vec could use a fn write_raw(&mut self, index: usize, value: T) helper function that does not drop the old data.

4 Likes

There is no contract on the function but as far as I know the Vec is supposed to stay in a "somewhat" filled state with uninitialized holes.

@cuviper Oh! Good to know thank you

I found this comment from 3 years ago while trying to know if &mut T has to be initialized and all "gray valid" places pointed out are now replaced, that's great.

1 Like

To nitpick, Vec's element containing len initialised elements is a safety invariant and not a validity invariant*, meaning there is no insta-UB when doing that call, even though the Vec is now holding extra uninitialized elements.

This becomes an issue if this unsafe-but-"valid" state gets to be witnessed by any code outside the one in scope. There are mainly three places where this can occur:

  1. if the state is not fixed before the function returns; this is the case of the code shown in the OP, it will thus very probably lead to UB triggering on the caller's code (e.g., when dropping the Vec if ptr::needs_drop::<T>()).

  2. If arbitrary code is called before the function "returns", e.g., if invoking a given closure, or some trait method of T. This can be sneaky since it can happen on a Drop, which in turn can be called early if a panic! occurs. It is not the case in the code shown in the OP (it would have been, for instance, if there was a println! before the ptr::write call! Or if instead of index, the expression index - 1 + 1 had been used, since there could have been a panic! for index = 0 on underflow), since get_unchecked_mut (with a valid index) and ptr::write cannot panic!.

  3. the very code of the function leads to witnessing an invalid state. This is the case of the code shown in the OP, when they get a &mut T at vec.get_unchecked_mut(index) knowing that the pointee has not been initialized yet. As @cuviper mentioned, these rules may get loosened a bit in the future (specially when T is an integer or floating-point type), but until this loosening happens (if ever), this is Undefined Behavior in current Rust.


* Safety vs. Validity invariant

This is best seen with an example. Take the following code:

use ::core::mem;

fn sound (s: &mut str)
{
    // No matter the input, the function is sound to call
    // => no need to mark the function `unsafe`
    unsafe {
        match s.as_bytes_mut().last_mut() {
            | Some(at_last_byte) => {
                let prev = mem::replace(at_last_byte, 0xa0);
                // Danger zone: s points to non-UTF8 bytes
                *at_last_byte = prev;
                // Good, we have re-established the invariant
            },
            | _ => {},
        }
    }
}

unsafe // "may" be UB to call => functions needs to be unsafe
fn unsound (mut s: &str)
{
    let prev = mem::replace(&mut s, mem::zeroed());
    s = prev;
}

Now, both functions do the "same" thing: they break an invariant of the type &str, and then restore it back before anything else happens. The issue lies with the kind of invariants that we are dealing with:

  • a &str (and more generally, &T for any type T), is a Rust reference, which has to be non-null. The compiler may rely on this invariant (and it does!).

    ⟹ It is a validity invariant.

  • a &str is a reference to a slice of bytes (i.e., a &[u8]), with the added property that these bytes must form a valid UTF-8 sequence. Library / programmer code may rely on this invariant, but not the compiler (e.g., the invariant is too complex for the compiler to exploit it).

    ⟹ It is a safety invariant.

The function unsound breaks the non-null validity invariant, thus instantaneously triggering UB, so by the time the null-pointer is erased it is too late: validity invariants must never be broken, not even for an instant!

The function sound breaks the "valid UTF-8" safety invariant, but fixes it right away before any code or anything (besides the compiler) witness this state. And since the compiler does not assume anything regarding the UTF-8 validity, no harm is done: safety invariants can be temporarily broken so as long as only code in scope deals with the broken state, provided this code does not trigger UB in so doing.


Back to Vec, one can see Vec as:

/// # Safety invariant:
///
///   - `raw_buffer[.. len]` holds valid values of type `T`
pub
struct Vec<T> {
    raw_buffer: Box<[MaybeUninit<T>]> // ptr, capacity
    len: usize,
}

impl<T> Vec<T> {
    /// # Safety
    ///
    ///   - the safety invariant must not be broken
    pub
    unsafe
    fn set_len (self: &'_ mut Self, len: usize)
    {
        self.len = len;
    }

    pub
    fn as_mut_slice (self: &'_ mut Self)
      -> &'_ mut [T]
    {
        unsafe {
            // # Safety
            //
            //   - the safety invariant guarantees that this is sound
            ::core::slice::from_raw_parts_mut(
                self.raw_buffer
                    .as_mut_ptr() // : *mut MaybeUninit<T>
                    as *mut T
                ,
                self.len,
            )
        }
    }
}

impl<T> Drop for Vec<T> {
    fn drop (self: &'_ mut Self)
    {
        unsafe {
            // Safety: from safety invariant: we own at least `len` elements
            ::core::ptr::drop_in_place(self.as_mut_slice());
        }
    }
}

So, as you can see, calling .set_len() does not break anything w.r.t. compiler assumptions, as we are just modifying a usize field of some struct. We are however, breaking the safety invariant of the Vec type, which is used to soundly perform other unsafe operations, such as with as_mut_slice casting/transmuting a slice of MaybeUninit<T> to a slice of T, by using and trusting the value of its len field.

In the example of the OP, as_mut_slice was called by the DerefMut impl of Vec, which was implicitly called by . (dot operator) auto-deref. So the funny thing is, no matter what the argument of that .get_unchecked_mut call was (that is, it could have been some value < old_len), it would still have been UB due to temporarily creating a &mut [T] of len index + 1, thus containing uninitialised Ts in it, which (currently) is a violation of a validity invariant and thus instant UB.

6 Likes