UB Questions! What exactly is an "Allocated Object"?

I've been on a quest lately to reduce reliance on undefined behavior (UB) in the Rust dependencies I use. This led me to the following mysterious line in the documentation for pointer.offset():

Both the starting and resulting pointer must be either in bounds or one byte past the end of the same allocated object. Note that in Rust, every (stack-allocated) variable is considered a separate allocated object.

I haven't heard this notion of an "allocated object" in any material on Rust before, so this unleashed a torrent of questions for me immediately. For instance:

  • How is memory on the heap divided into "objects" - is it per call to malloc, or something else?
  • How does the notion of object boundaries work for elements in an array? Is each element an "object", or is the whole array?
  • Is it only allocations from Rust that are counted as "allocated objects", or does the rule apply to allocations from C as well?
  • If so, what is considered the object boundary for C structs with a flexible-array-member?
  • Is there somewhere I can look to tell if object boundaries are being crossed when offsetting pointers - for instance, inspecting generated MIR, LLVM IR, or memory during debugging?

The motivation for this investigation comes from this issue on the coremidi_sys crate.

In this particular case, I need to do some math on a pointer that I'm handed from a C callback, complicated by the fact that the pointed-to struct has a flexible array member (so the pointer offset will need to exceed the size of the struct that Rust knows about in some cases :sweat_smile:). My instinct is to just cast the pointer to an integer like the pointer.wrapping_offset() docs suggest and do the math there, just to be sure that UB is avoided. But I'd much rather understand whether that is really necessary before lowering the abstraction level of the code.

So that turned out to be a million questions, and I don't expect anyone to answer them all. But if anyone has any insights to share or resources that could be used to learn more about this, that would be much appreciated :pray::pray:

(also posted on Reddit)


When you have a raw pointer, what it can access depends on how you originally created the raw pointer, with casts and offsets not changing that region.

  1. If created from a reference, it can access exactly the things that the reference could reach.
  2. If you got it from the allocator, it can access that and exactly that (heap) allocation.

This is more or less enough to deduce the rest of the rules. E.g. for stack objects, you always get the raw pointer by first creating a reference, which limits the raw pointer to that specific value on the stack.
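As a rough illustration of rule 2, here is a minimal sketch in which the pointer comes straight from the allocator, so offsetting it anywhere inside that one allocation is fine. The function name `sum_via_allocator_pointer` is made up for this example.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

/// Hypothetical helper: allocates room for `n` u32 values, fills them with
/// 1..=n through the allocator-provided pointer, and returns their sum.
fn sum_via_allocator_pointer(n: usize) -> u32 {
    assert!(n > 0, "`alloc` must not be called with a zero-sized layout");
    let layout = Layout::array::<u32>(n).expect("layout overflow");
    unsafe {
        // `base` was handed out by the allocator, so it may access the
        // whole allocation described by `layout`, and nothing else.
        let base = alloc(layout) as *mut u32;
        if base.is_null() {
            handle_alloc_error(layout);
        }
        for i in 0..n {
            // Every `add(i)` with i < n stays inside the same allocation.
            base.add(i).write(i as u32 + 1);
        }
        let mut total = 0;
        for i in 0..n {
            total += base.add(i).read();
        }
        dealloc(base as *mut u8, layout);
        total
    }
}

fn main() {
    println!("sum = {}", sum_via_allocator_pointer(4));
}
```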

Pointers in C should be considered the same as raw pointers. For pointers created in C, consult the C specification (i.e. if it isn't UB to access that address in C with the pointer, it's also OK in Rust).

This means that if you get a pointer to a field by doing &value.field, then you got it from a reference, limiting the pointer to just that field. Similarly, if you got it by offsetting and casting a pointer to the whole value, then the pointer is still valid for the rest of the fields.
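Applied to the flexible-array-member case from the question, a sketch could look like the following. The `Packet` struct and the `payload` and `demo_payload` functions are hypothetical stand-ins (this is not the coremidi_sys API), and a Rust-side buffer plays the role of the allocation that C would normally hand over. The point is that the data pointer is derived from the struct pointer without ever creating a reference to the zero-length field.

```rust
use std::ptr;

/// Hypothetical stand-in for a C struct with a flexible array member:
/// `struct packet { uint16_t len; uint8_t data[]; };`
#[repr(C)]
struct Packet {
    len: u16,
    data: [u8; 0],
}

/// Copies the trailing payload out of a packet.
///
/// # Safety
/// `packet` must be valid for reads of the header plus `len` payload bytes,
/// i.e. it must have been derived from a pointer to the whole allocation.
unsafe fn payload(packet: *const Packet) -> Vec<u8> {
    unsafe {
        let len = (*packet).len as usize;
        // `addr_of!` yields a raw pointer without going through `&`, so the
        // result keeps the region of validity of `packet` itself.
        let data = ptr::addr_of!((*packet).data) as *const u8;
        std::slice::from_raw_parts(data, len).to_vec()
    }
}

/// Simulates the C side with a u16-aligned buffer: a 3-byte payload after
/// the 2-byte header.
fn demo_payload() -> Vec<u8> {
    let mut storage = [0u16; 4];
    // Pointer to the entire buffer, never narrowed through a field reference.
    let base = storage.as_mut_ptr() as *mut u8;
    unsafe {
        (base as *mut u16).write(3); // header: len = 3
        for (i, byte) in [10u8, 20, 30].iter().enumerate() {
            // The payload starts right after the u16 header.
            base.add(std::mem::size_of::<u16>() + i).write(*byte);
        }
        payload(base as *const Packet)
    }
}

fn main() {
    println!("{:?}", demo_payload());
}
```

Writing `&(*packet).data` instead would go through a reference to a zero-length array, which by the rule above would limit the resulting pointer to zero bytes.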

Regarding mutation, it can mutate something if and only if the raw pointer originally came from a mutable reference or the allocator.

Note that converting a raw pointer into a reference and then converting the reference back can indeed result in a smaller region of validity.


I believe this should be a "logical", rather than "physical" distinction. For example, a couple of months ago, a question about unsafe code clarified that it is undefined behavior to do the following:

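let vec: Vec<u32> = vec![1, 2, 3, 4];
// Derived from a reference to a single element, so it may only access vec[0].
let first_ptr = &vec[0] as *const u32;
// `offset` is an unsafe fn; the arithmetic itself stays within the buffer.
let last_ptr = unsafe { first_ptr.offset(3) };
// UB: reads vec[3] through a pointer that is only valid for vec[0].
let last = unsafe { *last_ptr };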

This is because, even though first_ptr physically points to the beginning of the vector, it was created from a reference to a single element, and thus it's illegal to use it for accessing other elements of the vector. Yet the vector is required to allocate its buffer as a single contiguous block of memory, so as a consequence, I think "allocated object" can't possibly refer to "per call to the allocator" – it would be insufficient.
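For contrast, here is a minimal sketch of a variant that should be fine under the same rules: the base pointer comes from as_ptr() on the whole slice rather than from a reference to one element. The helper name `last_element` is made up for this example.

```rust
/// Hypothetical helper: reads the last element through a pointer that was
/// derived from the whole slice, not from a single-element reference.
fn last_element(values: &[u32]) -> Option<u32> {
    if values.is_empty() {
        return None;
    }
    // `as_ptr` is derived from the slice reference, which covers every
    // element, so the pointer may reach all of them.
    let base: *const u32 = values.as_ptr();
    // In bounds: the index is at most `len - 1`.
    Some(unsafe { *base.add(values.len() - 1) })
}

fn main() {
    let vec: Vec<u32> = vec![1, 2, 3, 4];
    println!("{:?}", last_element(&vec));
}
```

One way to check such code is to run it under Miri (`cargo miri run` on a nightly toolchain), which should report the &vec[0] version and accept this one.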


Yes, this is exactly the thing I was trying to highlight with

If created from a reference, it can access exactly the things that the reference could reach.

and

Note that converting a raw pointer into a reference and then converting the reference back can indeed result in a smaller region of validity.


Hey @alice, thank you so much for the excellent breakdown of the rules! And thank you @H2CO3 for the clarification. This was super helpful :clap:


Small follow-up here. I've been using MaybeUninit.first_ptr_mut lately in one of my projects (on an owned Vec<MaybeUninit<u8>>). After learning about these rules I got worried that the pointer arithmetic I was doing on that pointer to get at the nth element could be UB, but the documentation does not tell me whether the pointer is valid for the entire region covered by the slice. Checking the source of the function (luckily this is possible) does however reveal that a pointer is created to the entire slice and then just cast to a pointer to a single item, so I assume this means it's OK to do things like ptr.add(5) to get at the 6th element. Should the documentation perhaps say that this is the case?

Yeah, that is a bit ambiguous, but I think it is fine to assume that the raw pointer returned by functions like that is valid for the entire array.

The key to distinguish &buf[0] as * {const|mut} _ from buf.as_...ptr() is to think about the empty slice case: the former goes out of bounds and thus cannot succeed (either panicking or UBing), whereas the latter is well defined (an allocation can span across 0 bytes and thus have an associated ptr).
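The empty-slice argument can be sketched like this. The helper names are made up for the example, and catch_unwind is only used to observe the panic without aborting the program.

```rust
use std::panic;

/// Goes through a reference to element 0, so it has to bounds-check.
fn first_via_index(buf: &[u8]) -> *const u8 {
    &buf[0] as *const u8
}

/// Derived from the slice as a whole; defined even when the slice is empty.
fn first_via_as_ptr(buf: &[u8]) -> *const u8 {
    buf.as_ptr()
}

/// Returns true if indexing an empty slice panicked.
fn index_panics_on_empty() -> bool {
    panic::catch_unwind(|| first_via_index(&[])).is_err()
}

fn main() {
    let empty: Vec<u8> = Vec::new();
    // Well defined: a non-null pointer that must not be dereferenced.
    let p = first_via_as_ptr(&empty);
    println!("as_ptr on empty slice is null: {}", p.is_null());
    println!("indexing empty slice panicked: {}", index_panics_on_empty());
}
```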

Given that the documentation of MaybeUninit.first_ptr_mut() does not mention a panic or unsafety when used on an empty slice, it cannot go through a single-element reference. Technically it could still subslice the slice to reduce the provenance by special-casing non-empty slices, so the documentation needs to be improved to guarantee that it does not do so. If this bothers you, in the meantime you can use .as_mut_ptr().cast::<T>(), which has unambiguous semantics.
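A minimal sketch of that workaround on a Vec<MaybeUninit<u8>>, assuming the goal is to write the 6th element. The names `write_sixth` and `demo_sixth` are made up for this example.

```rust
use std::mem::MaybeUninit;

/// Hypothetical helper: writes `value` into index 5 and reads it back,
/// using a pointer derived from the whole slice.
fn write_sixth(buf: &mut [MaybeUninit<u8>], value: u8) -> Option<u8> {
    if buf.len() < 6 {
        return None;
    }
    // `as_mut_ptr` covers the entire slice; `cast` only changes the pointee
    // type (MaybeUninit<u8> has the same layout as u8), not the region.
    let base: *mut u8 = buf.as_mut_ptr().cast::<u8>();
    unsafe {
        // In bounds because of the length check above.
        base.add(5).write(value);
        // Reading is fine now: this byte has just been initialized.
        Some(base.add(5).read())
    }
}

fn demo_sixth() -> Option<u8> {
    let mut buf = vec![MaybeUninit::<u8>::uninit(); 8];
    write_sixth(&mut buf, 42)
}

fn main() {
    println!("{:?}", demo_sixth());
}
```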

