ABI and MaybeUninit

In an existing thread, @CAD97 makes the following comment:

Would you be able to elaborate what this means in the Rust abstract machine? Does writing 0x02 to an MaybeUninit<bool> just result in an uninitialized value similar to MaybeUninit::uninit(), or is there something more we can say?


So memory itself is untyped, even on the stack. When you write 0x02_u8 to a byte, that byte contains 0x02_u8, and reading it will produce that value.

The potential pitfall arises when you make a typed copy of that value, that is, when you read the memory as a MaybeUninit<bool>.

In RalfJ's MiniRust experimental partial formalization, a typed copy consists of essentially decoding the AM-bytes into the abstract value then encoding that abstract value back to AM-bytes at the new location.

So it's a question of defining the decode and encode steps for bool and for MaybeUninit<bool>.

bool is simple enough (pseudo-Rust, from minirust/values.md at master · RalfJung/minirust · GitHub):

impl Type {
    fn decode(Type::Bool: Self, bytes: List<AbstractByte>) -> Option<Value> {
        match *bytes {
            [AbstractByte::Init(0, _)] => Value::Bool(false),
            [AbstractByte::Init(1, _)] => Value::Bool(true),
            _ => throw_ub!(),
        }
    }
    fn encode(Type::Bool: Self, val: Value) -> List<AbstractByte> {
        let Value::Bool(b) = val else { unreachable!() };
        [AbstractByte::Init(if b { 1 } else { 0 }, None)]
    }
}
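
As a rough, runnable approximation of the pseudo-Rust above (not MiniRust's actual definitions), the decode/encode round trip might be sketched as follows. The assumptions are mine: `AbstractByte` drops the provenance component, `decode_bool`, `encode_bool`, and `typed_copy_bool` are hypothetical names, and `None` stands in for the `throw_ub!()` case.

```rust
/// Simplified abstract byte: initialized with a value, or uninitialized.
/// Assumption: provenance is omitted; MiniRust's `Init` also carries it.
#[derive(Clone, Copy, Debug, PartialEq)]
enum AbstractByte {
    Uninit,
    Init(u8),
}

/// Decode one abstract byte as a `bool`; `None` models the UB case.
fn decode_bool(bytes: &[AbstractByte]) -> Option<bool> {
    match bytes {
        [AbstractByte::Init(0)] => Some(false),
        [AbstractByte::Init(1)] => Some(true),
        _ => None,
    }
}

/// Encode a `bool` back into its single abstract byte.
fn encode_bool(b: bool) -> Vec<AbstractByte> {
    vec![AbstractByte::Init(u8::from(b))]
}

/// A typed copy at type `bool`: decode the source, then re-encode it.
fn typed_copy_bool(src: &[AbstractByte]) -> Option<Vec<AbstractByte>> {
    decode_bool(src).map(encode_bool)
}
```

In this model, a typed copy of `[Init(1)]` yields `[Init(1)]`, while `[Init(2)]` and `[Uninit]` both fail to decode.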

The question then is how MaybeUninit's encode/decode is defined, and how that matches the way rustc lowers it to the concrete machine's ABI.

The simple and desirable definition for #[repr(Rust)] union is that encode/decode just copy the AM-bytes directly.

But unfortunately MaybeUninit<T> is more complicated, because we want to lower it with the ABI of T. The simple answer for decode is then to decode { .init: T }, but if that fails, decode { .uninit: () }. This would then only preserve through a typed copy AM-bytes which are valid for the wrapped type's decode.
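
To make the difference between the two candidate semantics concrete, here is a toy model of a typed copy at MaybeUninit<bool> under each of them. This is only a sketch: `AbstractByte` omits provenance, the function names are hypothetical, and neither function is claimed to be what rustc or MiniRust actually implements.

```rust
/// Simplified abstract byte (assumption: no provenance tracked).
#[derive(Clone, Copy, Debug, PartialEq)]
enum AbstractByte {
    Uninit,
    Init(u8),
}

/// "Bag of bytes" copy: every abstract byte is preserved as-is.
fn copy_bag_of_bytes(src: &[AbstractByte]) -> Vec<AbstractByte> {
    src.to_vec()
}

/// "Try `init: bool`, else `uninit: ()`" copy: bytes that are valid for
/// `bool` survive; anything else decodes as `()` and re-encodes as uninit.
fn copy_init_or_uninit(src: &[AbstractByte]) -> Vec<AbstractByte> {
    match src {
        [AbstractByte::Init(b @ (0 | 1))] => vec![AbstractByte::Init(*b)],
        _ => vec![AbstractByte::Uninit; src.len()],
    }
}
```

Under the second model, the byte 0x02 does not survive the copy: `copy_init_or_uninit(&[Init(2)])` produces `[Uninit]`, whereas the bag-of-bytes copy returns `[Init(2)]` unchanged.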

And some amount of this may be necessary: at a minimum, internal padding need not be preserved at the ABI level, so a MaybeUninit<T> holding such a value cannot preserve the padding bytes if it is passed with the ABI of T.
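
For a concrete picture of internal padding, consider the following sketch. `Padded` is a hypothetical example type, `#[repr(C)]` is used only to make the layout predictable, and the byte counts assume a typical target where u32 is 4-byte aligned. The code deliberately never reads the padding bytes themselves.

```rust
use std::mem::{align_of, size_of, MaybeUninit};

/// Hypothetical example type. With `#[repr(C)]` on a typical target:
/// `a` at offset 0, three padding bytes, then `b` at offset 4.
#[repr(C)]
struct Padded {
    a: u8,
    b: u32,
}

/// Number of bytes in `Padded` that belong to no field.
fn padding_bytes() -> usize {
    size_of::<Padded>() - (size_of::<u8>() + size_of::<u32>())
}

/// `MaybeUninit<Padded>` has the same size and alignment as `Padded`.
fn same_layout() -> bool {
    size_of::<MaybeUninit<Padded>>() == size_of::<Padded>()
        && align_of::<MaybeUninit<Padded>>() == align_of::<Padded>()
}
```

Those padding bytes are the ones an ABI is free not to carry across a call.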

The linked discussion is more about #[repr(Rust)] union but includes some discussion about #[repr(transparent)] union (but more specifically MaybeUninit) as well.


At the moment the following code:

fn main() {
    let two = two();
    println!("{}", unsafe { two.as_ptr().cast::<u8>().read() })
}
use std::mem::MaybeUninit;
fn two() -> MaybeUninit<bool> {
    let mut uninit = MaybeUninit::<bool>::uninit();
    unsafe { uninit.as_mut_ptr().cast::<u8>().write(0x02) };
    uninit
}


passes the Miri check and prints the expected value, 2. However, outside of a union, I wouldn't recommend relying on this. A bool conveniently has the same ABI as a u8, so treating a MaybeUninit<bool> as a MaybeUninit<u8> is fine, but this does not hold for uninitialized memory in general.

In particular, padding bytes are much less safe to move around this way. Even if moving them is well-defined (I'm only mostly convinced that it is), function calls deinitialize the padding bytes, because the behavior of padding bytes is unspecified in every ABI I know of. To treat an arbitrary type as a slab of potentially uninitialized bytes and vice versa, the only safe way to go is a union:

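// Sketch only: on stable Rust, `size_of::<T>()` cannot be used as an array
// length for a generic `T`, and a non-Copy `T` must be wrapped in ManuallyDrop.
union BytesOrT<T> {
    val: T,
    bytes: [u8; std::mem::size_of::<T>()],
}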

Note that it is at least somewhat likely that #[repr(Rust)] unions will get the "bag of bytes" model where doing a typed copy is defined to preserve all of the abstract bytes as-is. (It's the fact that MaybeUninit is ABI-transparent that makes it interesting.)

If they don't get those semantics, and instead have e.g. a bytewise validity union, then that union isn't sufficient to get a byte bag representation: while it does ensure all initialized bytes must be preserved, it does not necessarily make an uninit byte valid for each byte, nor does it require preserving provenance.

I think it's essentially decided that MaybeUninit<u8> is capable of copying any abstract byte while maintaining its full state. As such, if #[repr(Rust)] unions don't just natively have the bag-of-bytes representation, I believe the foolproof way of getting that representation is to union with [MaybeUninit<u8>; size_of::<T>()].
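
A version of that union which compiles on stable Rust might look like the sketch below. Several things here are my assumptions rather than anything settled in this thread: the name `ByteBag`, passing the length as an explicit const parameter `N` (checked at run time against `size_of::<T>()`), wrapping `T` in `ManuallyDrop` so it need not be `Copy`, and marking the union `#[repr(C)]` as a conservative way to pin both fields to offset 0.

```rust
use std::mem::{size_of, ManuallyDrop, MaybeUninit};

/// Hypothetical stable-Rust spelling of the union. `N` is passed explicitly
/// because `size_of::<T>()` cannot be the array length for a generic `T`.
#[repr(C)] // conservative assumption: pins both fields to offset 0
union ByteBag<T, const N: usize> {
    val: ManuallyDrop<T>,
    bytes: [MaybeUninit<u8>; N],
}

impl<T, const N: usize> ByteBag<T, N> {
    fn new(val: T) -> Self {
        assert_eq!(N, size_of::<T>(), "N must equal size_of::<T>()");
        ByteBag { val: ManuallyDrop::new(val) }
    }

    /// View the storage as possibly-uninitialized bytes.
    fn bytes(&self) -> &[MaybeUninit<u8>; N] {
        // SAFETY: any byte state, including uninit, is valid for MaybeUninit<u8>.
        unsafe { &self.bytes }
    }
}

/// Usage: a `u32` has no padding, so all four of its bytes are initialized.
fn u32_bytes(x: u32) -> [u8; 4] {
    let bag = ByteBag::<u32, 4>::new(x);
    let raw: [MaybeUninit<u8>; 4] = *bag.bytes();
    // SAFETY: every byte of a `u32` is initialized.
    raw.map(|b| unsafe { b.assume_init() })
}
```

For a type with padding, the same `bytes()` view would still be fine to hold, but calling `assume_init` on the padding positions would not be.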

However, the following program will not, because of the niche-filling optimization (playground):

fn main() {
    let two = two();
    println!("{}", unsafe { two.as_ptr().cast::<u8>().read() })
}
use std::mem::MaybeUninit;
fn two() -> MaybeUninit<bool> {
    let mut uninit = Some(MaybeUninit::<bool>::uninit());
    unsafe { uninit.unwrap().as_mut_ptr().cast::<u8>().write(0x02) };
    uninit.unwrap()
}

Edit: my bad, there's a missing as_mut(), and the corrected program does actually print 2.

That's actually not why: MaybeUninit<T> explicitly forbids niche optimizations. What's actually happening is that MaybeUninit<T: Copy> implements Copy. The call to unwrap creates a duplicate, and you mutate that, leaving the original unchanged. Adjust your example to operate in place and everything works just fine:

fn main() {
    let two = two();
    println!("{}", unsafe { two.as_ptr().cast::<u8>().read() })
}
use std::mem::MaybeUninit;
fn two() -> MaybeUninit<bool> {
    let mut uninit = Some(MaybeUninit::<bool>::uninit());
    // new:        vvvvvvvvv
    unsafe { uninit.as_mut().unwrap().as_mut_ptr().cast::<u8>().write(0x02) };
    uninit.unwrap()
}
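
The same copy-versus-in-place distinction can be reproduced with a plain integer, which avoids any question about invalid bool values. The following is a minimal sketch with hypothetical function names; the last function also reports the sizes that show Option<MaybeUninit<bool>> needs a separate tag byte because MaybeUninit<bool> offers no niche.

```rust
use std::mem::{size_of, MaybeUninit};

/// Writes through a pointer to the *copy* produced by `unwrap()`.
fn write_to_copy() -> Option<u8> {
    let opt = Some(0u8);
    let mut copy = opt.unwrap(); // `Option<u8>` is `Copy`, so `opt` is untouched
    let p: *mut u8 = &mut copy;
    // SAFETY: `p` points to a live, initialized local.
    unsafe { p.write(2) };
    opt
}

/// Writes through a pointer into the `Option`'s own storage.
fn write_in_place() -> Option<u8> {
    let mut opt = Some(0u8);
    let p: *mut u8 = opt.as_mut().unwrap();
    // SAFETY: `p` points into `opt`, which is live and initialized.
    unsafe { p.write(2) };
    opt
}

/// Sizes of `Option<bool>` and `Option<MaybeUninit<bool>>`.
fn option_sizes() -> (usize, usize) {
    (size_of::<Option<bool>>(), size_of::<Option<MaybeUninit<bool>>>())
}
```

Here `write_to_copy()` returns `Some(0)`, `write_in_place()` returns `Some(2)`, and `option_sizes()` returns `(1, 2)`.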

The more relevant issue is this one, but you are right: should "bag of bytes" not wind up being the default, the version I originally provided would not necessarily be sound (the exact details depend on T and on what the rules around unions actually wind up being). Using [MaybeUninit<u8>; size_of::<T>()] would be one solution, but an equally valid (and probably more convenient) version would be to add a third, zero-sized field:

union BytesOrT<T>{
     val: T,
     bytes: [u8; std::mem::size_of::<T>()],
     uninit: (),
}

This is the same magic that makes MaybeUninit<T> work, and does so because all bytes are allowed to be padding (and thus uninitialized) with the ZST in there.


While this is the current implementation of MaybeUninit (and I have no reason to believe that it wouldn't continue to work as such), I don't believe that to be as resilient. Keep in mind that in this theoretical specification of union, padding bytes are not preserved, so allowing all bytes to be padding doesn't add any requirement to maintain values through a typed copy.

(Also, MaybeUninit's current definition is, as is core to the discussion, currently #[repr(transparent)], so it could theoretically have meaningfully different semantics w.r.t. what bytes get preserved to #[repr(Rust)] unions.)

The rules around what a typed copy of a union actually preserves when it's not bag-of-bytes are extremely subtle. No possible variant has bytes that care about provenance in your version either, so why should a typed copy be preserving provenance?

I both hope and expect that bag-of-bytes (no padding) will be the semantics for #[repr(Rust)] union because any[1] stronger model seems to me to be much too surprisingly complicated for marginal benefit. union just being a way to put names to what essentially behaves as typed views of a named memory region is very simple to teach.


  1. Perhaps until reaching C++'s "active variant" rules... but we have dataful enums already, so we don't need union to serve that role. The other reasonable choices I know of are (medium) to not preserve bytes which are always padding, or (strict) to require all bytes to be valid for at least one variant (all bytes are valid in padding).


Sure, padding bytes are not necessarily guaranteed to transfer, but because the byte array is also in there, a bit-for-bit reproduction is required: every byte could be padding, or every byte could be meaningful. In other words, every byte is simultaneously padding and meaningful, whereas the array of possibly uninitialized bytes makes every byte meaningful padding. It's a distinction without a difference, other than ease of use.

Additionally, if T has meaningful provenance, then a union with T as one of its members must also carry that provenance; otherwise, simply writing T to the union and immediately reading it back out would be undefined behavior. I don't see where else provenance comes into play, other than that the usual caveats about mixing pointers with byte arrays apply.

Going with "all unions are a bag of bytes unless otherwise explicitly marked" is definitely the way to go, though. Permitting niche optimizations by default is too error-prone, and if the memory is that important, allowing internal tags is likely much more useful than auto-generated niche optimizations anyway. That's not even considering the nonsense that is C++'s active member rules, which, just like type-based pointer aliasing analysis, are a footgunny crutch used so compilers can have decent codegen in a language with very few hard rules. Rust doesn't need any of that because the semantics needed for good codegen are baked into the language itself in a safe and intuitive manner.

From this discussion, it sounds like the best way to get a bag of bytes with the size and alignment of a given type is the following?

union BagOfBytes<T>{
     val: T,
     bytes: [MaybeUninit<u8>; std::mem::size_of::<T>()],
}

I think it needs to be #[repr(C)], as "Fields might have a non-zero offset".


No need: scroll up and the reference also says "[the] size of a union is determined by the size of its largest field." That means the largest field must not have a nonzero offset, and in that example, both fields are the largest.
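
One way to observe this (though not to prove a guarantee) is to compare the two layouts on a concrete instantiation. This is a sketch with hypothetical type names, fixed to u32 so that it compiles on stable; it only reports what the current compiler chooses, which for the default representation is not necessarily a promise.

```rust
use std::mem::{align_of, size_of, MaybeUninit};

/// Default (`#[repr(Rust)]`) layout.
union RustBag {
    val: u32,
    bytes: [MaybeUninit<u8>; 4],
}

/// `#[repr(C)]` layout: every field is guaranteed to sit at offset 0.
#[repr(C)]
union CBag {
    val: u32,
    bytes: [MaybeUninit<u8>; 4],
}

/// (size, alignment) of each union, in declaration order.
fn layouts() -> [(usize, usize); 2] {
    [
        (size_of::<RustBag>(), align_of::<RustBag>()),
        (size_of::<CBag>(), align_of::<CBag>()),
    ]
}
```

If both entries report a size equal to that of u32, neither field can be sitting at a nonzero offset, since each already fills the whole union.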
