Derefencing out of bounds under miri

I'm trying to read data that comes in messages of the following format

// A single zero byte signals a special message (outside of scope here)
0x0
// A message starting with a byte 1 <= tag < 0x80 is a "small" message
// and is immediately followed by that number - 1 of bytes
0xXY | message bytes
// A large message starts with u32 encoded in big-endian, and the first bit
// of the first byte set, followed by the indicated number - 4 of bytes
0x8Y 0xYY 0xYY 0xYY | message bytes

// Example
// [12, 104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
//  ^^ a small message of 12 bytes
//      ^--------------------payload-----------------------^
//                 decodes to b"hello world"

My first though was to encode this as an enum, but no repr(_) annotation seems to encode the correct choice of tag here, so I went with the following union instead

#[repr(C)]
union Payload {
    small: ManuallyDrop<SmallPayload>,
    large: ManuallyDrop<LargePayload>,
    special: SpecialPayload,
}
#[repr(C)]
struct SmallPayload {
    size: u8, // invariant: 0 < len < 0x80
}
#[repr(C, align(4))]
struct LargePayload {
    size: [u8; 4], // first bits of [0] are 0b1...
}
#[derive(Clone, Copy, Debug)]
#[repr(C, u8)]
enum Marker {
    SpecialPad = 0b0000_0000,
}
#[derive(Clone, Copy, Debug)]
#[repr(C)]
struct SpecialPayload {
    marker: Marker,
}

It is obviously incorrect to just construct values here, since the message bytes are out-of-band and need to be accessible immediately after the Payload structure. I have a load method verifying this (panics if assumptions are violated, real implementation returns an Error).

fn load(bytes: &[u8]) -> &Payload {
    assert!(bytes.as_ptr().addr().is_multiple_of(align_of::<Payload>()), "wrong alignment");
    let size = bytes[0];
    if size < 0x80 { // small
        let _payload = &bytes[1..size as usize];
    } else if size > 0 { // large
        let size = u32::from_be_bytes(bytes[0..4].try_into().unwrap()) & 0x7FFF_FFFF;
        let _payload = &bytes[4..size as usize];
    }
    unsafe { &*(bytes as *const _ as *const _) }
}

I then provide a wrapper to access the message from a message, for example for a small message, I have

impl SmallPayload {
    fn message(&self) -> &[u8] {
        let message_slice = 1..self.size as usize;
        let as_bytes = unsafe { std::slice::from_raw_parts((self as *const Self).cast(), message_slice.end) };
        &as_bytes[message_slice]
    }
}

Full playground link.

Now, the problem: miri does not agree with me.

error: Undefined Behavior: trying to retag from <512076> for SharedReadOnly permission at alloc37223[0x1], but that tag does not exist in the borrow stack for this location

As far as I understand it, the problem is that the reference to a Payload (and subsequently SmallPayload) does not get to read from the bytes past its end, even though these are in the same allocation and - in spirit at least - part of the payload object.

How should I model this to make miri agree? For the same matter, how does rkyv achieve this for its ArchivedString anyway?

It's the header problem.

Using raw pointers and PhantomData<&'for_the_lifetimes ()> is one (general) fix.

You could also rework Payload to be something like

// Perhaps optional if always implied by length or you just don't
// need the distinction post-deserialization
enum PayloadKind {
    Small,
    Large,
    Special,
}

struct Payload<'a> {
    kind: PayloadKind,
    data: &'a [u8],
}

Or an enum itself, etc.

1 Like

I have found this rkyv issue, which points out that MIRIFLAGS="-Zmiri-tree-borrows" accepts the code as written.

Returning a structure like struct PayloadRef<'a>(PhantomData<&'a ()>, *mut Payload) instead of &'a Payload from load might be workable. Thanks for the reference to the rust-lang issue! It will probably compose poorly with other code, but with a bit of squinting ...

Storing an extra (derived) pointer to just the data is a non-starter for me, as I'd like to zero-copy deserialize these as part of a bigger structure, so the extra pointer that would need to be materialized for every Payload won't work - there is no space to store it.

1 Like

Further attempts of doing the right thing (following the discussion in the linked issue), leads me to believe that exposing the provenance to the whole payload in load via

    let ptr = bytes.as_ptr();
    ptr.expose_provenance();
    unsafe { &*(ptr as *const Payload) }

and then "restoring" this provenance when reinterpreting it as a message

        let this = std::ptr::with_exposed_provenance::<Self>((self as *const Self).addr());
        let as_bytes = unsafe { std::slice::from_raw_parts(this.cast(), message_slice.end) };

is a minimally invasive change that accepts the code also under Stacked Borrows. The only trace are warnings that miri is unable to guarantee safety - though I'm under the impression that I am correctly using the API as intended to restore a pointer with provenance to the message bytes.

(Just brainstorming in this comment, I may not fully understand your use case.)

You can still just store a &[u8], I think. After parsing the length, store a &[u8] from the beginning to the end as indicated to the length (and whatever you need for special, so long as it contains the first byte). Reading the first byte tells you which "variant" you have; you could have new types for the variants to avoid reading the byte if you like. You don't have to reparse the length if you know the variant, it's

        match self.kind() {
            PayloadKind::Special => 0, // or whatever
            PayloadKind::Small => self.data.len() - 1,
            PayloadKind::Large => self.data.len() - 4,
        }

And similarly for returning the message slice.


expose_provenance opts out of strict provenance. It's a way to document that yeah, you're doing stuff that Miri can no longer check. Since you care about Miri, I'm not sure that's really what you want...

I'm browsing through alternatives. If miri can verify the correctness, that's the best. If miri can't verify the result (such as without strict prov) but the code is not UB, that's also good. Code under tree borrows that might become UB in the future (such as the originally posted one) are suboptimal. Now, given that opting out strict prov makes it dangerous, I'm still looking for other approaches - but at least I'd not be alone.