Provenance when reading pointers from an erased type

I'm working on a piece of code that I'd eventually like to run under Miri, which is why I'm interested in the topic of provenance.

Essentially, what I have is an opaque NonNull<c_void> and a bit of serialized metadata in a custom format that explains how to interpret the contents behind that pointer at runtime.

Some fields of the struct that the pointer points at are in turn pointers themselves, something like this:

struct S {
    data1_ptr: NonNull<Data1>,
    data2_ptr: NonNull<Data2>,
    data1_size: u32,
    data2_size: u32,
}

Every field of S is described unambiguously by the metadata and can be read byte by byte relative to the base pointer, but the provenance will be lost.

Now I'm curious whether there is any way to retain provenance in such cases.

I don't think duplicating the provenance as a usize next to the pointer itself, like this, will help, since provenance is a compile-time thing:

struct S {
    data1_ptr: NonNull<Data1>,
    data1_provenance: usize,
    data2_ptr: NonNull<Data2>,
    data2_provenance: usize,
    data1_size: u32,
    data2_size: u32,
}

I hope what I'm writing here makes some sense.

Do you know that it currently doesn't run under Miri?

How are you going about the reading? Are you sure you can’t use std::ptr::read::<NonNull<Data1>>() to read the field? A dynamic layout doesn’t necessarily mean you can’t use provenance-preserving operations.

Adding usizes cannot help you regarding provenance, because integers do not have provenance. I think you have some misunderstanding about what provenance is, but I am not sure. Are you perhaps confusing provenance with DST pointer metadata?
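To illustrate why a stored usize cannot carry provenance, here is a minimal sketch using the strict provenance methods on raw pointers (`addr` and `with_addr`, available on recent stable Rust). The function name `addr_roundtrip` is made up for this example. The integer only holds the address; the provenance has to come from an existing pointer.

```rust
/// Hypothetical helper: splits a pointer into an address and rebuilds it.
fn addr_roundtrip() -> u32 {
    let value: u32 = 42;
    let ptr: *const u32 = &value;

    // `addr()` returns only the address; the usize carries no provenance.
    let addr: usize = ptr.addr();

    // `with_addr` creates a pointer at `addr` that takes its provenance
    // from `ptr`. The integer alone would not be enough to do this.
    let rebuilt: *const u32 = ptr.with_addr(addr);

    // SAFETY: `rebuilt` has the same address and provenance as `ptr`,
    // and `value` is still alive.
    unsafe { rebuilt.read() }
}
```

Calling `addr_roundtrip()` reads the original value back, but only because the original pointer was still around to supply the provenance.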

I do not (the thing isn't even built yet to try it out). At the same time, -Zmiri-strict-provenance will fail unless provenance is clear, and I am quite confident that reading bytes from c_void and interpreting them as pointers directly is not going to satisfy strict provenance requirements. And unless -Zmiri-permissive-provenance is used, I expect to get a warning, and I do not tolerate compiler warnings in my Rust software.

I do not actually know the type at this location. Instead, the pointer will be read as NonNull<c_void> and passed on to code that does know what it is and can turn it into a proper NonNull<Data1>, but the Data1 type is not in scope for the generic code that reads the pointer originally.

What I'm essentially working on is a generic code execution environment, which can deal with native code (what this thread is about) and various guest VMs (think WebAssembly, for example, where pointers to the VM's memory are just numbers from the host's point of view and don't even have to be the same size as host pointers). The metadata description is generic and handles both cases. Code can be compiled to different targets; if native (for testing and debugging purposes), it runs in the same process as the host, hence the desire to have strict provenance and check everything that is possible to check. But the execution environment (if you can call it that) doesn't know and doesn't depend on the specific code that is running.

Right, that makes sense, it would imply runtime checks otherwise.

casting a pointer does NOT lose the provenance.
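A minimal sketch of that point, with a placeholder `Data1` type and hypothetical helper names (`erase`, `restore_and_read`): the pointer is erased to `NonNull<c_void>` in one place and cast back in another, and it stays a pointer the whole time, so nothing about its provenance changes.

```rust
use std::ffi::c_void;
use std::ptr::NonNull;

// Placeholder type for illustration only.
struct Data1 {
    value: u64,
}

/// Erases the pointee type; this is a pointer-to-pointer cast.
fn erase(ptr: NonNull<Data1>) -> NonNull<c_void> {
    ptr.cast()
}

/// SAFETY: `ptr` must come from `erase` on a `Data1` that is still alive.
unsafe fn restore_and_read(ptr: NonNull<c_void>) -> u64 {
    let typed: NonNull<Data1> = ptr.cast();
    // SAFETY: guaranteed by the caller.
    unsafe { typed.as_ref().value }
}

fn cast_roundtrip() -> u64 {
    let mut data = Data1 { value: 7 };
    let erased = erase(NonNull::from(&mut data));
    // SAFETY: `erased` was just created from `data`, which is still alive.
    unsafe { restore_and_read(erased) }
}
```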

VM pointers are not rust pointers, so they don't need to (but can, depending on the implementation of the VM) have provenance. they can be just "addresses", and whatever that means is up to the VM implementation. as long as the VM is implemented correctly, these "addresses" can be translated into valid host pointers for the interpreter to use.

it has nothing to do with integers vs pointers. provenance is defined by the rust semantics. it will only be reified if the target (could be a "virtual" target like miri, or a "real" target like the CHERI architecture) supports it. on most targets, it will be completely erased at runtime, and even on supporting hardware architectures, it is designed to have negligible overhead.

I'm not just casting a pointer. In the original example I don't even have S, let alone the Data1 type, available to the code, only their description as metadata in a custom format. So if the metadata says that there is a pointer to something at the beginning of the memory behind the pointer, then the "execution environment" (if you can even call it that) will read the pointer from there and write it to another memory buffer that a different piece of code will later cast to an actual data structure.

The whole workflow is roughly something like this pseudo-code:

// One crate
struct S {
    data1_ptr: NonNull<Data1>,
    data2_ptr: NonNull<Data2>,
}

// Different crate that doesn't have access to either `S` or `S2`,
// just has metadata about data layout of both data structures that
// allows it to do the following transformation in the runtime:
fn convert(s_ptr: NonNull<c_void>, s2_buffer: &mut Vec<u8>) {
    let data1_ptr: *const c_void = ptr::read(s_ptr);
    let data2_ptr: *mut c_void = ptr::read(
        s_ptr.byte_offset(size_of::<usize>()),
    );
    s2_buffer.extend((data2_ptr as usize).to_ne_bytes());
    s2_buffer.extend((data1_ptr as usize).to_ne_bytes());
}

// Yet another crate
struct S2 {
    data2_ptr: NonNull<Data2>,
    data1_ptr: NonNull<Data1>,
}

Now how do I prove to the compiler that S2.data1_ptr comes from S.data1_ptr?

that's perfectly valid, the metadata is essentially an "external" discriminator.

as long as they are within the same memory space, the type will work it out.

you cannot, as written in this snippet. when you do data1_ptr as usize, you discard the provenance information.

note, with strict provenance, you will NOT be able to roundtrip between pointers and integers. you can cast pointers to integers with the "exposed provenance" API, but it is NOT conformant to strict provenance.

for strict provenance, you must stick to pointer types. it doesn't matter whether you "cast" directly between pointer types or indirectly by type-punning through memory.

in essence, you will not have strict provenance if you serialize pointers as raw bytes in a Vec<u8>, as to_ne_bytes() is not available for pointers.

But note that “will run under Miri” does not require “compliant with Strict Provenance”. Miri will run (with a warning) this program, which uses exposed provenance, which is outside the Strict Provenance subset:

use std::ptr::with_exposed_provenance;

fn main() {
    let s1: &&str = &"hello world";
    let p1: *const &str = s1;
    let mut v: Vec<u8> = Vec::new();
    v.extend(p1.expose_provenance().to_ne_bytes());
    
    let p2: *const &str = with_exposed_provenance(usize::from_ne_bytes(*v.first_chunk().unwrap()));
    let s2: &&str = unsafe { &*p2 };
    println!("{s2}");
}

Also, you can have an untyped buffer which doesn’t require passing pointers through usize — you just have to not use Vec<u8>, and use only ptr::write() and ptr::read() with raw pointers into the buffer to read and write your pointer data from and to the buffer. (Or maybe someone’s written some safer wrapper around doing that. I’m not familiar with the area.)
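As a rough sketch of that idea (not a safe wrapper), the following allocates raw memory with `std::alloc`, stores a pointer into it with `ptr::write`, and loads it back with `ptr::read`. The function name is made up for this example, and the buffer is sized for exactly one pointer to keep things short.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::ptr;

/// Hypothetical example: stores a pointer in an untyped allocation and loads it back.
fn roundtrip_through_raw_buffer() -> i32 {
    let value: i32 = 5;
    let original: *const i32 = &value;

    // Untyped storage with the size and alignment of one pointer.
    let layout = Layout::new::<*const i32>();
    // SAFETY: the layout has a non-zero size.
    let buffer: *mut u8 = unsafe { alloc(layout) };
    assert!(!buffer.is_null(), "allocation failed");

    // SAFETY: `buffer` is valid and suitably aligned for one `*const i32`;
    // the pointer is written and read at pointer type, never as an integer.
    let result = unsafe {
        ptr::write(buffer.cast::<*const i32>(), original);
        let loaded: *const i32 = ptr::read(buffer.cast::<*const i32>());
        *loaded
    };

    // SAFETY: `buffer` was allocated above with the same layout.
    unsafe { dealloc(buffer, layout) };
    result
}
```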

From Ralf Jung's blog:

The right type to use for holding arbitrary data is MaybeUninit, so e.g. [MaybeUninit<u8>; 1024] for up to 1KiB of arbitrary data. MaybeUninit can also hold pointers with their provenance without any trouble.

(Vec<MaybeUninit<u8>> should be fine as well)
You should probably then create a pointer to that buffer, cast that pointer to whatever type you want to load, and then do the load. I believe provenance will be preserved that way.
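A small sketch of what that could look like, assuming a fixed 64-byte `[MaybeUninit<u8>; 64]` buffer and a made-up function name. The unaligned read and write are used because a byte array has no alignment guarantee for pointers.

```rust
use std::mem::{size_of, MaybeUninit};

/// Hypothetical example: stores a pointer in a MaybeUninit byte buffer and loads it back.
fn roundtrip_through_maybe_uninit() -> u16 {
    let value: u16 = 1234;
    let original: *const u16 = &value;

    // Storage for arbitrary data, including pointers.
    let mut buffer = [MaybeUninit::<u8>::uninit(); 64];
    assert!(size_of::<*const u16>() <= buffer.len());

    // Pointer to the start of the buffer, cast to the type we want to store.
    let slot = buffer.as_mut_ptr().cast::<*const u16>();

    // SAFETY: `slot` points to at least `size_of::<*const u16>()` writable
    // bytes; unaligned accesses are used because the buffer is byte-aligned.
    unsafe {
        slot.write_unaligned(original);
        let loaded: *const u16 = slot.read_unaligned();
        *loaded
    }
}
```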

If the integer is the only problem, is this okay for strict provenance?

// One crate
struct S {
    data1_ptr: NonNull<Data1>,
    data2_ptr: NonNull<Data2>,
}

// Different crate that doesn't have access to either `S` or `S2`,
// just has metadata about data layout of both data structures that
// allows it to do the following transformation in the runtime:
fn convert(s_ptr: NonNull<c_void>, s2_buffer: &mut Vec<u8>) {
    s2_buffer.reserve(size_of::<usize>() * 2);

    let data1_ptr: *const c_void = ptr::read(s_ptr);
    let data2_ptr: *mut c_void = ptr::read(
        s_ptr.byte_offset(size_of::<usize>()),
    );

    s2_buffer.as_mut_ptr().write(data2_ptr);
    s2_buffer
        .as_mut_ptr()
        .byte_offset(size_of::<usize>())
        .write(data1_ptr);

    s2_buffer.set_len(s2_buffer.len() + size_of::<usize>() * 2);
}

// Yet another crate
struct S2 {
    data2_ptr: NonNull<Data2>,
    data1_ptr: NonNull<Data1>,
}

In this case I read and write pointers as pointers without casting them to integers first.

You load and store pointers without casting them to integers, but you store them "in u8s" if that makes sense.

From my understanding of the blog i linked that still loses provenance as a u8 can't "store" it. That's why you should use MaybeUninit. (maybe a custom defined union would work as well, but i'm not sure there) It can "store" the provenance of the pointer you write to it.

That is exactly my challenge, I have to "manually" construct a data structure in aligned memory buffer (that is "just u8s") before calling unknown unsafe extern "C" fn(NonNull<c_void>) that will finally make sense of it.

Why does your memory buffer have to be just u8s? Why can't you use MaybeUninit<u8> in your buffer? It shouldn't make a difference if you access it via raw ptrs anyway.

Instead of Vec<u8> you use Vec<MaybeUninit<u8>>

It could, that was just pseudo-code. Vec::<u8>::spare_capacity_mut() returns &mut [MaybeUninit<u8>], which is the same memory that is accessed through .as_mut_ptr(), just with a different type (and not necessarily just spare capacity).

Even then it still feels under-specified. For example, what if I have Vec<usize>? Can I write a pointer into MaybeUninit<usize> in that case with the same result?

I believe we are talking about different things right now. I'm not suggesting you use MaybeUninit only while writing the data into the buffer (like with spare_capacity).
I suggest that you always use MaybeUninit for your storage. Your pseudo function would take a &mut Vec<MaybeUninit<u8>>. Every other function that interacts with that storage also uses MaybeUninit.

You never have Vec<usize>; you always wrap the element type in MaybeUninit.

So even when you have "initialized" the data you still use MaybeUninit, because your data isn't a u8 but instead some ptr.
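Here is a rough sketch of how the earlier `convert` pseudo-code might look with `Vec<MaybeUninit<u8>>` as the storage. The `PointerField` struct is a made-up stand-in for the custom metadata format, `S` and `S2` use `u32` and `u64` in place of `Data1` and `Data2`, and `demo` exists only to exercise the function. Pointers are read and written at pointer type throughout, with no integer casts.

```rust
use std::ffi::c_void;
use std::mem::{offset_of, size_of, MaybeUninit};
use std::ptr::NonNull;

/// Hypothetical metadata entry: one pointer-sized field to copy.
struct PointerField {
    src_offset: usize,
    dst_offset: usize,
}

/// SAFETY: `src` must point to readable memory that holds a pointer
/// at every `src_offset` listed in `fields`.
unsafe fn copy_pointer_fields(
    src: NonNull<c_void>,
    fields: &[PointerField],
    dst: &mut Vec<MaybeUninit<u8>>,
) {
    // Make sure the destination is large enough for every field.
    let needed = fields
        .iter()
        .map(|f| f.dst_offset + size_of::<*mut c_void>())
        .max()
        .unwrap_or(0);
    if dst.len() < needed {
        dst.resize(needed, MaybeUninit::uninit());
    }

    let src_bytes = src.as_ptr().cast::<u8>();
    let dst_bytes = dst.as_mut_ptr().cast::<u8>();
    for field in fields {
        // SAFETY: the source is valid per the caller's contract, and the
        // destination was resized above to fit every field.
        unsafe {
            let p: *mut c_void = src_bytes
                .add(field.src_offset)
                .cast::<*mut c_void>()
                .read_unaligned();
            dst_bytes
                .add(field.dst_offset)
                .cast::<*mut c_void>()
                .write_unaligned(p);
        }
    }
}

#[repr(C)]
struct S {
    data1_ptr: NonNull<u32>,
    data2_ptr: NonNull<u64>,
}

#[repr(C)]
struct S2 {
    data2_ptr: NonNull<u64>,
    data1_ptr: NonNull<u32>,
}

fn demo() -> (u32, u64) {
    let mut data1: u32 = 11;
    let mut data2: u64 = 22;
    let mut s = S {
        data1_ptr: NonNull::from(&mut data1),
        data2_ptr: NonNull::from(&mut data2),
    };
    let fields = [
        PointerField {
            src_offset: offset_of!(S, data1_ptr),
            dst_offset: offset_of!(S2, data1_ptr),
        },
        PointerField {
            src_offset: offset_of!(S, data2_ptr),
            dst_offset: offset_of!(S2, data2_ptr),
        },
    ];
    let mut buffer = Vec::new();
    // SAFETY: `s` matches the layout described by `fields`, and the buffer
    // holds a complete `S2` after the copy.
    unsafe {
        copy_pointer_fields(NonNull::from(&mut s).cast(), &fields, &mut buffer);
        let s2: S2 = buffer.as_ptr().cast::<S2>().read_unaligned();
        (*s2.data1_ptr.as_ptr(), *s2.data2_ptr.as_ptr())
    }
}
```

The generic code only sees offsets and `*mut c_void`, which matches the constraint that it knows neither `S` nor `S2`.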

I see, that makes a lot more sense now. Basically indicating to the compiler that we're not intending to read that memory as integers at all. A bit counter-intuitive though.

And then at the last step, when I need the S2 pointer, should I do something like this?

fn process_s2(s2_buffer: &[MaybeUninit<u8>]) {
    let s2_ptr = s2_buffer.slice_as_mut_ptr().cast::<S2>();
    // ...
}

your reasoning is probably right, but the example code contains some errors. the following is an example how to load and store pointers (preserving provenance) using untyped raw bytes.

note, MaybeUninit solves different problems and is irrelevant for provenance discussion.

/// SAFETY:
/// `s_ptr` must point to valid data with the layout described by the metadata.
/// `s2_buffer` must have enough space to store data with the layout described by the metadata.
/// When this function returns, the raw memory behind `s2_buffer` might contain uninitialized data.
unsafe fn convert(s_ptr: NonNull<c_void>, s2_buffer: NonNull<u8>) {
    // use byte pointer for convenient offset calculation
    let s_ptr: *const u8 = s_ptr.as_ptr() as _;

    // suppose `s_ptr` points to data equivalent to this layout:
    #[repr(C)]
    struct S {
        data1_ptr: NonNull<Data1>,
        data2_ptr: NonNull<Data2>,
    }

    let data1_ptr: NonNull<Data1> =
        unsafe { std::ptr::read_unaligned(s_ptr.add(std::mem::offset_of!(S, data1_ptr)) as _) };
    let data2_ptr: NonNull<Data2> =
        unsafe { std::ptr::read_unaligned(s_ptr.add(std::mem::offset_of!(S, data2_ptr)) as _) };

    // suppose `s2_buffer` can store data equivalent to this layout:
    #[repr(C)]
    struct S2 {
        data2_ptr: NonNull<Data2>,
        data1_ptr: NonNull<Data1>,
    }

    let s2_ptr = s2_buffer.as_ptr();
    unsafe {
        std::ptr::write_unaligned(
            s2_ptr.add(std::mem::offset_of!(S2, data1_ptr)) as _,
            data1_ptr,
        );
        std::ptr::write_unaligned(
            s2_ptr.add(std::mem::offset_of!(S2, data2_ptr)) as _,
            data2_ptr,
        );
    }
}

Yes, something like that. Then you need to be careful about alignment (or use read_unaligned), and about references and their requirements when mixing them with raw ptrs.

So for example, I don't think you are allowed to take a shared reference as a parameter and then mutate the data behind it. Maybe you can't even create a mut ptr from it, but I'm not sure there.

I based my statement that MaybeUninit has an effect on Provenance on this article by ralf jung, who has designed miri. It is pretty clear in this regard and i don't believe that anything has changed since that was written.

Yes, I do have alignment requirements taken care of there.

That is a good point, should be &mut there.

That seems to be mostly the same code that I had, just with explicitly unaligned reads and writes that I will ensure are aligned on both ends, so don't need to worry about (though good to be explicit about it).

Could you elaborate on this?