How to pass a repr(C) enum to a C function that will hash it?

I have an enum that I need to pass to a C function that will hash the entire structure (not just the individual values). The enum is similar to this:

#[repr(C)]
enum Test {
    One(u8),
    Two
}

The issue is that the second option (Two) will often be padded with garbage and two of these enums that are logically equal to each other will hash differently.

I looked into getting the size of a variant and then zeroing out the remaining memory, but there seems to be no way to find the ending address of a variant.

Is there any way to do this or do I need to manually create and set a padding integer/array to make sure each variant is the same length?

I looked into getting the size of a variant and then zeroing out the remaining memory, but there seems to be no way to find the ending address of a variant.

There can be padding gaps in the middle, too, so you have to check all the fields of the variant.

Is there any way to do this or do I need to manually create and set a padding integer/array to make sure each variant is the same length?

There's probably a macro library to do this, but I don’t know of one. But if you do use explicit fields to hold the zeroes, a good way to do it is:

#[repr(u8)]
enum Zero { Zero = 0 }

#[repr(C)]
enum Test {
    One(u8),
    Two([Zero; 1])
}

This way it’s impossible for those bytes to contain anything but zeroes.
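To make the idea concrete for the whole enum, here is a minimal sketch that pads every variant to the same payload size and then views the value as bytes. The padded variant shapes, the `as_bytes` helper, and the size check are illustrative assumptions, not part of the original suggestion. It also assumes the `#[repr(C)]` tag is 4 bytes, which is typical but target-dependent.

```rust
use core::mem::size_of;
use core::slice;

/// A one-byte type whose only valid value is 0.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Zero {
    Zero = 0,
}

/// Hypothetical padded layout: every variant's payload is exactly 4 bytes,
/// so no variant leaves bytes that the compiler may treat as padding.
#[repr(C)]
#[derive(Debug, PartialEq)]
enum Test {
    One(u8, [Zero; 3]),
    Two([Zero; 4]),
}

// Assumption: the C enum tag is 4 bytes. If the target uses another tag
// size, this fails at compile time and the filler arrays must be adjusted.
const _: () = assert!(size_of::<Test>() == 8);

/// Views a `Test` as raw bytes (hypothetical helper).
fn as_bytes(value: &Test) -> &[u8] {
    // SAFETY: with the size check above, the 4-byte tag and the 4-byte
    // payload of every variant cover all 8 bytes, so there is no padding
    // and every byte read here is initialized.
    unsafe { slice::from_raw_parts((value as *const Test).cast::<u8>(), size_of::<Test>()) }
}
```

With this layout, two logically equal values always produce identical bytes, for example `as_bytes(&Test::Two([Zero::Zero; 4]))`, at the cost of spelling out the filler in every constructor and pattern.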


If you can use Option<NonZeroU8>, it is guaranteed to have the same layout as NonZeroU8 -- i.e. as u8.
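As a rough illustration, and only for the case where the payload of One never needs to be 0, the whole enum could collapse into a single byte. The `TestByte` alias and the two conversion helpers below are hypothetical names chosen for this sketch.

```rust
use core::mem::size_of;
use core::num::NonZeroU8;

// Hypothetical FFI-facing representation: one byte, where 0 stands for
// `Two` and any non-zero value stands for `One(value)`.
type TestByte = Option<NonZeroU8>;

// Option<NonZeroU8> occupies exactly one byte, so there is no padding.
const _: () = assert!(size_of::<TestByte>() == 1);

/// Converts to the byte that the C side would see.
fn to_byte(value: TestByte) -> u8 {
    value.map_or(0, NonZeroU8::get)
}

/// Converts a byte back; 0 becomes `None`.
fn from_byte(byte: u8) -> TestByte {
    NonZeroU8::new(byte)
}
```

This only works when one payload value can be sacrificed to encode the empty variant, so it does not generalize to enums with many variants.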

These explanations of reprs may be useful if you go the manual route.

it seems you are hashing raw bytes, not enum Test, the rust type. even if it's marked as #[repr(C)], it is still a rust type.

if you want to hash raw bytes, that's fine, just use &[u8]; just don't use the wrong type Test.

instead, you should define your custom conversions:

unsafe extern "C" {
    unsafe fn hasher(ptr: *const u8, len: usize) -> u64;
}
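const TEST_SIZE: usize = 8;
impl Test {
    // write each field into a fully initialized buffer; never copy padding
    fn to_bytes(&self) -> [u8; TEST_SIZE] { todo!() }
    // optionally, also implement the inverse conversion
    //fn from_bytes(bytes: [u8; TEST_SIZE]) -> Self { todo!() }
}

fn main() {
    let test: Test = todo!();
    // keep the buffer in a local so the pointer stays valid during the call
    let bytes = test.to_bytes();
    // SAFETY: `bytes` is a live, fully initialized buffer of `bytes.len()` bytes
    let hash = unsafe { hasher(bytes.as_ptr(), bytes.len()) };
    println!("hash of test: {hash}");
}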

I don't have control over the hashing at all, that is done by a C function that I call with this struct.

Enforcing that the padding is 0 does make more sense. It's still not very ergonomic, but I feel like this might be as good as it gets.

This isn't the actual struct, just a simple example. The real one has many variants with struct fields.

I did take a look at the book on reprs but I didn't see anything about how to force padding to be zeroed.

it might be true that the C function is lying about the parameter type (e.g. it was declared as some struct Foo, but internally treated as an array of unstructured bytes), but that's not the point. (although if it did so, it is a badly written C function: C has padding bytes too)

my point is, there's a high level abstraction type (rust enum) and a low level ffi representation type (some #[repr(C)] type). the goal of the former could be ergonomic APIs, strong safety guarantees, etc, while the purpose of the latter is to make sure the data will NOT be misinterpreted by either side of the ffi boundary.

it would be great if a single type definition can serve these two purposes at the same time, but it is NOT always possible, for example, sometimes the high level type uses language features that are not ffi compatible, or the ffi representation couldn't guarantee certain safety invariant or was too verbose to have an ergonomic API, etc.

it is better to add a conversion at the ffi boundary, so you have a clear separation between high level and low level concerns.

in this example, it doesn't matter if the ffi representation is [u8; SIZE], or it is some convoluted struct/union/enum with handcrafted memory layout and/or carefully placed padding bytes. all that matters is that it agrees with how it will be interpreted by the other side of the ffi boundary.


There isn't any such mechanism built into repr. But knowing the actual layout is necessary to avoid padding manually.

Would starting with a zero filled MaybeUninit be a solution?

No, because writing the value into the MaybeUninit may or may not copy uninit bytes into it too.

If a type has padding, the only way to write a value of that type into a buffer and guarantee there are no uninit bytes afterward is to explicitly copy each of its fields (and enum discriminant) as a separate operation. (And each of those fields must be copied field-by-field if they contain padding.)

Note that this can also be understood as a “serialization” operation that merely happens to have a format compatible with the type layout.

(I wonder if anyone has written a macro library to do this.)


Could you expand on this a bit? I tried to manually copy each field but it did not work with the enum, copying the smaller variant always came with uninit padding.

Have you followed the official recommendation? After all, "currently unsafe code must be used to inspect the discriminant of an enum with fields" wasn't written just for fun!

You would need to declare two enums (one with payload, one without payload), make every field its own struct, and then copy only what you want to copy.

Padding is padding. Period. Full stop. To ensure that there's something sensible in those bytes, you should never allow the compiler to touch them via the Test type. Use some different type that doesn't have padding. The language provides all the tools necessary, although they are a real PITA to use.

What's to explain here? Everything was explained already:

Rust's definition of enum Test very much includes information about the fact that certain parts of it can be ignored.

To pass something to the C side with guaranteed content, you have to guarantee that this part of memory is never treated, on the Rust side, as a Test. Every time the compiler sees that you are writing a Test variant with padding, it assumes that the bytes belonging to the padding can have arbitrary values.

That's inalienable part of type Test definition.

You must

  • copy the bytes of the discriminant
  • copy the non-padding bytes of each field
  • zero all bytes that are neither field nor discriminant (or just zero everything first)

One property this technique has is that you can implement it in safe code, which guarantees no UB. (The part that is unsafe is then re-interpreting the produced bytes as a Test, but writing them is safe.) Here is a demonstration of that:

use core::{ptr, mem};

#[repr(u8)]
#[derive(Debug, PartialEq)]
enum Test {
    One(u8) = 1,
    Two = 2,
    Three(u8, u32) = 3,
}

fn copy_field<T, F, const N: usize>(buf: &mut [u8], whole: &T, field: &F, field_bytes: [u8; N]) {
    const {
        assert!(size_of::<F>() == N);
    }
    let offset = ptr::from_ref(field).addr() - ptr::from_ref(whole).addr();
    buf[offset..][..N].copy_from_slice(&field_bytes);
}

impl Test {
    fn to_bytes_no_uninit(&self) -> [u8; size_of::<Self>()] {
        let mut buf = [0; size_of::<Self>()];
        let discriminant = match self {
            Self::One(x) => {
                copy_field(&mut buf, self, x, x.to_ne_bytes());
                1
            }
            Self::Two => 2,
            Self::Three(x, y) => {
                copy_field(&mut buf, self, x, x.to_ne_bytes());
                copy_field(&mut buf, self, y, y.to_ne_bytes());
                3
            }
        };
        buf[0] = discriminant;
        buf
    }
}

fn main() {
    let bytes = Test::Three(0xAA, 0xBBCCDDEE).to_bytes_no_uninit();

    // Note: This assertion only passes given little-endian and
    // align_of::<u32>() == 4, which is typical but not guaranteed.
    // Other architectures will have different layouts.
    assert_eq!(bytes, [0x03, 0xAA, 0, 0, 0xEE, 0xDD, 0xCC, 0xBB]);
    
    assert_eq!(
        unsafe { mem::transmute::<[u8; size_of::<Test>()], Test>(bytes) },
        Test::Three(0xAA, 0xBBCCDDEE),
    );
}

to_bytes_no_uninit() will always produce a fully initialized byte array whose contents have the same layout as a Test value, with all bytes that are padding in the layout initialized to zero instead. It could be made much more elegant with a trait and even a derive macro, but this is the basic principle. It could be made shorter with more use of unsafe code, but if you choose to do that, you have to be careful that you’re never copying any padding anywhere. This code can be confident it never copies padding because it always uses .to_ne_bytes() on the primitive field types. If there were nested enums/structs in Test, it would similarly define functions like Test::to_bytes_no_uninit() for each enum/struct involved.
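As a rough sketch of the trait-based variant mentioned above, the following shows how nested types could each write only their own meaningful bytes. The `WriteNoUninit` trait and the `Inner` and `Outer` structs are hypothetical names for illustration. It also swaps the address arithmetic of `copy_field` for `core::mem::offset_of!`, which is enough here because only struct fields are involved.

```rust
use core::mem::{offset_of, size_of};

/// Hypothetical trait: write only the meaningful (non-padding) bytes of
/// `self` to the start of `buf`, which the caller has already zero-filled.
trait WriteNoUninit {
    fn write_no_uninit(&self, buf: &mut [u8]);
}

impl WriteNoUninit for u8 {
    fn write_no_uninit(&self, buf: &mut [u8]) {
        buf[..1].copy_from_slice(&self.to_ne_bytes());
    }
}

impl WriteNoUninit for u32 {
    fn write_no_uninit(&self, buf: &mut [u8]) {
        buf[..4].copy_from_slice(&self.to_ne_bytes());
    }
}

/// Nested struct with padding between `a` and `b` on typical targets.
#[repr(C)]
struct Inner {
    a: u8,
    b: u32,
}

impl WriteNoUninit for Inner {
    fn write_no_uninit(&self, buf: &mut [u8]) {
        // Each field goes to its own offset; padding bytes are never copied.
        self.a.write_no_uninit(&mut buf[offset_of!(Inner, a)..]);
        self.b.write_no_uninit(&mut buf[offset_of!(Inner, b)..]);
    }
}

/// Outer struct that embeds `Inner`, adding more padding after `tag`.
#[repr(C)]
struct Outer {
    tag: u8,
    inner: Inner,
}

impl WriteNoUninit for Outer {
    fn write_no_uninit(&self, buf: &mut [u8]) {
        self.tag.write_no_uninit(&mut buf[offset_of!(Outer, tag)..]);
        // Delegating to the nested impl keeps `Inner`'s padding zeroed too.
        self.inner.write_no_uninit(&mut buf[offset_of!(Outer, inner)..]);
    }
}

/// Produces a fully initialized image of `Outer` with zeroed padding.
fn to_bytes(value: &Outer) -> [u8; size_of::<Outer>()] {
    let mut buf = [0u8; size_of::<Outer>()];
    value.write_no_uninit(&mut buf);
    buf
}
```

A derive macro could generate the struct impls mechanically. Enum variants would still need a hand-written `match` like the one in `Test::to_bytes_no_uninit()` above.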
