Reading c-style structures from disk

EriolAncalagon · January 18, 2022, 7:55pm

First post, using unsafe. Please be kind.

I'm trying to read in c-style structures from binary dumps that were created in C++. They're all POD types, so no worries there. I'm basing my code on this stackoverflow post and with a little modification it works, but I ran into a literal stack overflow because the structures are too large (over 100k, so not a surprise). So I'm modifying them to return a Box<T> instead of just T, and I'm running into having to use alloc and dealloc, which doesn't seem like it would be necessary, but I have a feeling I'm missing something. The code is below:

const BIG_STRUCT_SIZE: usize = 140000;

#[derive(Clone, Copy, Debug)]
#[repr(C)]
#[repr(packed(1))]
struct My_Big_C_Struct{
    // Not implemented the sub-structures yet!
    bulk: [u8; BIG_STRUCT_SIZE - mem::size_of::<u32>()],
    crc: u32, // Every struct has a crc at the end
}

// Reads in the specified struct from the specified path.  If there is not an exact match on size, returns an error
fn read_struct<T, P: AsRef<Path>>(path: P) -> std::io::Result<Box<T>> {
    let path = path.as_ref();
    let struct_size = ::std::mem::size_of::<T>();
    let num_bytes = fs::metadata(path)?.len() as usize;
    if struct_size != num_bytes {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!(
                "Length of file did not match structure specified.  File length: {}, Expected: {}",
                num_bytes, struct_size
            ),
        ));
    }
    let mut reader = BufReader::new(File::open(path)?);
    unsafe {
        let ptr_layout = Layout::new::<T>();
        let raw_ptr = alloc(ptr_layout);
        let buffer = slice::from_raw_parts_mut(raw_ptr, num_bytes);
        match reader.read_exact(buffer) {
            Ok(_) => Ok(Box::from_raw(raw_ptr as *mut T)),
            Err(e) => {
                dealloc(raw_ptr, ptr_layout);
                Err(e)
            },
        }
    }
}

Is there a way to do this with Box where I can just say "make an empty one for the type" and just go? I'm thinking I'm just missing something obvious, because I don't want to have to use alloc and dealloc at all if I can help it.

I'm also concerned that in the comments on the answers Shepmaster says that " it improperly uses unsafe Rust. The proposed function can introduce memory unsafety in safe Rust code". So I'm wondering if the approach is wrong? I don't want to read from the binary dumps field by field, I want the structure definition itself to say the sizes and such, as then there's only one opportunity to get it right/wrong.

Really, all I'm trying to do is get some big binary structures (I can't alter that, they'll be binary) into Rust from disk so I can do math on them. I think this is the way, but I'm open to other suggestions too.

H2CO3 · January 18, 2022, 8:10pm

You could just derive Default on the struct.

EriolAncalagon · January 18, 2022, 8:16pm

You could just derive Default on the struct.

I'm listening, since I control the types in Rust I can do that. I'm not experienced enough with Rust yet to know instinctively how that helps me though. I'm guessing that gets me a Box<T> more easily, but then I need the pointer to read into with from_raw_parts_mut, which I don't know how to do either, and Google is failing me.

Please help a little more if you can.

JoelMon · January 18, 2022, 8:32pm

Hello there!

I had a very similar problem. You can find my post here: Deserializing a .dat binary file created in CPP.

@Michael-F-Bryan's post was extremely helpful and he also wrote a blog post on the topic that I recommend you read: Deserializing Binary Data Files in Rust (WBM).

Hope that gets you on your way!

EriolAncalagon · January 18, 2022, 9:26pm

@JoelMon, I appreciate both your and @H2CO3's responses, but they both have a common problem: you can't #[derive(Default)] on byte arrays over 32 bytes long! You get an E0277 error. This snippet below demonstrates it:

#[derive(Debug, Default)]
struct Test_Defaults {
    bulk: [u8; 40]
}

32 or below, you're good. Above that, not so much. So I'm back with how to initialize the Box<T> with your links as well. In the "real" structures there are binary blob sections larger than 32 bytes, so anything that relies on that #[derive(Default)] won't work. I might be able to actually manually implement Default for the types I need, but that's another level here, and may lead to "gotchas".

I appreciate the clarity of what you linked though for when I need to interpret C-style strings though. That is something real that will happen.

quinedot · January 18, 2022, 10:17pm

There's no ergonomic way in Rust to read in data into uninitialized memory yet. Your current code has UB in that you create a reference to uninitialized data. It's also unsound in that I could call read_struct with any type T, not just those which are POD.

If you're really all POD, you could read the file entire and convert the resulting Vec. (Though your comment about interpreting strings makes me wonder if that's true.)

I included a Default implementation in case it's useful.

kornel · January 19, 2022, 12:32am

Unfortunately, Rust's UB rules say you unleash demons the moment you use from_raw_parts_mut with uninitialized data, so you won't be able to use read_exact. The closest thing you could do is to zero-initialize it using bytemuck's Zero trait. Rust has non-nullable types, so it's not safe to zero-init arbitrary types.

AFAIK this is the only safe way to read a number of bytes into uninitialized memory using Rust's standard library:

reader.take(number_of_bytes).read_to_end(&mut vec);

edit: the function is aware of the capacity, and will try not to exceed it.

The easiest way to make a Box from uninitialized memory is to make Vec::with_capacity(1). You can write to it via raw pointer, set_len(1) and then into_boxed_slice(). But capacity must equal len to avoid copy in into_boxed_slice.

The good news is that these are all known problems, and fixes are on the way. There's Box::new_uninit, and I/O will have a growable buffer abstraction that works with things other than Vec.

In the meantime, libc::fread may do the job

EriolAncalagon · January 19, 2022, 3:12am

@quinedot I appreciate the ideas there, and I've since used Default similarly to initialize the memory. It's wasteful, but it's wasteful one time per object in the scale of the program, when this is going to be a utility (reads a few things once, not frequent disk access), so it's OK to duplicate this.

As for the POD problem, is there a trait bound I could put on T for this? I'd rather do so. I tried searching for such, but couldn't find one, but that could easily be "I'm inexperienced at Rust" rather than "it doesn't exist."

@kornel An interesting discussion on how/why to do this type of thing. I don't know the UB rules (very few rules in fact) which is part of the reason I asked what's "right/wrong" about what I'm doing rather than "it seems to work" for it.

Here's my modified version that does use Default and functions so far with multiple POD types, but POD is not yet enforced (that I know of). Any help with how to do that at compile-time would be great.

const BIG_STRUCT_SIZE: usize = 140000;

#[derive(Clone, Copy, Debug)]
#[repr(packed(1))]
struct My_Big_C_Struct{
    // Not implemented the sub-structures yet!
    bulk: [u8; BIG_STRUCT_SIZE - mem::size_of::<u32>()],
    crc: u32, // Every struct has a crc at the end
}
impl Default for My_Big_C_Struct{
    fn default() -> Self {
        Self {
            bulk: [u8; BIG_STRUCT_SIZE - mem::size_of::<u32>()], // Note: I'd love to say "init to 0, the entire array" rather than specifying
            crc: Default::default()
        }
    }
}

// Reads in the specified struct from the specified path.  If there is not an exact match on size, returns an error
fn read_struct<T: Default, P: AsRef<Path>>(path: P) -> std::io::Result<Box<T>> {
    let path = path.as_ref();
    let struct_size = ::std::mem::size_of::<T>();
    let num_bytes = fs::metadata(path)?.len() as usize;
    if struct_size != num_bytes {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!(
                "Length of file did not match structure specified.  File length: {}, Expected: {}",
                num_bytes, struct_size
            ),
        ));
    }
    let mut reader = BufReader::new(File::open(path)?);
    let boxed_value = Box::new(T::default());
    let ptr_from_box = Box::into_raw(boxed_value);
    unsafe {
        let buffer = slice::from_raw_parts_mut(ptr_from_box as *mut u8, num_bytes);
        reader.read_exact(buffer)?;  // does this cause the ptr_from_box to leak on failure?
        Ok(Box::from_raw(ptr_from_box))
    }
}

This enforces Default on the type, being read in, and initializes the Box<T> contents and uses Box::into_raw as well to get the pointer. I have a concern that there's a leak, though I don't know for sure.

I respect those who have done this longer to find the UB here on POD types. It "seems to work" but that's not good enough for me. I don't like it when people do that, so I'm not going to have my first substantial foray into Rust start that way either!

quinedot · January 19, 2022, 4:02am

You can use the bytemuck crate, as @kornel mentioned. This gets you

Zeroable instead of a Default implementation
- Though you could make a Default implementation from zeroed
A Pod bound for your generic function

And now I always reconstruct the Box<T> to avoid the leaking. (I didn't re-add alighment checking but probably should have.)

If the size of T can be statically known as well, you also get

A way to operate on a buffer and only convert after success
- More ergonomically avoids leakage
No unsafe of your own except the trait implementations
Allignment checking and whatever else I haven't thought of that bytemuck did

But as you can see, the current limitations of const generics require an additional trait implementation per type (as far as I know), if you want to keep things generic:

impl ReadStruct<{ mem::size_of::<MyBigStruct>() }> for MyBigStruct {}

Or you could have callers specify the size, but that seems worse to me.

There are falliable versions of allocation::* if you prefer.

With these read_exact versions, it's possible for the file to grow between your size check and reading in the struct, and for the method to still succeed. That might be a corner-case you don't care about. If you do care, you could try to read one more byte and check for EOF.

Michael-F-Bryan · January 19, 2022, 7:24am

This was actually fixed by @jkugelman

github.com/rust-lang/rust

Fix read_to_end to not grow an exact size buffer

rust-lang:master ← jkugelman:read-to-end-overallocation

opened 02:55AM - 22 Sep 21 UTC

jkugelman

+37 -41

If you know how much data to expect and use `Vec::with_capacity` to pre-allocate… a buffer of that capacity, `Read::read_to_end` will still double its capacity. It needs some space to perform a read, even though that read ends up returning `0`. It's a bummer to carefully pre-allocate 1GB to read a 1GB file into memory and end up using 2GB. This fixes that behavior by special casing a full buffer and reading into a small "probe" buffer instead. If that read returns `0` then it's confirmed that the buffer was the perfect size. If it doesn't, the probe buffer is appended to the normal buffer and the read loop continues. Fixing this allows several workarounds in the standard library to be removed: - `Take` no longer needs to override `Read::read_to_end`. - The `reservation_size` callback that allowed `Take` to inhibit the previous over-allocation behavior isn't needed. - `fs::read` doesn't need to reserve an extra byte in `initial_buffer_size`. Curiously, there was a unit test that specifically checked that `Read::read_to_end` *does* over-allocate. I removed that test, too.

H2CO3 · January 19, 2022, 10:01am

If T: Default then Box<T>: Default, too.

Didn't you do that exact thing in the piece of code you posted to begin with?

Well then you don't #[derive(Default)], you can implement it by hand.

All in all, something like this Playground should work:

pub unsafe fn read_raw_boxed<T, R>(mut reader: R) -> io::Result<Box<T>>
    where
        T: Default,
        R: io::Read,
{
    let mut value = Box::<T>::default();
    
    // SAFETY: the caller needs to ensure that `T` is POD.
    let slice = slice::from_raw_parts_mut(
        &mut *value as *mut T as *mut u8,
        size_of::<T>(),
    );
    reader.read_exact(slice)?;
    
    if reader.read_exact(&mut [0_u8]).is_err() {
        Ok(value)
    } else {
        Err(io::Error::new(io::ErrorKind::InvalidData, "file size doesn't match type"))
    }
}

By the way, there are multiple other problems with your code:

Trusting the metadata of the file for exact size is not robust, and reading the size first (separately) then doing the read blindly is simply incorrect, because it causes a TOCTOU race condition. You should instead do what is demonstrated above, i.e. just try reading and check if it succeeds.
Not sure how you plan to use the data structure, but beware that the compiler complains when I try to use the crc field when the struct is #[repr(packed)] because it is unaligned. It is most likely not what you meant, because unaligned values are generally a pain to get right, and very easy to get wrong.

(However, it is true that the struct must not contain any padding in order for it to be convertible to a [u8] byte slice, because padding is logically uninitialized, so creating a reference to it is Undefined Behavior. Others are probably right in that you should use bytemuck instead, so as to avoid having to deal with all these annoying and dangerous cases.)

EriolAncalagon · January 19, 2022, 1:37pm

Thanks for the input on this. Definitely going to look into bytemuck as well.

As for your other problems:

TOCTOU isn't a problem for utilities that aren't in a "pipeline" or other environment where things are getting written and read at the "same" time. This is for post-processing of produced files, not live producer/consumer. If this tool crashes because somehow somebody deleted or resized a file out from under the user, something else is deeply wrong on that system.
Oh the problems with packing. Early on in the project I'm working on (before I was even hired) somebody thought that putting #pragma pack(push, 1) around all of the structure definitions sent to the hardware was a good idea. And then they started adding things and pushing other values around. Eventually the structures "stabilized" and are basically unchangeable now, even if sometimes significant proportions of values are not even used anymore, but must be present because firmware was crafted on that standard as well from a different vendor. And so no, not everything is necessarily aligned for ease-of-use. I remember reading a while back that this would bite me later, but that's a "later" problem once the data is actually in.

Overnight I also realized that my latest code has a leak if reader.read_exact returns the Err variant, since ptr_from_box isn't freed since Box::into_raw(boxed_value) means the data is not managed.

kornel · January 19, 2022, 3:05pm

Default gives you initialization, but doesn't give you safety to load arbitrary bytes. Option<&'static T> implements Default, and will let you read arbitrary pointer, and then dereference it. That type is also 'static and Copy, so there really isn't a POD-safe type bound in Rust's std.

Another option if you don't have to match C structs, or can export an interface for C to use Rust's types:

or

EriolAncalagon · January 19, 2022, 3:43pm

@kornel, those seem like decent ideas, if I wasn't constrained by formats I cannot control. Luckily I can control what this will be used for (3 different file types/structures so far), and so "well if you use it with XYZ rust type, it'll break" isn't a large concern of mine. I'm decoding C-defined POD types, that's it. That it will break on some things you can pass in is OK for now. I have read the Rust Koans and so I know I'm violating the 3rd, but I will definitely be looking into bytemuck as others have suggested, so I can see if applying the POD trait there is appropriate to constrain this further so it can't be misused.

I have constrained the unsafe usage to a single line now, and I don't think I have any memory unsafety anymore, as long as T is a POD type, which I can't enforce with baseline Rust it seems.

// Reads in the specified struct from the specified path.  If there is not an exact match on size, returns an error
fn read_struct<T: Default, P: AsRef<Path>>(path: P) -> std::io::Result<Box<T>> {
    let path = path.as_ref();

    // Compare the type's size to that of the file we're reading from, returning on error
    let struct_size = ::std::mem::size_of::<T>();
    let num_bytes = fs::metadata(path)?.len() as usize;
    if struct_size != num_bytes {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!(
                "Length of file did not match structure specified.  File length: {}, Expected: {}",
                num_bytes, struct_size
            ),
        ));
    }

    // Open the file to read from
    let mut reader = BufReader::new(File::open(path)?);
    // Initialize the memory we're going to write into
    let mut boxed_value = Box::new(T::default());
    // Get a pointer to write to directly from the file.  Pointer doesn't leak, as boxed_value still owns the memory
    let ptr_from_box: *mut T = &mut *boxed_value;
    // Turn the pointer we have into a slice so we can read into it from BufReader
    let buffer = unsafe { slice::from_raw_parts_mut(ptr_from_box as *mut u8, num_bytes)};
    // Do the actual read from disk into the memory that boxed_value allocated above.  If this goes wrong, boxed_value still owns the memory, and destructs it
    reader.read_exact(buffer)?;
    // Read went fine, so return the box, as it's been populated with the contents of the file
    Ok(boxed_value)
}

I agree that the Box::new_uninit() functionality is what I really want so I don't need to initialize it, but for stable right now I think what's above only has known limitations (TOCTOU, and POD dependence), and no actual flaws. But that's the whole point of posting this topic: to have those pointed out to me.

quinedot · January 19, 2022, 9:26pm

The POD dependence makes it unsound^[1], which is a flaw. bytemuck is stable, but if you don't want the dependency, you can define your own unsafe trait to put the onus on others to guarantee PODness.

i.e. possible for other code to cause UB without unsafe; use Miri under Tools in the Playground to see it in the example ↩︎

EriolAncalagon · January 20, 2022, 6:04pm

I appreciate all the help I've gotten from everybody here, but what surprises me the most is that there's no way to force the compiler to enforce the POD-ness of a type. The unsafe trait POD from bytemuck is the user telling the compiler "oh yes, this is totally fine" and not the compiler doing its own examination. That's what I want. I don't know if 3 months, 6 months, or years later somebody else is going to come along and alter the type so that it's no longer safe. Now best practices will hopefully mean that their error is caught, but this is "real life" where you work with all kinds of people, some more careful, and some not, and you can't guarantee that the developer with enough expertise will always be the one assigned to review.

Basically, I'm shocked that such a feature doesn't already exist. It seems much simpler to make a whitelist of allowed POD types (and types made only of those) than to just have a marker type and rely on the developer to ensure such. Maybe I'll have to figure out the governance model for Rust and propose such a Trait to be added.

scottmcm · January 20, 2022, 6:24pm

H2CO3 · January 20, 2022, 6:34pm

Note that your requirements are actually much more stringent than simple POD-ness. (I think the terminology used by bytemuck here is perhaps not the best, but then again, names have to be of a reasonable length, and one can't express everything completely unambiguously in a single name, so I guess that's a reasonable approach.)

In particular requiring packing (and therefore a potential lack of alignment and padding) is something considered pretty exotic and usually not recommended (even though I do appreciate that you are not in control of the definitions here – I'm only commenting on the situation in general, nothing accusatory). This means that the language is unlikely to deeply embed support for such atypical use cases.

Usually, the serialization of in-memory data structures ideally goes through some sort of translation layer that abstracts away platform and implementation differences, so that one doesn't have to rely on the exact in-memory representation of data in one language when used from another language (or OS, or compiler, or hardware, or…). Since memory layout rules in Rust are somewhat complex and for the most part not guaranteed (which is for future compatibility and implementation flexibility), this should be expected, and it should be (and is) the majority scenario.

EriolAncalagon · January 20, 2022, 6:48pm

Thanks @scottmcm. They say the best way to get an answer on the internet is to state the wrong answer, and I feel my statement above about "no way to do this" prompted you.

@H2CO3 your statement of:

Usually, the serialization of in-memory data structures ideally goes through some sort of translation layer that abstracts away platform and implementation differences, so that one doesn't have to rely on the exact in-memory representation of data in one language when used from another language (or OS, or compiler, or hardware, or…)

I think we work in very different industries. What you say is true of anything internet-connected that must deal with different clients. I mainly have worked either in network environments where you control all of the hardware and software, or in embedded next to firmware designers that are all about binary-representation of things.

Trying to get some of these people not to just use binary dumps, but anything else so that humans can read and edit it, is a challenge in itself. Many state directly to my face that it's "more efficient" to use binary configuration files since it's sent down to the hardware eventually, even though making it human readable (and editable) would be fine, and process into the binary form when the application reads it from disk seems weird to them. And yes, it's only a few files per run, this isn't about performance.

And then there are bugs in the "editor" you use to work with these binary files, so that sometimes even what's in binary is not correct, and only another tool to examine it, and/or print it out to human-readable (usually .csv, because so many people love Excel) catches that the Editor has a bug.

This is just kind of an extension of above: you can't assume everybody you work with, or that will work with the code you make will make reasonable decisions about data representation, and sometimes you just have to deal with what was done, commonly. I'm not looking to perpetuate it (and I push against it all the time), but I do have to deal with it.

quinedot · January 20, 2022, 6:55pm

Semantic considerations aside, bytemuck can do "recursively POD" check if you're repr(C).

#[derive(Clone, Copy, Debug, Pod, Zeroable)]
#[repr(C, packed)]
struct MyBigStruct {
    bulk: [u8; BIG_STRUCT_SIZE - mem::size_of::<u32>()],
    crc: u32,
}

It's not uncommon for functionality like this to be in crates. In fact if a proposed new stdlib feature can bake in a crate versus being part of the standard library, that's often the preferred route. One main downside is that there's no list of "officially approved crates" or the like, so discovery is a bit of a problem, especially when starting out.

E.g. serialization is more common than loading binary structures straight from disk... and the standard solution for that (serde) is also it's own (ecosystem of) crate(s).

That said, sometimes things do get moved into the standard library from crates.

Topic		Replies	Views
Read into struct help	20	10409	November 3, 2019
Reading Binary Data From File	7	20865	July 8, 2020
VANILLA rust working with bytes help	6	872	January 12, 2023
Reading structures in memory via pointers help	18	11040	January 22, 2020
Read struct from bytes help	4	1421	March 28, 2024

Reading c-style structures from disk

Related topics