Read into struct

Hey everyone, I'm back with another question if anyone has some spare time. I've been working with reading binary data from a file and am wondering if anyone had any pointers. So far from what I've seen, I've been able to get a working version by using the byteorder crate like this.

#[derive(Debug)]
#[repr(C)]
struct Header {
    name: [u8; 4],
    size: u32,
    reserved: u64,
}

pub fn parse<P: AsRef<Path>>(path: P) -> std::io::Result<()> {
    let file = File::open(path)?;
    let mut reader = BufReader::new(file);
    
    let header = unsafe {
        let name: [u8; 4] = mem::transmute(reader.read_u32::<LittleEndian>()?);
        let size = reader.read_u32::<LittleEndian>()?;
        let reserved = reader.read_u64::<LittleEndian>()?;

        Header { name, size, reserved }
    };
    
    println!("{:?}", header);

    Ok(())
}

After I got that working, I was curious if there was an easy way for me to read fixed data into the struct in a similar way to C with

Header header;
fread(&header, sizeof(header), 1, file);

I've found a few examples online, but they all seem to use pretty advanced techniques like this to manipulate the data.

let dst_ptr = &mut header_buf as *mut Header as *mut u8;
let mut slice = slice::from_raw_parts_mut(dst_ptr, header_size);

I tried playing around with that, but when trying something like that, my [u8; 4] name array would be filled with different values every time I ran the code, so it might have been undefined behavior? Is there a similar way to read sizeof struct amount of data in if it's fixed every time? Any help would be greatly appreciated.

First of all, treat unsafe as an absolute last resort, like an emergency flashlight. In most cases there are safe options that express your intent clearly, without noticeable performance loss after optimization. In safe Rust the compiler provides many safety guarantees that C compilers don't care about, but in unsafe Rust it's your responsibility to uphold all those guarantees by hand. In that sense, unsafe Rust is more dangerous than C/C++.

transmute::<u32, [u8; 4]>() can be replaced with u32::to_{be,le,ne}_bytes(), but I recommend zero-initializing the array first and filling it with Read::read(); the zero-initialization will easily be optimized out anyway.
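
A minimal sketch of that suggestion using only the standard library. The function name read_header and the sample bytes are made up for illustration, and it calls read_exact rather than plain Read::read, since read is allowed to return fewer bytes than requested:

```rust
use std::io::{self, Cursor, Read};

#[derive(Debug, PartialEq)]
struct Header {
    name: [u8; 4],
    size: u32,
    reserved: u64,
}

/// Reads a `Header` field by field, treating the integers as little-endian.
/// (`read_header` is a hypothetical helper, not part of any crate.)
fn read_header<R: Read>(reader: &mut R) -> io::Result<Header> {
    // Zero-initialize the array, then let `read_exact` fill it.
    let mut name = [0u8; 4];
    reader.read_exact(&mut name)?;

    let mut buf4 = [0u8; 4];
    reader.read_exact(&mut buf4)?;
    let size = u32::from_le_bytes(buf4);

    let mut buf8 = [0u8; 8];
    reader.read_exact(&mut buf8)?;
    let reserved = u64::from_le_bytes(buf8);

    Ok(Header { name, size, reserved })
}

fn main() -> io::Result<()> {
    // Stand-in for a file: 16 bytes in the order the struct expects.
    let bytes = [
        b'T', b'E', b'S', b'3', // name
        0x10, 0x00, 0x00, 0x00, // size, little-endian
        0, 0, 0, 0, 0, 0, 0, 0, // reserved
    ];
    let header = read_header(&mut Cursor::new(bytes))?;
    println!("{:?}", header);
    Ok(())
}
```

No unsafe is needed, and the result does not depend on the endianness of the machine running it.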

I see there's no padding between the fields in Header, but I'm not sure whether compilers treat padding-free types specially and relax the strict aliasing rules for them. In any case, it's still possible that some other, seemingly unrelated unsafe code is triggering these UB symptoms, so I can't be sure of anything without your full code.

The reason we avoid supporting that kind of read is that its behaviour is often platform-dependent. For example, if u32 is big-endian on one machine and little-endian on another, the file won't be properly converted.
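
To make the platform dependence concrete, here is a small sketch (the helper name interpretations and the value are just for illustration). The explicit to_le_bytes/to_be_bytes conversions give the same bytes everywhere, while to_ne_bytes mirrors what a raw memory copy would produce on the current machine:

```rust
/// Interprets the same four file bytes as little-endian and as big-endian.
/// (Hypothetical helper for this example.)
fn interpretations(file_bytes: [u8; 4]) -> (u32, u32) {
    (u32::from_le_bytes(file_bytes), u32::from_be_bytes(file_bytes))
}

fn main() {
    let size: u32 = 0x00ab_cdef;

    // Explicit conversions produce the same bytes on every platform.
    assert_eq!(size.to_le_bytes(), [0xef, 0xcd, 0xab, 0x00]);
    assert_eq!(size.to_be_bytes(), [0x00, 0xab, 0xcd, 0xef]);

    // The native representation is what a raw memory copy would write.
    let native = size.to_ne_bytes();
    if cfg!(target_endian = "little") {
        assert_eq!(native, size.to_le_bytes());
    } else {
        assert_eq!(native, size.to_be_bytes());
    }

    // The same bytes read with the wrong assumption give a different number.
    let (le, be) = interpretations([0xef, 0xcd, 0xab, 0x00]);
    println!("read as LE: {:#x}, read as BE: {:#x}", le, be);
}
```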

As Hyeonu also mentioned, I recommend using u32::to_le_bytes to create the array without unsafe. Alternatively you can use serde.

Of course, if you must, it can be done:

use std::mem::{size_of, transmute};
use std::io::{Read, Write, Result};
use std::path::Path;
use std::fs::File;

#[derive(Debug)]
#[repr(C)]
struct Header {
    name: [u8; 4],
    size: u32,
    reserved: u64,
}

fn parse<P: AsRef<Path>>(path: P) -> Result<()> {
    let mut file = File::open(path)?;

    let header: Header = {
        let mut h = [0u8; size_of::<Header>()];

        file.read_exact(&mut h[..])?;

        unsafe { transmute(h) }
    };

    println!("{:?}", header);

    Ok(())
}

fn write<P: AsRef<Path>>(path: P, h: Header) -> Result<()> {
    let mut file = File::create(path)?;

    let bytes: [u8; size_of::<Header>()] = unsafe { transmute(h) };

    file.write_all(&bytes)?;

    Ok(())
}

fn main() {
    let h = Header {
        name: [1,2,3,4],
        size: 0xabcdef,
        reserved: 0x0123456789abcdef,
    };

    write("/tmp/test.dat", h).unwrap();
    parse("/tmp/test.dat").unwrap();
}

My hex-dump utility prints this when given the file:

┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 01 02 03 04 ef cd ab 00 ┊ ef cd ab 89 67 45 23 01 │••••×××0┊××××gE#•│
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

Of course, on a big-endian machine you would get 01 02 03 04 00 ab cd ef 01 23 45 67 89 ab cd ef instead.

Note that this might be undefined behaviour if the type has padding or isn't #[repr(C)].
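
One rough way to check for padding is to compare size_of with the sum of the field sizes. The sketch below does that for Header and for a hypothetical Padded struct, invented here as a counter-example; the sizes in the comments assume a typical target where u32 is 4-byte aligned:

```rust
use std::mem::size_of;

#[allow(dead_code)]
#[repr(C)]
struct Header {
    name: [u8; 4],
    size: u32,
    reserved: u64,
}

// Hypothetical counter-example: on typical targets `value` must be
// 4-byte aligned, so the compiler inserts padding after `tag`.
#[allow(dead_code)]
#[repr(C)]
struct Padded {
    tag: u8,
    value: u32,
}

fn main() {
    // Struct size equals the sum of the field sizes: no padding.
    assert_eq!(size_of::<Header>(), 4 + 4 + 8);

    // Struct is larger than its fields: the extra bytes are padding.
    assert!(size_of::<Padded>() > 1 + 4);

    println!(
        "Header: {} bytes, Padded: {} bytes",
        size_of::<Header>(),
        size_of::<Padded>()
    );
}
```

Padding bytes have no defined value, which is why turning a padded struct into a byte array is where the trouble starts.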

Thanks for all the information so far. From what I see, u32::to_le_bytes will only work for converting a single u32 -> [u8; 4], correct? The end goal is to read large chunks of data from a binary file that follows a flat format and is always stored as little endian. An example of the file format is located here https://en.uesp.net/morrow/tech/mw_esm.txt. The only issue I can see with u32::to_le_bytes is when I get to a field in the struct that's 256 bytes long. I would still need to use some method other than u32::to_le_bytes to convert this, correct?

I really like this example for the cases where the structs have ~20 fields to fill in. The only downside is that it uses unsafe to transmute the buffer into the struct. Is there a way to do this in Rust without the unsafe keyword?

let header: Header = {
    let mut h = [0u8; size_of::<Header>()];

    file.read_exact(&mut h[..])?;

    unsafe { transmute(h) }
};

What if one of the fields has a non-allowed binary value, such as a field that is constrained to never be 0x42? How can the compiler prove that won't occur? Answer: It can't, so you have to take on that responsibility via the unsafe prefix.
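
As an illustration of such a constraint, here is a sketch with a made-up Compression enum whose only valid byte values are 0 and 1. Transmuting the byte 0x42 into it would be undefined behaviour, whereas a checked conversion turns it into an ordinary error:

```rust
use std::convert::TryFrom;

// Hypothetical field type: only the bytes 0 and 1 are valid.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(u8)]
enum Compression {
    None = 0,
    Zlib = 1,
}

impl TryFrom<u8> for Compression {
    type Error = String;

    fn try_from(byte: u8) -> Result<Self, Self::Error> {
        match byte {
            0 => Ok(Compression::None),
            1 => Ok(Compression::Zlib),
            // Any other byte is rejected instead of becoming an invalid enum.
            other => Err(format!("invalid compression byte: {:#04x}", other)),
        }
    }
}

fn main() {
    println!("{:?}", Compression::try_from(1u8));
    println!("{:?}", Compression::try_from(0x42u8));
}
```

Plain integers and byte arrays have no invalid bit patterns, so this only matters for fields like enums, bool, char, or references.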

What about

let mut h = [0u8; 256];
file.read_exact(&mut h[..])?;

So if I'm understanding correctly, you can either read in the entire struct into a buffer and transmute it using unsafe, or read in the individual fields safely and then assign the struct? These two ways of assigning would theoretically return the same struct as long as it has #[repr(C)] and has no padding?

#[repr(C)]
struct Header {
    name: [u8; 4], // Could be any size i.e. [0u8; n]
    size: u32,
    reserved: u64,
}

let header: Header = {
    let mut h = [0u8; mem::size_of::<Header>()];

    reader.read_exact(&mut h)?;

    unsafe { mem::transmute(h) }
};

let header: Header = {
    let mut name = [0u8; 4]; // Could be any size i.e. [0u8; n]
    reader.read_exact(&mut name)?;

    let size = reader.read_u32::<LittleEndian>()?;
    let reserved = reader.read_u64::<LittleEndian>()?;
    // Rest of the fields needed

    Header { name, size, reserved, /* Rest of the fields needed */ }
};

No. For this to be the case, you also need the processor to be little endian.

I strongly recommend just reading each field separately. For super large structs you can use macros to read all the fields.
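
If the single bulk read is the attractive part, a middle ground is to read the whole header into a byte buffer and then decode each field from it safely. This is only a sketch: HEADER_LEN and Header::from_le_bytes are names invented here, and the sample bytes are the ones from the hex dump earlier in the thread:

```rust
use std::convert::TryInto;
use std::io::{self, Cursor, Read};

#[derive(Debug, PartialEq)]
struct Header {
    name: [u8; 4],
    size: u32,
    reserved: u64,
}

// On-disk size of the header: fixed by the file format, not by `size_of`.
const HEADER_LEN: usize = 16;

impl Header {
    /// Decodes a header from exactly `HEADER_LEN` little-endian bytes.
    fn from_le_bytes(buf: &[u8; HEADER_LEN]) -> Header {
        // The ranges have constant lengths, so these `try_into` calls cannot fail.
        Header {
            name: buf[0..4].try_into().unwrap(),
            size: u32::from_le_bytes(buf[4..8].try_into().unwrap()),
            reserved: u64::from_le_bytes(buf[8..16].try_into().unwrap()),
        }
    }
}

fn read_header<R: Read>(reader: &mut R) -> io::Result<Header> {
    let mut buf = [0u8; HEADER_LEN];
    reader.read_exact(&mut buf)?; // one bulk read, similar to `fread`
    Ok(Header::from_le_bytes(&buf))
}

fn main() -> io::Result<()> {
    let bytes = [
        0x01, 0x02, 0x03, 0x04, 0xef, 0xcd, 0xab, 0x00,
        0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01,
    ];
    let header = read_header(&mut Cursor::new(bytes))?;
    println!("{:?}", header);
    Ok(())
}
```

This gives the same Header on little-endian and big-endian processors, and it does not depend on the struct's in-memory layout at all.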

That is correct, provided that there are no Rust-aware constraints on the content of any byte of the structure, including internal inter-field consistency constraints. When such constraints exist, you the programmer have to assume the responsibility of ensuring that the data within the Rust structure meets all such constraints. You can do that with a bulk read followed by unsafe transmute, or by reading the data piecemeal and then creating or updating the structure in such a way that there are never any uninitialized fields, other than MaybeUninit<T> fields, at a time when the structure could be dropped.

Is there documentation, or an example you could point me to that shows using a macro to read all the fields on a large struct?

Well, the biggest example is the crate serde, which does it using a #[derive(Deserialize)] macro, but take a look at this bare bones example.

Playground.

With some help from type inference, here's a macro that works with several types of fields: Playground.
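
For readers without access to the playground, the general idea can be sketched as follows. This is not necessarily what the linked code contains: the ReadLe trait and the read_struct! macro are hypothetical names, and the sketch assumes every field is a little-endian integer or a byte array:

```rust
use std::io::{self, Cursor, Read};

/// Hypothetical helper trait: read one little-endian value from a reader.
trait ReadLe: Sized {
    fn read_le<R: Read>(r: &mut R) -> io::Result<Self>;
}

// Implement `ReadLe` for the integer types via `from_le_bytes`.
macro_rules! impl_read_le_int {
    ($($t:ty),*) => {$(
        impl ReadLe for $t {
            fn read_le<R: Read>(r: &mut R) -> io::Result<Self> {
                let mut buf = [0u8; std::mem::size_of::<$t>()];
                r.read_exact(&mut buf)?;
                Ok(<$t>::from_le_bytes(buf))
            }
        }
    )*};
}
impl_read_le_int!(u8, u16, u32, u64);

// Byte arrays of any length are read verbatim.
impl<const N: usize> ReadLe for [u8; N] {
    fn read_le<R: Read>(r: &mut R) -> io::Result<Self> {
        let mut buf = [0u8; N];
        r.read_exact(&mut buf)?;
        Ok(buf)
    }
}

// Reads the listed fields in order; each field's type is inferred
// from the struct definition, which selects the `ReadLe` impl.
macro_rules! read_struct {
    ($reader:expr => $name:ident { $($field:ident),* $(,)? }) => {
        $name { $($field: ReadLe::read_le($reader)?),* }
    };
}

#[derive(Debug, PartialEq)]
struct Header {
    name: [u8; 4],
    size: u32,
    reserved: u64,
}

fn parse_header<R: Read>(mut r: R) -> io::Result<Header> {
    Ok(read_struct!(&mut r => Header { name, size, reserved }))
}

fn main() -> io::Result<()> {
    let bytes = [
        0x01u8, 0x02, 0x03, 0x04, 0xef, 0xcd, 0xab, 0x00,
        0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01,
    ];
    let header = parse_header(Cursor::new(bytes))?;
    println!("{:?}", header);
    Ok(())
}
```

The fields must be listed in file order, since struct field expressions are evaluated in the order they are written.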

I should have time tonight to look into this more. I'd prefer not to use unsafe if I don't have to. Thanks for helping explain all this and showing examples. I'll see if I can get this working for my scenario and report back.

So I started taking a look at the macro example you gave me, and looking into serde a bit more. While doing that I found a crate called bincode that seems like it does what I'm looking for and uses serde. I was able to get a version working using bincode similar to your example, but without the unsafe keyword.

#[derive(Deserialize, Debug)]
struct Header {
    name: [u8; 4],
    size: u32,
    reserved: u64,
}

let header: Header = {
    let mut encoded = [0u8; mem::size_of::<Header>()];

    reader.read_exact(&mut encoded)?;
    bincode::deserialize(&encoded[..])?
};

Do you have any experience with this? Would this also be a safe viable option since I'm not transmuting anything?

Three great crates should offer features that are useful for what you are looking for:

  • ::bincode

  • ::nom

  • ::zerocopy

    • this one has #[derive(FromBytes)] that should make compilation fail if there is padding or if the struct is not #[repr(C)]

Whenever a pattern becomes cumbersome, try and see if there aren't crates out there that help doing the job. As @Hyeonu pointed out, using unsafe ought to be avoided, and using it for "ergonomics" doesn't feel right.

Yep! Note that you can do this:

let header: Header = bincode::deserialize_from(&mut reader)?;

Then you don't need to know how many bytes it is.

Notice of course that bincode cannot be used to read arbitrary binary streams. It is basically its own serialization format, and so it can really only be used to deserialize whatever you are capable of producing with Serialize. Thankfully, for flat data structures like your Header, this can actually work pretty well because the serialization of integer types and fixed-size arrays is very straightforward. (it also has options for endianness)

I've used nom in the past for reading binary files. It's pretty good. do_parse! in particular is great for reading arbitrary structs. I think this example would look something like (note: not tested):

named!{header<Header>, do_parse!(
    name: map!(u32!(Endianness::Little), |x| x.to_le_bytes()) >>
    size: u32!(Endianness::Little) >>
    reserved: u64!(Endianness::Little) >>
    (Header { name, size, reserved })
)}

Though I do have a number of complaints:

  • the documentation is very sparse and poorly formatted
  • I've always been frustrated by its lack of trailing comma support (often leading to inscrutable error messages when one is accidentally supplied), and the author seems to be against adding it.
  • I was very confused by how nom version 4 works with regards to "incomplete" parsing, and could never write a working parser (I stuck to version 3).

It looks like nom 5 is out now, which may actually address a number of these complaints; it seems to have specifically focused on improving the "complete parser" situation from nom 4. Also, some of the macros are now provided as generic functions, which may improve the situation with error messages and commas.

In nom 5 this could be written

use nom::{combinator::map, number::complete::{le_u32, le_u64}, IResult};

fn header(input: &[u8]) -> IResult<&[u8], Header> {
    // Each parser returns (remaining input, value); `?` propagates parse errors.
    let (input, name) = map(le_u32, |x: u32| x.to_le_bytes())(input)?;
    let (input, size) = le_u32(input)?;
    let (input, reserved) = le_u64(input)?;
    Ok((input, Header { name, size, reserved }))
}

// usage
// let data: [u8; _] = ...;
// let h: Result<Header, _> = header(&data[..]).map(|x| x.1);

(you could use nom::sequence::tuple to not have to duplicate input so many times, but the let way makes it look more like the older do_parse way.)

This is for a non-streaming parse version (i.e., where you expect to have all the data before you call the parser) - if you are streaming this over a connection, you'd want to switch the import to nom::number::streaming instead of nom::number::complete.

On the topic of the docs, I think nom 5 has been a monumental improvement on that front - there are now several really detailed annotated examples, and the API docs are a ton easier to navigate.