Reading Binary Data From File

travis · April 8, 2020, 2:38am

I've been creating some loaders for reading binary files, but I've been going about it pretty naively: I read the entire file into memory, and have an offset. Then, whenever I encounter a sub-structure, I call that sub-struct's ::read() method which will take the current massive buffer and a mutable reference to that offset. That reader will then read from the buffer piece by piece, using byteorder to ensure correct endian-ness on each piece. Each time I read something in, I increment the offset by the amount read from the massive buffer. Here's the source file as reference.

This works, but it's repetitive and requires loading the entire file into memory. Is there a better general approach to reading data? I tried finding some more info online, but I couldn't find a comprehensive resource that goes into enough detail for me to wrap my head around it. I also searched these forums, but most of the posts I found were from 2015, which was around Rust's v1.0 release, and a lot has changed since then.

The next loader will be parsing this archive file that's close to 400MB, and I have to load a 120KB companion file ahead of time as it acts like a directory for that archive. Now, I could do this, and it'll still work. The main purpose of this project is to learn Rust, and get comfortable with it. I have one strategy for loading files, but I want to progress that skill.

There's going to be tons and tons of one-off reads, so it looks like I should wrap my File object in a BufReader, so that it can grab larger chunks of the file ahead of time to vastly cut down on read operations on the storage hardware. I'm not quite sure how I'm supposed to use BufReader to read data though. Is that where Cursor comes in?

Then, there's mapping pieces of the data to structs. For example:

pub struct Header {
    records: u32,
    number_of_files: u32,
    names_table_size: u32,
    archive_full_size: u32,
    pad: [u8; 0x10],
}

How should I go about reading this into memory? Each instance of this struct is 32 bytes (four u32 fields plus 16 bytes of padding), so should I grab a 32-byte slice, and somehow map that to my struct instance?
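One way to do this with only the standard library is sketched below. It reads the 32 on-disk bytes (four u32 fields plus 16 bytes of padding) into a stack array with read_exact and decodes each field explicitly. The little-endian byte order, the read_from name, and the SIZE constant are assumptions for illustration; the thread does not state the real byte order of this format.

```rust
use std::io::{self, Read};

/// Mirrors the `Header` from the post: four u32 fields plus 16 bytes of padding.
#[derive(Debug, PartialEq)]
pub struct Header {
    pub records: u32,
    pub number_of_files: u32,
    pub names_table_size: u32,
    pub archive_full_size: u32,
    pub pad: [u8; 0x10],
}

impl Header {
    /// On-disk size: 4 * 4 bytes of fields + 16 bytes of padding.
    pub const SIZE: usize = 32;

    /// Reads one header from any `Read` source (a `File`, a `BufReader`, a byte slice).
    /// Assumption: the fields are stored little-endian.
    pub fn read_from<R: Read>(reader: &mut R) -> io::Result<Header> {
        let mut raw = [0u8; Header::SIZE];
        // Fails with `UnexpectedEof` if fewer than 32 bytes are available.
        reader.read_exact(&mut raw)?;

        // Hypothetical helper: decode the little-endian u32 that starts at `at`.
        let u32_at =
            |at: usize| u32::from_le_bytes([raw[at], raw[at + 1], raw[at + 2], raw[at + 3]]);

        let mut pad = [0u8; 0x10];
        pad.copy_from_slice(&raw[16..32]);

        Ok(Header {
            records: u32_at(0),
            number_of_files: u32_at(4),
            names_table_size: u32_at(8),
            archive_full_size: u32_at(12),
            pad,
        })
    }
}
```

Because read_from is generic over Read, the same function should work on a BufReader wrapping a File or on a plain byte slice in a test, and no unsafe code or transmute is involved.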

Looking at the format of the binary data, it looks like I can read the entire file sequentially without ever having to seek around the file. I looked at memmap, but I read that it's unsafe. I have no problem using unsafe code, but I'm trying to figure out when to use it. I also looked at transmute which looks quite convenient, but as noted in the docs, it is incredibly unsafe as there are many points of error in the simplest of structs, and this isn't C after all.

zicklag · April 8, 2020, 3:01am

Because BufReader implements Read, you can use it just like you would the underlying reader, such as a file. BufReader additionally implements BufRead, which means that you get extra functions such as read_line that aren't implemented by std::fs::File.

Cursor doesn't come into play here. The point of Cursor is to implement Seek on something that otherwise doesn't, such as an in-memory Vec.
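To make the Cursor point concrete, here is a small sketch that wraps an in-memory Vec in a Cursor so it can be handed to code expecting Read + Seek. The read_u32_at helper and the sample bytes are made up for illustration.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Reads a little-endian u32 at an absolute position in any seekable reader.
/// `read_u32_at` is a hypothetical helper, not part of the standard library.
fn read_u32_at<R: Read + Seek>(reader: &mut R, pos: u64) -> io::Result<u32> {
    reader.seek(SeekFrom::Start(pos))?;
    let mut raw = [0u8; 4];
    reader.read_exact(&mut raw)?;
    Ok(u32::from_le_bytes(raw))
}

/// Wraps an in-memory buffer in a `Cursor` and reads from the middle of it.
fn example() -> io::Result<(u32, u64)> {
    let bytes: Vec<u8> = vec![0xAA, 0xBB, 0x01, 0x00, 0x00, 0x00, 0xCC];
    let mut cursor = Cursor::new(bytes);
    let value = read_u32_at(&mut cursor, 2)?;
    // The cursor tracks its own position, so no manual offset bookkeeping is needed.
    Ok((value, cursor.position()))
}
```

A File already implements Seek, so a Cursor is mostly useful for buffers that are already in memory, for example in tests.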

I don't have time to make a full reply at the moment, but my first search yields a crate that might help: structview.

travis · April 8, 2020, 4:20am

Thanks, @zicklag, I'll check out structview.

Speaking of the Read and Seek traits, should my structs be implementing those? It seems like my current paradigm is already doing that. The Read trait only has a read() method that needs to be implemented. The thing is, if I made my structs conform to Read, what uses these readers? Do I just instantiate one and call header_reader.read() directly? If so, it almost feels like it defeats the whole purpose of using a trait if there's no higher-level processor type that can take reader instances and return a deserialized version of the data.

For example, here's my current implementation of reading the file header from TIM2 files:

const IDENT: u32 = 0x54494d32;

#[derive(Debug)]
struct Header {
	identifier: u32,
	version: u16,
	count: usize,
}

impl Header {
	fn read(buffer: &[u8], offset: &mut usize) -> Result<Header, Error> {
		let mut load_part = |size| { get_slice(&buffer, offset, size) };

		let identifier = BigEndian::read_u32(load_part(4));
		let version = LittleEndian::read_u16(load_part(2));
		let count = LittleEndian::read_u16(load_part(2)) as usize;

		load_part(8);
		if identifier != IDENT {
			return Err(Error::InvalidIdentifier(identifier))
		}

		Ok(Header { identifier, version, count })
	}
}

#[derive(Debug)]
pub struct Image {
	header: Header,
	frames: Vec<Frame>,
}

impl Image {
	fn read(buffer: &[u8], offset: &mut usize) -> Result<Image, Error> {
		let header = Header::read(buffer, offset)?;
		let mut frames = Vec::with_capacity(header.count);

		for _ in 0..header.count {
			frames.push(Frame::read(buffer, offset)?);
		}

		Ok(Image { header, frames })
	}

	pub fn frames(&self) -> &[Frame] {
		&self.frames
	}

	pub fn get_frame(&self, index: usize) -> &Frame {
		&self.frames[index]
	}
}

pub fn load<P: AsRef<Path>>(path: P) -> Result<Image, Error> {
	let mut offset = 0usize;
	let mut buffer = Vec::new();
	let mut file = File::open(path)?;

	file.read_to_end(&mut buffer)?;
	Image::read(&buffer, &mut offset)
}

The Header type has a read() method, and it takes a buffer as input just like the Read trait requires. There are a few differences, but the biggest one is that my read() takes a mutable offset as an input, so I can increment it from within the method. This is unfortunate, seeing that this is a struct-level method and not an instance-level one, yet it's impure. My version also returns an instance of the Header itself instead of the number of bytes read. I'm not sure how I'd address this with the Read trait. Maybe Header shouldn't be responsible for deserializing itself? Maybe I should make a reader type? Still, how does that work in practice? Like this, maybe?

struct HeaderReader {
	result: Header, // assume Copy + Clone have been implemented
}

impl HeaderReader {
	pub fn result(&self) -> Header {
		self.result
	}
}

impl Read for HeaderReader {
	fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
		let mut offset = 0usize; // local position within `buf`
		let mut load_part = |size| { get_slice(&buf, &mut offset, size) };

		let identifier = BigEndian::read_u32(load_part(4));
		let version = LittleEndian::read_u16(load_part(2));
		let count = LittleEndian::read_u16(load_part(2)) as usize;

		load_part(8);
		if identifier != IDENT {
			return Err(std::io::Error::from(std::io::ErrorKind::InvalidData))
		}

		self.result = Header { identifier, version, count };
		Ok(16)
	}
}

pub fn load<P: AsRef<Path>>(path: P) -> Result<Header, std::io::Error> {
	let mut offset = 0usize;
	let mut buffer = Vec::new();
	let mut file = File::open(path)?;
	let mut header_reader = HeaderReader {
		result: Header::new(), // assume a `new()` constructor exists
	};

	file.read_to_end(&mut buffer)?;
	offset += header_reader.read(&mut buffer)?;

	Ok(header_reader.result())
}

H2CO3 · April 8, 2020, 6:07am

It looks to me like what you really need is a proper database, indexed by whatever substructures you happen to be interested in extracting.

travis · April 8, 2020, 6:16am

To give a bit of context: the files I'm parsing are asset archives from the PSP version of Final Fantasy IV. These two files I'm parsing are effectively a database in a storage sense. The massive 400MB file is basically a database where each record is a binary blob. Each binary blob is a self-contained file, and doesn't contain any metadata about the record. The directory file, on the other hand, holds all of the metadata for each record.

What I'm building is an extractor. I want to parse the directory file in its entirety, and then use it to iterate through every record, and use that data to slice each file out of main archive, and copy them into their own files.
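The slicing step of such an extractor could look roughly like the sketch below. The Entry type with an offset and a size is purely hypothetical, since the real directory layout is not described here. The idea is only that seeking plus a size-limited sub-reader lets each record be copied out without loading the whole archive.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Hypothetical directory record: where a blob starts in the archive and how long it is.
/// The real directory format is not described in the thread.
pub struct Entry {
    pub offset: u64,
    pub size: u64,
}

/// Copies one record out of `archive` into `out` without loading the whole archive.
pub fn extract<A: Read + Seek, W: Write>(
    archive: &mut A,
    entry: &Entry,
    out: &mut W,
) -> io::Result<()> {
    archive.seek(SeekFrom::Start(entry.offset))?;
    // `by_ref` borrows the archive; `take` caps the sub-reader at `entry.size` bytes.
    let copied = io::copy(&mut archive.by_ref().take(entry.size), out)?;
    if copied != entry.size {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "record runs past the end of the archive",
        ));
    }
    Ok(())
}
```

In practice archive would probably be a File or a BufReader around one, and out a freshly created File per record. If the records are stored back to back in directory order, the seek becomes unnecessary and the archive can be read strictly sequentially.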

zicklag · April 8, 2020, 11:49pm

I'm not a super smart Rust expert yet, but here are my thoughts.

OK, I skimmed your code, and right off it seems pretty well done. I think the way you're extracting the binary data with your load_part closure is good enough, and there isn't necessarily a super compelling reason to use something like structview unless you like the nicer code that structview would let you write (which is arguably easier to read and write).

As for whether your structs should implement Read and Seek: no, I don't think so. While they do have a read function, its purpose in the Read trait is different from its purpose in your structs, and I agree that the trait wouldn't help you much anyway.

The purpose of the Read trait is to be implemented by things that can return a byte stream. In the case of your read function, though, you actually want to take bytes and return a Rust struct such as your Header type.

Here's an example of how I would imagine doing it: (playground)

#![allow(unused)]

use std::error::Error;

use byteorder::{ByteOrder, LittleEndian};

use std::fs::OpenOptions;
use std::io::BufReader;
use std::io::prelude::*;

fn main() -> Result<(), Box<dyn Error>> {
    // Let us pretend this is your data file
    let file = OpenOptions::new().read(true).open("/etc/passwd")?;
    
    // `File` implements `Read`, and because we have imported `std::io::prelude::*`
    // the `Read` trait is in scope, so we can use all of its methods.
    
    // Now we wrap the file in a `BufReader`. A `BufReader` can wrap anything that implements
    // `Read`. This takes ownership of our reader, and we now use the `BufReader` whenever
    // we want to get data from the file. Reads will automatically be buffered!
    let mut buf_reader = BufReader::new(file);
    
    // Now let's pretend that one of the data structures you are loading looks like this:
    #[derive(Debug)]
    struct Header {
        total_size: u32,
        palette_size: u32,
    }
    
    // And we implement your `read` function, which I renamed to `load` for clarity
    // We want to load the `Header` from the buffered bytes in our file.
    impl Header {
        // We want load to take any type that implements `Read`. That means that we
        // can pass in our `BufReader` because `BufReader` implements `Read`.
        fn load<T: Read>(reader: &mut T) -> Header {
            // Create our utility closure
            let mut load_part = |size| {
                // Create a buffer to load data into.
                // Because we are dynamically
                // determining the size based on the `size` we can't use an array
                // and must use a Vector which will allocate on the heap and be
                // slower. We could get around this by using a macro
                // instead of a closure which would expand to static code at compile
                // time. ( I can show you how to do that if you want )
                let mut buf = Vec::with_capacity(size);
                
                // Get a reader for the next `size` bytes. `by_ref` borrows the
                // reader instead of moving it, so the closure can be called again.
                let mut part_reader = reader.by_ref().take(size as u64);
                
                // Read the part into the buffer
                part_reader.read_to_end(&mut buf).unwrap();
                
                // Return the buffer
                buf
            };
            
            // Now we construct our header and return it
            Header {
                total_size: LittleEndian::read_u32(&load_part(4)),
                palette_size: LittleEndian::read_u32(&load_part(4)),
            }
        }
    }
    
    // Load our header from our reader
    dbg!(Header::load(&mut buf_reader));
    
    Ok(())
}

With this setup you don't load the whole file into memory and all reads are buffered. It does allocate a vector on the heap for every read, which is not ideal and could be avoided by using a macro instead of a closure for the load_part utility ( playground ):

    // And we implement your `read` function, which I renamed to `load` for clarity
    // We want to load the `Header` from the buffered bytes in our file.
    impl Header {
        // We want load to take any type that implements `Read`. That means that we
        // can pass in our `BufReader` because `BufReader` implements `Read`.
        fn load<T: Read>(reader: &mut T) -> Header {
            // Create a helper macro for loading an array of the given size from
            // the reader.
            macro_rules! load_part {
                // Take an argument `size` which is a literal
                ($size:literal) => {
                    // The whole body goes into a scope so that it is a valid
                    // expression when the macro gets expanded.
                    {
                        // Create a buffer array of the given size. This works because `$size`
                        // gets expanded in this code at compile time as a literal number.
                        let mut buf = [0u8; $size];
                        
                        // Read into the buffer
                        reader.read_exact(&mut buf).unwrap();
                        
                        // The buffer
                        buf
                    }
                }
            }
            
            // Now we construct our header and return it
            Header {
                total_size: LittleEndian::read_u32(&load_part!(4)),
                palette_size: LittleEndian::read_u32(&load_part!(4)),
            }
        }
    }

Macros can be confusing especially at first. If you have any questions about how that works I could explain more.
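For anyone who would rather avoid the macro, a generic helper can give the same allocation-free reads. This is only a sketch: read_array and read_u32_le are hypothetical names, the two-field Header is the pretend struct from the example above, and the const generic parameter needs Rust 1.51 or newer. It uses only the standard library and returns io::Result instead of calling unwrap.

```rust
use std::io::{self, Read};

/// Reads exactly `N` bytes into a stack array; no heap allocation, no macro.
/// Hypothetical helper; needs const generics (Rust 1.51 or newer).
fn read_array<R: Read, const N: usize>(reader: &mut R) -> io::Result<[u8; N]> {
    let mut buf = [0u8; N];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

/// Reads one little-endian u32 from the reader.
fn read_u32_le<R: Read>(reader: &mut R) -> io::Result<u32> {
    Ok(u32::from_le_bytes(read_array::<R, 4>(reader)?))
}

#[derive(Debug, PartialEq)]
struct Header {
    total_size: u32,
    palette_size: u32,
}

impl Header {
    /// Loads the header, propagating I/O errors to the caller.
    fn load<R: Read>(reader: &mut R) -> io::Result<Header> {
        Ok(Header {
            // Fields are read in the order they are written here.
            total_size: read_u32_le(reader)?,
            palette_size: read_u32_le(reader)?,
        })
    }
}
```

The trade-off compared with the macro is mostly taste: the helper is an ordinary function that can be reused across loaders, and a short read surfaces as an error rather than a panic.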

travis · April 9, 2020, 5:34am

Woah, that's an excellent example. Thanks, @zicklag! One of the things that tripped me up was that I wasn't sure how to read the next chunk of bytes that I'd immediately need from the BufReader, and it looks like the answer is to use its take() method to create a sub-reader, then read the bytes out of that sub-reader. I wasn't sure if I was to use take(), read_exact(), or some other method.

From the few examples I could find online, it seemed like people were using nested readers to get nested data, which makes a lot of sense. The docs were saying that it "passes an underlying reader" or something like that, but the reader shouldn't be used directly or something. I wasn't sure what that meant exactly.
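A small sketch of the nested-reader idea may help here. A sub-reader made with by_ref().take(n) borrows the parent, reads at most n bytes, and once it goes out of scope the parent carries on from where the sub-reader stopped. The length-prefixed layout and the function name below are invented for illustration.

```rust
use std::io::{self, Read};

/// Reads a length-prefixed chunk, then one trailing byte from the same reader.
/// The layout (u8 length, payload, u8 tag) is made up for illustration.
fn read_chunk_then_tag<R: Read>(reader: &mut R) -> io::Result<(Vec<u8>, u8)> {
    let mut len = [0u8; 1];
    reader.read_exact(&mut len)?;

    let mut payload = Vec::new();
    {
        // `by_ref` lends the reader to the sub-reader; `take` stops it after `len` bytes.
        let mut sub = reader.by_ref().take(u64::from(len[0]));
        sub.read_to_end(&mut payload)?;
    } // the borrow ends here, so `reader` is usable again

    // The parent reader continues right after the bytes the sub-reader consumed.
    let mut tag = [0u8; 1];
    reader.read_exact(&mut tag)?;
    Ok((payload, tag[0]))
}
```

The warning in the docs is most likely the one on BufReader about reading directly from the wrapped reader: bytes that are already sitting in the buffer would be skipped. Reading through the BufReader itself, or through sub-readers created from it as above, should avoid that problem.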

Thanks for the macro example as well. I've been putting off learning macros because they're a pretty beastly subject. Useful, but I can see myself abusing it easily when there are so many more fundamental things in Rust that I'm not comfortable in yet. Like generics... for some reason those are tripping me up, and I used to be really good with them in C# or C++ templates.

This looks awesome though. I think this does it for me.

system · July 8, 2020, 5:34am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
