Deserializing a .dat binary file created in CPP

Luckily, Serde isn't the answer here. Serde is useful when you have a well-defined format like JSON which has things like "objects", "arrays", "strings", etc.

Instead, it sounds like the tool writes the bytes of your spkr struct directly into the file as-is, with no format per se. Then, to "deserialize" the data, it reads the file's contents into memory and blindly assumes that memory now contains a spkr struct.

If you want a quick and dirty solution, we can do the same in Rust. Here is a link to the full code on the playground, but I'll go through it step by step so you understand enough to make tweaks and extensions.

Make a Speaker Struct

First define an equivalent struct and use #[repr(C)] to tell the compiler to lay it out like C would.

// Note: We get this from the Windows documentation.
// https://docs.microsoft.com/en-us/windows/win32/winprog/windows-data-types#word
type WORD = u16;

#[derive(Debug)]
#[repr(C)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    tks: [u8; 32],
    tfs: [u8; 32],
    /// b1 elder; b2 ms; b5 suspended;
    ems: u8,
    asn: u8,
    cng: WORD,
    num: WORD,
    xtp: WORD,
    blk: u8,
    spr: u8,
    spare: [u8; 46],
}

Blindly Assuming the File Contains a Speaker

Next we can do the actual "deserializing". You can do this by reading the file's contents into memory and using an unsafe transmute to tell the compiler to reinterpret the bytes as a Speaker.

use std::{io::{Read, Error}, fs::File};

fn main() -> Result<(), Error> {
    let mut f = File::open("my_speaker.dat")?;
    let mut buffer = [0_u8; std::mem::size_of::<Speaker>()];

    // read_exact() fails if the file holds fewer bytes than one Speaker,
    // so we never transmute a partially filled buffer.
    f.read_exact(&mut buffer)?;

    // Safety: See below
    let speaker: Speaker = unsafe { std::mem::transmute(buffer) };
    println!("Speaker: {:?}", speaker);

    Ok(())
}

Note that there are several assumptions we are making here:

  • The file was generated by our C++ tool and the tool did everything correctly
  • All fields in Speaker are valid for any possible bit pattern (i.e. the 40 bytes in the addr1 field can take on any value and it still leads to a valid [u8; 40], potentially an unintelligible string, but still a valid byte array)

As the wording of this section implies, blindly reinterpreting bytes is normally quite frowned upon because it can lead to your application breaking in lots of weird and wonderful ways. In this case I'd say it's okay, though, because:

  1. The C++ tool sounds like it won't be changing any time soon so if our code is correct now it should keep working
  2. You'll be running this in a trusted environment and have full control of all the input
  3. The worst that can happen is your app crashes and you are unable to read it in a Rust program so you'll need to use other methods to extract the data. Nobody dies, your customer database won't be hacked, and nothing bad happens other than a spot of inconvenience

Making Things Convenient

Your next big hurdle will be interpreting the fields as useful string types. Printing a byte array will just show the numeric values for each of the bytes, so let's give Speaker some helper methods which will give us access to the various fields as strings if they are valid UTF8.

impl Speaker {
    pub fn address_line_1(&self) -> Option<&str> {
        std::str::from_utf8(&self.addr1).ok()
    }
    
    pub fn name(&self) -> Option<(&str, &str)> {
        let [first_name, last_name] = &self.name;
        
        let first_name = std::str::from_utf8(first_name).ok()?;
        let last_name = std::str::from_utf8(last_name).ok()?;
        
        Some((first_name, last_name))
    }
}

(the rest of the helper methods are left as an exercise for the reader)

Note that std::str::from_utf8() returns a Result<&str, Utf8Error> where the Utf8Error contains extra information about where it first encountered invalid UTF8 in the byte array. We don't care about that and only want to know whether the bytes are valid UTF8, so we use the ok() method to convert to an Option<&str>. We then use the ? operator to say "if this Option is None, return None, otherwise extract the wrapped value so it can be assigned to first_name".
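One thing to watch out for: from_utf8() converts the whole 40-byte buffer, so any trailing NUL padding ends up in the string. Here is a rough sketch of a helper that stops at the first NUL byte. The name c_str_field is made up for this example, and it assumes the C++ tool stores ordinary NUL-terminated C strings in these fields, which I can't confirm from the struct definition alone.

```rust
/// Interpret a fixed-size C buffer as a string slice.
/// Assumption: the field holds a NUL-terminated C string (hypothetical helper,
/// not part of the original code).
pub fn c_str_field(bytes: &[u8]) -> Option<&str> {
    // Stop at the first NUL byte, or use the whole buffer if there is none.
    let end = bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len());
    std::str::from_utf8(&bytes[..end]).ok()
}

fn main() {
    // Made-up sample data standing in for the addr1 field.
    let mut addr1 = [0_u8; 40];
    addr1[..11].copy_from_slice(b"1 Main Road");

    println!("{:?}", c_str_field(&addr1));
}
```

Inside impl Speaker you could then write address_line_1() as c_str_field(&self.addr1).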

Packing

There is this concept called "packing" which is quite important in telling the compiler how to lay out our Speaker and spkr structs in memory. Because we are interpreting a bunch of bytes as a Speaker, it is very important that Speaker and spkr are laid out identically, otherwise we'll get garbage.

You see, processors really like it when things are lined up in memory correctly and they often need to do extra work when they aren't lined up (which kills performance) or will just error out altogether (which kills your program). For example, a u8 can be placed at addresses that are multiples of 1 byte (i.e. anywhere), a u32 can be placed at multiples of 4 bytes, and so on.

To deal with this alignment issue, compilers insert padding between fields to make sure they line up correctly. The #[repr(C)] attribute tells the Rust compiler to use the same field order and padding rules a C compiler would. If we want to tell the compiler not to insert this padding we can add packed, as in #[repr(C, packed)], to say "this struct's bytes must be packed together as closely as possible".
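To make the padding visible, here is a small sketch with two made-up structs (Padded and Packed are purely illustrative and have nothing to do with the .dat format). It uses #[repr(C, packed)] rather than packed on its own so the fields stay in declaration order.

```rust
use std::mem::{align_of, size_of};

// Hypothetical structs for illustration only.
#[allow(dead_code)]
#[repr(C)]
struct Padded {
    tag: u8,
    // The compiler inserts 3 bytes of padding before this field
    // so it starts at a multiple of 4.
    value: u32,
}

#[allow(dead_code)]
#[repr(C, packed)]
struct Packed {
    tag: u8,
    // No padding: this field starts at offset 1.
    value: u32,
}

fn main() {
    println!("Padded: size {}, align {}", size_of::<Padded>(), align_of::<Padded>());
    println!("Packed: size {}, align {}", size_of::<Packed>(), align_of::<Packed>());
}
```

On the usual desktop targets, where u32 has an alignment of 4, Padded takes 8 bytes and Packed takes 5.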

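We get lucky here because the cng field is at offset 202, which is a multiple of 2 bytes, so no padding is needed and the padded and packed layouts are identical. Otherwise, if the C++ tool wrote the struct without padding, we would need to use #[repr(C, packed)] instead of #[repr(C)].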
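If you want the compiler to back up that arithmetic, something like the following sketch can go in a test. The Speaker definition is copied from above; cng_offset is a hypothetical helper written for this check.

```rust
type WORD = u16;

#[allow(dead_code)]
#[repr(C)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    tks: [u8; 32],
    tfs: [u8; 32],
    ems: u8,
    asn: u8,
    cng: WORD,
    num: WORD,
    xtp: WORD,
    blk: u8,
    spr: u8,
    spare: [u8; 46],
}

/// Byte offset of the cng field from the start of the struct.
fn cng_offset() -> usize {
    // Safety: every field is an integer or an array of integers,
    // so an all-zero Speaker is a valid value.
    let s: Speaker = unsafe { std::mem::zeroed() };
    let base = &s as *const Speaker as usize;
    let field = std::ptr::addr_of!(s.cng) as usize;
    field - base
}

fn main() {
    // 200 bytes of text fields + 2 flag bytes + 3 WORDs + 2 bytes + 46 spare = 256.
    assert_eq!(std::mem::size_of::<Speaker>(), 256);
    assert_eq!(std::mem::align_of::<Speaker>(), 2);
    assert_eq!(cng_offset(), 202);
    println!("layout matches the hand-computed offsets");
}
```

If the file on disk is not a multiple of 256 bytes, that is a hint the C++ side uses a different layout from the one assumed here.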

Better Alternatives

There are a lot of better alternatives out there. If I was doing this as part of a commercial product where I didn't have full control over the input I would definitely reach for a better tool.

In this case, probably a parsing library like nom.
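I won't reproduce nom's API from memory, but the idea can be sketched with just the standard library: walk the bytes field by field and return None when something doesn't fit, with no unsafe involved. Everything here (ParsedSpeaker, take, text, u16_le, parse_speaker) is a made-up name for illustration, only a few fields are extracted, and it assumes the WORD fields are little-endian, as they would be if the tool runs on x86 Windows.

```rust
/// Hypothetical safe counterpart holding just a few of the fields.
#[derive(Debug, PartialEq)]
pub struct ParsedSpeaker {
    pub first_name: String,
    pub last_name: String,
    pub ems: u8,
    pub cng: u16,
}

/// Take `n` bytes off the front of `input`, or None if there are not enough.
fn take<'a>(input: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let slice: &'a [u8] = *input;
    if slice.len() < n {
        return None;
    }
    let (head, tail) = slice.split_at(n);
    *input = tail;
    Some(head)
}

/// Text up to the first NUL byte (assumes NUL-terminated C strings).
fn text(bytes: &[u8]) -> Option<String> {
    let end = bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len());
    std::str::from_utf8(&bytes[..end]).ok().map(str::to_owned)
}

/// Read a 16-bit value (assumes little-endian byte order).
fn u16_le(input: &mut &[u8]) -> Option<u16> {
    let b = take(input, 2)?;
    Some(u16::from_le_bytes([b[0], b[1]]))
}

pub fn parse_speaker(mut input: &[u8]) -> Option<ParsedSpeaker> {
    let first_name = text(take(&mut input, 20)?)?;
    let last_name = text(take(&mut input, 20)?)?;
    // Skip addr1, addr2, phone, tks and tfs (40 + 40 + 16 + 32 + 32 bytes).
    take(&mut input, 160)?;
    let ems = take(&mut input, 1)?[0];
    let _asn = take(&mut input, 1)?[0];
    let cng = u16_le(&mut input)?;
    Some(ParsedSpeaker { first_name, last_name, ems, cng })
}

fn main() {
    // Made-up record standing in for the contents of a .dat file.
    let mut record = [0_u8; 256];
    record[..3].copy_from_slice(b"Ada");
    record[20..28].copy_from_slice(b"Lovelace");
    record[202..204].copy_from_slice(&7_u16.to_le_bytes());

    println!("{:?}", parse_speaker(&record));

    // A truncated file yields None instead of being reinterpreted blindly.
    assert!(parse_speaker(&record[..100]).is_none());
}
```

The trade-off is more code than the transmute version, but a short or corrupt file becomes a None you can handle rather than an assumption baked into an unsafe block.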
