Deserializing a .dat binary file created in CPP

Hello everyone,

As the title implies, I'm trying to deserialize a binary data file which was originally created with Microsoft Visual C++.

Background information

I schedule speakers every week for an organization. I use a program that doesn't have all of the functionality that I would like it to have. The program is written with Visual Studio C++ in a Windows environment. I'm on Linux and run the program via Wine.

The program outputs all the data to a binary file instead of using a database or JSON.

Given that I have a lot of history locked into the data files, I don't want to just walk away from it without the data if I can do something about it.

I'm not a professional programmer, more of a weekend hobbyist.

What I want to achieve

I want to be able to deserialize the data stored within the .dat binary files so that I can create additional tools that use the data stored in the files. For example, a tool to automatically email speakers when their arrangements approach would be nice.

The .dat files

As mentioned before, the program was made with Visual Studio C++. I've contacted the author of the program and asked if he would be willing to write a way to export the data. He said a feature like that is not planned but was kind enough to give me the structure of each data file (there are a few).

Example of the structure of the speaker data file

speakers.dat
struct spkr {
    char name[2][20];
    char addr1[40];
    char addr2[40];
    char phone[16];
    char tks[32];
    char fts[32];
    char ems; // b1 elder; b2 ms; b5 suspended;
    char asn;
    WORD cng;
    WORD num;   
    WORD xtp;
    char blk;
    char spr;   
    char spare[46];
};

An example of what it looks like when I run $ strings on the speaker .dat file:

Aaron
James
Chris
White
555-555-5555
Dave
Roberts
Chris
Rockefeller
555-555-5555
[...]

Types

Using A Guide to Porting C/C++ to Rust as a reference.

The CPP char type is like Rust's i8 or u8, and strings are made from arrays of chars.

Not sure what a WORD is.

My question to you all

I have searched around and I've seen a lot of potential solutions ranging from #[repr(C)], bindgen, cxx, bincode, and so on.

A lot of conversation is dedicated to using C/Cpp libraries from within Rust but that's not my situation. I only want to deserialize a binary file that was made with a C++ application, given that I have the structure used in its creation.

Armed with the information above, how would you tackle this problem?

I'm going to laugh if Serde is the answer...

If the file contents in their entirety look like either 1 instance of the struct you provided, or a sequence of them, and the values in the char arrays are all valid UTF-8, then indeed serde would be the first thing I'd try.

In fact I'd only move on from that if either serde somehow couldn't do the job, or was too slow for this particular use case.

When I run $ strings on the data file I get back something like this (I've edited the output to remove personal ID info):

Aaron
James
Chris
White
555-555-5555
Dave
Roberts
Chris
Rockefeller
555-555-5555
[...]

I'm guessing these are the chars...

Luckily, it's not the answer. Serde is useful when you have a well-defined format like JSON which has things like "objects", "arrays", "strings", etc.

Instead, it sounds like the tool writes the bytes of your spkr struct directly into the file as-is with no format per-se. Then to "deserialize" the data it will read the file's contents into memory and blindly assume that memory now contains a spkr struct.

If you want a quick and dirty solution, we can do the same in Rust. Here is a link to the full code on the playground, but I'll go through it step by step so you understand enough to make tweaks and extensions.

Make a Speaker Struct

First define an equivalent struct and use #[repr(C)] to tell the compiler to lay it out like C would.

// Note: We get this from the Windows documentation.
// https://docs.microsoft.com/en-us/windows/win32/winprog/windows-data-types#word
type WORD = u16;

#[derive(Debug)]
#[repr(C)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    tks: [u8; 32],
    tfs: [u8; 32],
    /// b1 elder; b2 ms; b5 suspended;
    ems: u8,
    asn: u8,
    cng: WORD,
    num: WORD,
    xtp: WORD,
    blk: u8,
    spr: u8,
    spare: [u8; 46],
}
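The comment on ems ("b1 elder; b2 ms; b5 suspended") suggests that field is a set of bit flags. Below is a minimal sketch of testing individual bits. The helper name bit_is_set is made up for illustration, and whether the author's "b1" means the lowest bit or the bit at index 1 is not stated anywhere, so the helper simply takes a zero-based index that you would need to check against records whose status you already know.

```rust
/// Returns true if the zero-based `bit` (0 = least significant) is set in `flags`.
/// Hypothetical helper: how the labels "b1", "b2" and "b5" map to bit
/// indices is an assumption that must be verified against known records.
fn bit_is_set(flags: u8, bit: u8) -> bool {
    // The range check runs first, so the shift can never overflow.
    bit < 8 && (flags >> bit) & 1 == 1
}

fn main() {
    let ems: u8 = 2; // example value, binary 0000_0010
    for bit in 0..8 {
        println!("bit {}: {}", bit, bit_is_set(ems, bit));
    }
}
```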

Blindly Assuming the File Contains a Speaker

Next we can do the actual "deserializing". You can do this by reading the file's contents into memory and using an unsafe transmute to tell the compiler to reinterpret the bytes as a Speaker.

use std::{fs::File, io::{Error, Read}};

fn main() -> Result<(), Error> {
    let mut f = File::open("my_speaker.dat")?;
    let mut buffer = [0_u8; std::mem::size_of::<Speaker>()];

    // read_exact() returns an error if the file holds fewer bytes than
    // one Speaker, so a short read can't go unnoticed.
    f.read_exact(&mut buffer)?;

    // Safety: See below
    let speaker: Speaker = unsafe { std::mem::transmute(buffer) };
    println!("Speaker: {:?}", speaker);

    Ok(())
}

Note that there are several assumptions we are making here:

  • The file was generated by our C++ tool and that the tool did everything correctly
  • All fields in Speaker are valid for any possible bit pattern (i.e. the 40 bytes in the addr1 field can take on any value and it still leads to a valid [u8; 40], potentially an unintelligible string, but still a valid byte array)

As the wording of this section implies, blindly reinterpreting bytes is normally quite frowned upon because it can lead to your application breaking in lots of weird and wonderful ways. In this case I'd say it's okay though because

  1. The C++ tool sounds like it won't be changing any time soon so if our code is correct now it should keep working
  2. You'll be running this in a trusted environment and have full control of all the input
  3. The worst that can happen is your app crashes and you are unable to read it in a Rust program so you'll need to use other methods to extract the data. Nobody dies, your customer database won't be hacked, and nothing bad happens other than a spot of inconvenience

Making Things Convenient

Your next big hurdle will be interpreting the fields as useful string types. Printing a byte array will just show the numeric values for each of the bytes, so let's give Speaker some helper methods which will give us access to the various fields as strings if they are valid UTF8.

impl Speaker {
    pub fn address_line_1(&self) -> Option<&str> {
        std::str::from_utf8(&self.addr1).ok()
    }
    
    pub fn name(&self) -> Option<(&str, &str)> {
        let [first_name, last_name] = &self.name;
        
        let first_name = std::str::from_utf8(first_name).ok()?;
        let last_name = std::str::from_utf8(last_name).ok()?;
        
        Some((first_name, last_name))
    }
}

(the rest of the helper methods are left as an exercise for the reader)

Note that std::str::from_utf8() returns a Result<&str, Utf8Error> where the Utf8Error contains extra information about where it first encountered invalid UTF8 in the byte array. We don't care about that and only want to know whether the bytes are valid UTF8, so we use the ok() method to convert to an Option<&str>. We then use the ? operator to say "if this Option is None, return None, otherwise extract the wrapped value so it can be assigned to first_name".

Packing

There is this concept called "packing" which is quite important in telling the compiler how to lay out our Speaker and spkr structs in memory. Because we are interpreting a bunch of bytes as a Speaker, it is very important that Speaker and spkr are laid out identically, otherwise we'll get garbage.

You see, processors really like it when things are lined up in memory correctly and they often need to do extra work when they aren't lined up (which kills performance) or will just error out altogether (which kills your program). For example, a u8 can be placed at addresses that are multiples of 1 byte (i.e. anywhere), a u32 can be placed at multiples of 4 bytes, and so on.

To deal with this alignment issue, compilers will insert padding between fields to make sure they line up correctly. The #[repr(C)] attribute tells Rust to keep the fields in declaration order and apply the same padding rules a C compiler would. If we want to tell the compiler not to insert this padding we can add packed, as in #[repr(C, packed)], to tell the compiler "this struct's bytes must be packed together as closely as possible".
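To make the effect of padding concrete, here is a small sketch comparing two otherwise identical structs. The names Padded and Packed are invented for the example and have nothing to do with the speaker file. On common platforms, where u32 is 4-byte aligned, the first reports 8 bytes and the second 5.

```rust
use std::mem::size_of;

// C layout with default alignment: the compiler adds 3 bytes of padding
// after `a` so that `b` starts on a 4-byte boundary.
#[allow(dead_code)]
#[repr(C)]
struct Padded {
    a: u8,
    b: u32,
}

// C field order, but packed: no padding between the fields.
#[allow(dead_code)]
#[repr(C, packed)]
struct Packed {
    a: u8,
    b: u32,
}

fn main() {
    println!("Padded: {} bytes", size_of::<Padded>());
    println!("Packed: {} bytes", size_of::<Packed>());
}
```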

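We get lucky here because the cng field is at offset 202, which is a multiple of 2 bytes, so the padded and packed layouts come out identical. Otherwise we would need to find out whether the C++ code was compiled with packing (e.g. via #pragma pack) and, if so, use #[repr(C, packed)] instead of #[repr(C)].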
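If you'd rather not trust the arithmetic, the compiler can check it. The sketch below repeats the Speaker definition and asserts the offset of cng along with the overall size and alignment. It assumes a toolchain recent enough to have std::mem::offset_of! (Rust 1.77 or later). The 256-byte total is only what the Rust struct works out to; whether each record in the file really is 256 bytes is something to confirm, for example by checking that the file size is a multiple of 256.

```rust
use std::mem::{align_of, offset_of, size_of};

#[allow(non_camel_case_types)]
type WORD = u16;

#[allow(dead_code)]
#[repr(C)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    tks: [u8; 32],
    tfs: [u8; 32],
    ems: u8,
    asn: u8,
    cng: WORD,
    num: WORD,
    xtp: WORD,
    blk: u8,
    spr: u8,
    spare: [u8; 46],
}

fn main() {
    // 2*20 + 40 + 40 + 16 + 32 + 32 + 1 + 1 = 202 bytes come before `cng`.
    assert_eq!(offset_of!(Speaker, cng), 202);
    // The remaining fields add 2 + 2 + 2 + 1 + 1 + 46 = 54 bytes.
    assert_eq!(size_of::<Speaker>(), 256);
    // The widest field is a u16, so the struct is 2-byte aligned.
    assert_eq!(align_of::<Speaker>(), 2);
    println!("layout matches the hand calculation");
}
```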

Better Alternatives

There are a lot of better alternatives out there. If I was doing this as part of a commercial product where I didn't have full control over the input I would definitely reach for a better tool.

In this case, probably a parsing library like nom.

I was busy typing a reply at the same time :) And nom is the first thing I thought of too -- it's really fun to use, but there's a bit of a learning curve if you're not used to parser-combinators. Knowing how to use it is extremely useful, though, so it might be worth the time.

What I was playing around with was using the bytes crate to do the parsing directly. It's an easy way to read parts out of a slice/array sequentially, so it's quite easy to build a parser with it. The example below shows how you could use it.

use bytes::{Buf, Bytes};

fn main() {
    let bytes = b"this is a name\0                         this is another name\0                   \x01\x01\x00\x00 --and so on---";
    println!("{:?}",bytes);

    let mut buf = Bytes::from(&bytes[..]);

    // extract fields
    let first = buf.copy_to_bytes(40);
    let second = buf.copy_to_bytes(40);
    
    // The Windows documentation defines WORD as a 16-bit unsigned
    // integer, and x86 is little endian, but you may need to do
    // some experimenting to confirm this against the file.
    let word = buf.get_u16_le();

    // string conversion
    let first_str = convert_string_stop_null(&first).expect("string");
    let second_str = convert_string_stop_null(&second).expect("string");
    
    println!("first:  {}.", first_str);
    println!("second: {}.", second_str);
    println!("word:   {}.", word);

    // remaining buffer 
    println!("remaining buffer {:?}", buf);
}

// String conversion, assuming there's a null terminator somewhere.
// Find the null and extract everything before it.
fn convert_string_stop_null(raw: &[u8]) -> Option<String> {
    let v = raw.split(|b| *b == 0).nth(0)?;
    Some(String::from_utf8_lossy(v).to_string())
}

You'll need to play around with it a bit. I'm not sure what the assumed WORD size is, and the strings may or may not be null-terminated. Having a look at the file with hexdump -C or a hex editor is probably a good way to verify these things.

TL;DR: Look at my answer if you want to know how the C++ is serializing/deserializing the *.dat files and look at @mike-barber's answer if you want to do it more reliably/safely/idiomatically/better.

@Michael-F-Bryan, @mike-barber Wow! What a wonderful explanation! I'm going to take a crack at both today while the wifey makes chili!

Thank you both! I'll come back soon with any additional questions OR, hopefully, results!

Thank you again to both of you for the detailed and clear explanation, I really appreciate it.

Be careful with the Rust String type in this exercise. There's a very good chance you'll run into a non-ASCII character, and that it will be encoded with a different encoding, in which case from_utf8 will return an error. If you want to keep the data intact, you'll need to use a more flexible string type, like OsString, or bstr (from an external crate), or just Vec<u8>. If you'll be happy just to have mostly legible ASCII text, or if the encoding actually is UTF-8, you can use the standard String type, but create it with String::from_utf8_lossy, which converts invalid sequences to a character representing unintelligible data instead of returning an error.
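As a rough illustration of the difference, the sketch below feeds both functions a name containing a byte that is a valid character in Windows-1252 but not valid UTF-8. The bytes are made up for the example, and lossy_text is a hypothetical helper name. Whether the real file uses Windows-1252 at all is an assumption you'd need to check.

```rust
/// Hypothetical helper: converts raw bytes to a String, replacing anything
/// that isn't valid UTF-8 with U+FFFD (the replacement character).
fn lossy_text(raw: &[u8]) -> String {
    String::from_utf8_lossy(raw).into_owned()
}

fn main() {
    // "José" as a Windows-1252 program might store it. 0xE9 is 'é' in that
    // encoding, but a lone 0xE9 byte is not valid UTF-8.
    let raw: &[u8] = b"Jos\xE9";

    // The strict conversion reports an error...
    println!("strict: {:?}", std::str::from_utf8(raw).is_err());

    // ...while the lossy one keeps the readable part of the text.
    println!("lossy:  {}", lossy_text(raw));
}
```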

My solution to this was to give downstream users direct access to the field while also having a getter that returns Option<&str> to give you access to the text if it is valid UTF-8.

Okay, so I ran the code you posted and at first I thought it didn't work because I got a lot of 0s in the fields of the struct. Then I noticed the first name and last name fields.

An example of the output I received (changed the values for privacy):

Speaker: Speaker { name: [[21, 93, 111, 111, 111, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [12, 95, 107, 18, 111, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], addr1: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], addr2: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], phone: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], tks: [32, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], tfs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ems: 2, asn: 0, cng: 14, num: 0, xtp: 0, blk: 0, spr: 0, spare: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }

I opened the struct.dat with ghex and took a look at the first entry and it matched the first and last name field exactly for the 8 bits. The other fields were also correct although the majority are just 0. This particular user is no longer listed in the program so I think this is how a delete is handled.

I modified the code as follows and the first name and last name prints as expected.

    // Safety: See below
    let speaker: Speaker = unsafe { std::mem::transmute(buffer) };

    let full_name = match speaker.name() {
        Some(c) => c,
        None => panic!("Name() was empty."),
    };
    // Print out the first and last name.
    print!("First name: {}, Last name: {}", full_name.0, full_name.1);

Now I need to get the rest of the speakers out of the data file, so I'm guessing I need to step forward in the buffer by x bytes. I want to get that going before moving on to using nom.

Moving forward in the buffer

let mut buffer = [0_u8; std::mem::size_of::<Speaker>()];

I'm guessing that here the buffer is starting at 0_u8 and extending to the size of Speaker. Now thinking of how I can loop through the file while incrementing those values.

Yeah, obviously the name field wasn't long enough to fill the full array so it got padded out with zeroes.

When you print a string in C it'll stop printing at the first zero, whereas in Rust we need to manually make sure we ignore the trailing zeroes when interpreting the &[u8] as &str (hence the getter methods and convenience function).
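A minimal sketch of that trimming step is below. The function name c_str_field is made up for illustration. It cuts the slice at the first zero byte and only then checks for valid UTF-8, so a 20-byte name field containing "Aaron" followed by zeroes comes back as just "Aaron".

```rust
/// Interprets a fixed-size, zero-padded C char array as text.
/// Returns None if the bytes before the first NUL are not valid UTF-8.
/// (Hypothetical helper name, not taken from the playground code.)
fn c_str_field(raw: &[u8]) -> Option<&str> {
    // Keep everything before the first zero byte, or the whole slice
    // if the field is completely full and has no terminator.
    let end = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
    std::str::from_utf8(&raw[..end]).ok()
}

fn main() {
    // Simulate a 20-byte name field padded with zeroes.
    let mut first_name = [0_u8; 20];
    first_name[..5].copy_from_slice(b"Aaron");

    println!("{:?}", c_str_field(&first_name)); // prints Some("Aaron")
}
```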

If you are reading from a slice that is already in memory you can use slice's split_at() method to grab the bytes before and after a particular index. Then it's just a case of copying the bytes from the first bit into a Speaker and updating your original slice to point at the last bit.

I haven't tried running it, but you can do something like this:

const SPEAKER_SIZE: usize = std::mem::size_of::<Speaker>();

fn load_speakers(mut raw_data: &[u8]) -> Vec<Speaker> {
    let mut speakers = Vec::new();

    while raw_data.len() >= SPEAKER_SIZE {
        let (head, tail) = raw_data.split_at(SPEAKER_SIZE);

        let speaker = Speaker::load_from_bytes_somehow(head);
        speakers.push(speaker);

        raw_data = tail;
    }

    speakers
}

The line `let mut buffer = [0_u8; std::mem::size_of::<Speaker>()];` says "give me an array filled with 0 bytes whose length is equal to Speaker's size in bytes".

If you are familiar with C, it's roughly equivalent to this:

typedef struct
 {
    ...
 } Speaker;

int main() {
    uint8_t buffer[sizeof(Speaker)] = {0};
    return 0;
}
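For the Speaker::load_from_bytes_somehow placeholder used in load_speakers, one option that avoids unsafe altogether is to copy each field out of the slice in order. The sketch below is one way that might look, not the code from the playground. The names take and from_bytes are invented here, the 256-byte record size is the sum of the field sizes, and the WORD fields are assumed to be 16-bit little-endian values. The test data in main is fabricated rather than read from a real file.

```rust
use std::convert::TryInto;

/// Assumed record size: the sum of all field sizes in the struct.
const SPEAKER_SIZE: usize = 256;

#[allow(dead_code)]
#[derive(Debug)]
pub struct Speaker {
    name: [[u8; 20]; 2],
    addr1: [u8; 40],
    addr2: [u8; 40],
    phone: [u8; 16],
    tks: [u8; 32],
    tfs: [u8; 32],
    ems: u8,
    asn: u8,
    cng: u16,
    num: u16,
    xtp: u16,
    blk: u8,
    spr: u8,
    spare: [u8; 46],
}

/// Splits the first N bytes off the front of `input` and returns them as
/// an array. Panics if fewer than N bytes remain.
fn take<const N: usize>(input: &mut &[u8]) -> [u8; N] {
    let (head, tail) = input.split_at(N);
    *input = tail;
    head.try_into().expect("split_at returned exactly N bytes")
}

impl Speaker {
    /// Hypothetical stand-in for `load_from_bytes_somehow`.
    /// Returns None unless `bytes` is exactly one record long.
    pub fn from_bytes(mut bytes: &[u8]) -> Option<Speaker> {
        if bytes.len() != SPEAKER_SIZE {
            return None;
        }
        let r = &mut bytes;
        // Field expressions run in the order written, which matches the
        // order of the fields in the file.
        Some(Speaker {
            name: [take(r), take(r)],
            addr1: take(r),
            addr2: take(r),
            phone: take(r),
            tks: take(r),
            tfs: take(r),
            ems: take::<1>(r)[0],
            asn: take::<1>(r)[0],
            // Assumption: WORD is stored as a little-endian u16.
            cng: u16::from_le_bytes(take(r)),
            num: u16::from_le_bytes(take(r)),
            xtp: u16::from_le_bytes(take(r)),
            blk: take::<1>(r)[0],
            spr: take::<1>(r)[0],
            spare: take(r),
        })
    }
}

/// Parses every complete record; trailing bytes shorter than a record are ignored.
fn load_speakers(raw_data: &[u8]) -> Vec<Speaker> {
    raw_data
        .chunks_exact(SPEAKER_SIZE)
        .filter_map(Speaker::from_bytes)
        .collect()
}

fn main() {
    // Two fabricated records; a real program could get the bytes from
    // std::fs::read on the .dat file instead.
    let mut data = vec![0_u8; 2 * SPEAKER_SIZE];
    data[..5].copy_from_slice(b"Aaron");
    data[202..204].copy_from_slice(&14_u16.to_le_bytes());

    let speakers = load_speakers(&data);
    println!("loaded {} speakers, first cng = {}", speakers.len(), speakers[0].cng);
}
```

Because nothing is reinterpreted in place, this version doesn't depend on #[repr(C)] or on the struct's alignment. The trade-off is that the field order and sizes are spelled out twice, once in the struct and once in from_bytes.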

After 6 years, I was told I might not need to keep track of speakers any longer. lol

I'm still going to try to tackle this though. I'm going to try out nom.
