Correct way to extract type of data from a binary file

Hello.

At the moment I am working with binary files that contain different types of data. Each value can be 1, 2, 4, 8, or 16 bytes long.

The file begins with a magic number that encodes the data type stored in that file, so I parse the magic number to find out the data type.

To work with this data I created an enum:

enum DataType{
    Ubyte(u8),
    Int(i32),
    Double(f64),
    ...
}

and a struct that holds the raw bytes:

struct Data{
    data: Vec<u8>
}

I also implemented the Index trait for this struct, to extract the actual data by index, and the Iterator trait as well.

However, using these constructs is not convenient: every time I get an enum, I need to match over it to extract the actual data:

let num = match data {
    DataType::Ubyte(x) => x,
    ....
};

So I wonder, what is the correct way to deal with this case?

Thank you very much in advance.

It feels like you want to include DataType as a field of Data; the Data struct owns the bytes, and also knows how to interpret those bytes. Implement Index and Iterator for Data, and you should be able to interact with your generic data without matching on the enum directly; let the trait impls do that.

I’m confused by the description. It sounds like you have a file which contains:

  • A single header
  • Followed by many values of the corresponding type

Is that right? Then I would write:

// Private; this only exists so you can write
// a function that parses the header and returns
// the data type.
//
// You might not even need it. E.g. you could
// use a Data with an empty vec.
enum Tag {
    Ubyte,
    Int,
    Double,
}

// A Vec of enums (or iterator of enums, etc.)
// is conceptually incorrect if all of the
// enums have the same variant.
//
// You want an enum of vectors.
pub enum Data {
    Ubyte(Vec<u8>),
    Int(Vec<i32>),
    Double(Vec<f64>),
}
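With an enum of vectors, the match happens once per file instead of once per element, and each arm is ordinary typed code. A minimal sketch, assuming the Data enum above; the describe function is a hypothetical name used only for illustration:

```rust
pub enum Data {
    Ubyte(Vec<u8>),
    Int(Vec<i32>),
    Double(Vec<f64>),
}

/// Hypothetical helper: a single match per file. Inside each arm the
/// element type is known, so no per-value matching is needed.
pub fn describe(data: &Data) -> String {
    match data {
        Data::Ubyte(v) => format!("{} bytes, max = {:?}", v.len(), v.iter().max()),
        Data::Int(v) => {
            // Widen to i64 so the sum cannot overflow for typical inputs.
            let sum: i64 = v.iter().map(|&x| i64::from(x)).sum();
            format!("{} ints, sum = {}", v.len(), sum)
        }
        Data::Double(v) => format!("{} doubles, sum = {}", v.len(), v.iter().sum::<f64>()),
    }
}
```

Each arm can do something different, which is usually what you want when the files hold different types.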

Sorry that the description was confusing.

To be more precise, the structure of the file is the following:

offset                                      type                         basic file format
0000                                        i32                          magic number
0004                                        data type from magic number  data
+1,+2,+4,+8,+16 depending on the data type  data type from magic number  data
+1,+2,+4,+8,+16 depending on the data type  data type from magic number  data
....

All data is of the same type, of course.

In what I wrote, I did not have a vector of enums.
I had a struct holding a Vec<u8>, and in this vector I stored the raw bytes from the file. To obtain the data in its actual format I used the from_be_bytes() function. Index and Iterator were implemented for that struct.
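For reference, per-element decoding of this kind can be sketched with only the standard library. The int_at name and the assumption that the 4-byte header has already been stripped from the buffer are illustrative, not the original code:

```rust
use std::convert::TryInto;

/// Hypothetical helper: decode the big-endian i32 stored at element
/// `index` of a raw byte buffer. Assumes the header was already removed.
/// Returns None if the element lies outside the buffer.
fn int_at(bytes: &[u8], index: usize) -> Option<i32> {
    let start = index.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let chunk = bytes.get(start..end)?;
    // A 4-byte slice always converts to [u8; 4].
    let array: [u8; 4] = chunk.try_into().ok()?;
    Some(i32::from_be_bytes(array))
}
```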

I agree that an enum of vectors may be better, but my question still applies:

when I parse everything and get Data, I have to match over it to extract the data, whether I use an enum of vectors or, as I did, an enum of values. In the notation from my first post:

let myData = Data::open("file_name");
let myNum: DataType = myData.index(10);
let actualNum = match myNum {
    DataType::Ubyte(x) => x,
    DataType::Int(x) => x,
...
};

In the case of an enum of vectors I have to do the same thing.
So my question was whether there is a way of avoiding matching every time.

Sorry, but I did not quite get what you mean. I understand that DataType should be a field of Data, but what should I do next?

The problem here is that you haven’t told us what you want to do with the data. Since you have data of different types you have to decide what to do with each type of data, which is why you have a match statement. Since you aren’t satisfied with this, there must be some commonality between the code for the different types, but you haven’t hinted at what that commonality might be. Without knowing more about your goals, it’s hard to tell you how you might achieve them.
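To illustrate what such commonality could look like: if the operation really is the same for every element type, it can be written once as a generic function, and the match shrinks to a one-line dispatch per variant. This is only a sketch with hypothetical names (position_of_max is not from the original post), assuming an enum of vectors:

```rust
pub enum Data {
    Ubyte(Vec<u8>),
    Int(Vec<i32>),
    Double(Vec<f64>),
}

/// Shared logic, written once for any comparable element type.
/// Returns the index of the first maximum, or None for an empty slice.
fn position_of_max<T: PartialOrd>(values: &[T]) -> Option<usize> {
    let mut best: Option<usize> = None;
    for (i, v) in values.iter().enumerate() {
        let is_better = match best {
            None => true,
            Some(b) => *v > values[b],
        };
        if is_better {
            best = Some(i);
        }
    }
    best
}

impl Data {
    /// The only place that matches on the variant.
    pub fn position_of_max(&self) -> Option<usize> {
        match self {
            Data::Ubyte(v) => position_of_max(v),
            Data::Int(v) => position_of_max(v),
            Data::Double(v) => position_of_max(v),
        }
    }
}
```

Callers then write data.position_of_max() and never see the enum variants. This only works when the result type does not depend on the element type.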

Basically I can’t picture a scenario in which you have to match over each individual value.

To me your code example is highly contrived because I cannot imagine a case where I’ve ever needed to look at the nth element of a vector of unknown type read from a file. It also clearly doesn’t reflect your own usage because it doesn’t typecheck. (x has different types in different branches!)

Basically, if different files have different types in them, then I can only imagine that I would want to do different things to them! (there are some exceptions I’ll cover at the end)


Just this Thursday, I had to write a text-based parser of a similar format, and I had no trouble parsing it into an enum of vecs up front. Here is an adaptation of my code to your binary format. (Note this requires nom 3.2.1 or earlier, because I have no idea how to use many0! in nom 4.)

Data types:

use nom::*;

#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
enum TypeTag { Integer, Real, Complex }

#[derive(Debug, Clone, PartialEq)]
pub enum Data {
    Integer(Vec<i32>),
    Real(Vec<f64>),
    Complex(Vec<(f64, f64)>),
}

Helper parsers:
If you’re not familiar with nom, named!{fn_name<I, O>, ...} defines a parsing function that takes I (either &[u8] or &str) as input and produces an IResult<I, O> holding the parsed value and the unparsed remainder.

named!{integer<&[u8], i32>,        i32!(Endianness::Big)}
named!{real<&[u8], f64>,           map!(u64!(Endianness::Big), f64::from_bits)}
named!{complex<&[u8], (f64, f64)>, pair!(real, real)}
named!{
    type_tag<&[u8], TypeTag>,
    switch!(
        i32!(Endianness::Big),
        0x0000_0001 => value!(TypeTag::Integer)
        | 0x0000_0002 => value!(TypeTag::Real)
        | 0x0000_0003 => value!(TypeTag::Complex)
    )
}

Main parser:
many0!(parser) repeatedly applies a parser and returns a Vec of results. So here, we parse a TypeTag, and:

  • If it’s Integer, we use many0!(integer) to parse a Vec<i32>, then put it in Data::Integer,
  • If it’s Real, we use many0!(real) to parse a Vec<f64>, then put it in Data::Real,
  • etc.
named!{
    file<&[u8], Data>,
    terminated!(
        switch!(
            type_tag,
            TypeTag::Integer => map!(many0!(integer), Data::Integer)
            | TypeTag::Real => map!(many0!(real), Data::Real)
            | TypeTag::Complex => map!(many0!(complex), Data::Complex)
        ),
        eof!()
    )
}

Test:

#[test]
fn test() {
    const INT_TEST: &'static [u8] = b"\
\x00\x00\x00\x01\
\x00\x00\x00\x04\
\x00\x00\x00\x10\
";
    assert_eq!(
        file(INT_TEST).unwrap().1,
        Data::Integer(vec![4, 16]),
    );
}

Now, there are a few exceptions to what I said at the beginning of this post. For example, I might want to get the len of the inner Vec without knowing what type it is.
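Written by hand, that exception looks like the sketch below, which assumes the Data enum from the parser example. Every type-independent operation needs one match arm per variant, which is where the boilerplate comes from:

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Data {
    Integer(Vec<i32>),
    Real(Vec<f64>),
    Complex(Vec<(f64, f64)>),
}

impl Data {
    /// Number of elements, whatever their type.
    pub fn len(&self) -> usize {
        match self {
            Data::Integer(v) => v.len(),
            Data::Real(v) => v.len(),
            Data::Complex(v) => v.len(),
        }
    }
}
```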

I generally try to avoid these situations, but if a large number of them pop up and I have no alternative, I do have a technique for dealing with the boilerplate. Basically, the goal is to implement four functions as_ref, as_mut, map and fold that together serve 99% of use cases.

I’ll write something up about this technique next, even if only to give myself something to link to from other threads. I’ll warn you though: it can be pretty costly in terms of syntax, so it’s only useful if you have an extremely large number of variants.

Here’s the next response I alluded to.

I ended up not getting to some of the crazier stuff, as I decided to start with macros (which are actually a pretty ergonomic solution!).
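That write-up is not reproduced in this thread. For a rough idea of how a macro can cut the boilerplate, here is a hypothetical sketch that is not taken from it. The with_vec! name is made up, and the Data enum from the parser example is assumed:

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Data {
    Integer(Vec<i32>),
    Real(Vec<f64>),
    Complex(Vec<(f64, f64)>),
}

/// Hypothetical macro: expands to a match that evaluates the same
/// expression in every arm, with `$v` bound to the inner Vec.
macro_rules! with_vec {
    ($data:expr, $v:ident => $body:expr) => {
        match $data {
            Data::Integer($v) => $body,
            Data::Real($v) => $body,
            Data::Complex($v) => $body,
        }
    };
}

impl Data {
    pub fn is_empty(&self) -> bool {
        with_vec!(self, v => v.is_empty())
    }

    /// Works through &mut self too: `v` is then a &mut Vec<_>.
    pub fn truncate(&mut self, n: usize) {
        with_vec!(self, v => v.truncate(n))
    }
}
```

The body must type-check for every variant and produce the same result type in each arm, so this only covers operations that do not care about the element type.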

“Basically I can’t picture a scenario in which you have to match over each individual value.”

Now, after thinking about it more, I completely agree that it makes little sense to match over individual values. At the beginning I was thinking that I would read the content into a Vec<u8>, as I wrote, and then, when I needed an individual value from the vector, my index() function would provide the enum.

But now, after reading your code, of course it makes much more sense to convert the content into a Vec of the corresponding type at the beginning and then have an enum of vectors.
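Without nom, that up-front conversion could look roughly like the following standard-library sketch. It assumes the magic values 1, 2 and 3 used in the nom example (the real format's values are not given in this thread) and big-endian data; the parse name is hypothetical:

```rust
use std::convert::TryInto;

#[derive(Debug, Clone, PartialEq)]
pub enum Data {
    Integer(Vec<i32>),
    Real(Vec<f64>),
    Complex(Vec<(f64, f64)>),
}

/// Decode a big-endian f64 from a slice of exactly 8 bytes.
fn be_f64(chunk: &[u8]) -> f64 {
    f64::from_be_bytes(chunk.try_into().expect("chunk must be 8 bytes"))
}

/// Hypothetical parser: returns None for a short header, an unknown
/// magic number, or a body that is not a whole number of elements.
pub fn parse(bytes: &[u8]) -> Option<Data> {
    if bytes.len() < 4 {
        return None;
    }
    let (header, body) = bytes.split_at(4);
    let header: [u8; 4] = header.try_into().ok()?;
    // Magic values are assumptions borrowed from the nom example.
    match i32::from_be_bytes(header) {
        1 if body.len() % 4 == 0 => Some(Data::Integer(
            body.chunks_exact(4)
                .map(|c| i32::from_be_bytes(c.try_into().expect("4 bytes")))
                .collect(),
        )),
        2 if body.len() % 8 == 0 => {
            Some(Data::Real(body.chunks_exact(8).map(be_f64).collect()))
        }
        3 if body.len() % 16 == 0 => Some(Data::Complex(
            body.chunks_exact(16)
                .map(|c| (be_f64(&c[..8]), be_f64(&c[8..])))
                .collect(),
        )),
        _ => None,
    }
}
```

The element type is decided once, when the magic number is read. After that the rest of the program works with a plain Vec<i32> or Vec<f64>.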

Actually, when I first started on this I wrote it in C++ just to test (I’m new to Rust), and there I did put the content into a vector of the right type. However, I did not write the general version for all possible types, only for i32, so I did not realise that an enum of individual values is a bad solution. Thanks for pointing it out.

“It also clearly doesn’t reflect your own usage because it doesn’t typecheck. (x has different types in different branches!)”

I completely agree.

Thanks a lot. This is really helpful. Now I’m going to dig into details.
