Binary serialization in custom format

Hi all, I'm pretty new to Rust and I'm trying to implement a custom binary serialization. I've seen that serde (and bincode) seem to be the de facto standard in Rust, which is actually nice.
My problem is that I have custom structs (ported from the Java world) and I want to write bytes into a byte buffer (DataOutputStream in Java) to produce my own format.

The struct is very simple

pub struct MyStruct {
    size: u32,
    offsets: Vec<u32>,
    strings: Vec<u8>,
}

size indicates how many strings I'm storing.
offsets indicates where each string starts and ends in the strings vector.
strings is the vector of all the strings, one after the other.

For instance, when storing "foo" and "bar" I will have

  • size set to 2
  • offsets set to [16, 19, 22], which represent where the strings start and end in the sequence of bytes
  • strings set to [102, 111, 111, 98, 97, 114] (the byte representation of "foobar")

The serialized structure will give this sequence of bytes:
[0, 0, 0, 2, 0, 0, 0, 16, 0, 0, 0, 19, 0, 0, 0, 22, 102, 111, 111, 98, 97, 114]

As you can see, there is the size 2 (the first 4 bytes), then the offsets that show where each string starts and ends (foo starts at byte 16 and takes 19-16=3 bytes, bar starts at byte 19 and takes 22-19=3 bytes).

Now, doing this in Java is pretty simple (sorry for the comparison). When serializing I just do a bunch of DataOutputStream.writeInt() and DataOutputStream.write() calls, and when deserializing I do the opposite: ByteBuffer.getInt() and ByteBuffer.get() in the corresponding for loops.

I was trying to do the same in Rust, and I got to know serde, which looked like a neat way to remove a bunch of boilerplate I didn't really need from the Java variant. But I've ended up with even more code, so I suspect I'm doing something wrong.

Here's my implementation:

#[cfg(test)]
mod test {
    use super::*;
    use bincode::Options;
    use serde::ser::SerializeStruct;
    use serde::{Deserialize, Serialize, Serializer, Deserializer, de};
    use serde::de::{Visitor, SeqAccess};
    use std::fmt;

    #[derive(PartialEq, Debug)]
    pub struct MyStruct {
        size: u32,
        offsets: Vec<u32>,
        strings: Vec<u8>,
    }

    impl Serialize for MyStruct {
        fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
        where
            S: Serializer,
        {
            let mut state = serializer.serialize_struct("MyStruct", 3)?;
            state.serialize_field("size", &self.size)?;
            for i in &self.offsets {
                state.serialize_field("offsets", i)?;
            }
            for o in &self.strings {
                state.serialize_field("strings", o)?;
            }
            state.end()
        }
    }

    impl<'de> Deserialize<'de> for MyStruct {
        fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
            where
                D: Deserializer<'de>,
        {
            struct SymbolTableVisitor;

            impl<'de> Visitor<'de> for SymbolTableVisitor {
                type Value = MyStruct;

                fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
                    formatter.write_str("MyStruct")
                }

                fn visit_seq<V>(self, mut seq: V) -> Result<MyStruct, V::Error>
                    where
                        V: SeqAccess<'de>,
                {
                    let size: u32 = seq
                        .next_element()?
                        .ok_or_else(|| de::Error::invalid_length(0, &self))?;
                    println!("Deserialized size: {}", size);
                    let capacity = (size + 1) as usize;
                    let mut offsets = Vec::with_capacity(capacity);
                    let mut strings = Vec::with_capacity(size as usize);

                    for i in 0..capacity {
                        let element = seq
                            .next_element()?
                            .ok_or_else(|| de::Error::invalid_length(i, &self))?;
                        println!("Deserialized element: {}", element);
                        offsets.push(element)
                    }

                    for i in 0..size {
                        let symbol = seq
                            .next_element()?
                            .ok_or_else(|| de::Error::invalid_length(i as usize, &self))?;
                        println!("Deserialized symbol: {}", symbol);
                        strings.push(symbol)
                    }

                    Ok(MyStruct {
                        size,
                        offsets,
                        strings,
                    })
                }
            }

            const FIELDS: &'static [&'static str] = &["size", "offsets", "strings"];

            deserializer.deserialize_struct("MyStruct", FIELDS, SymbolTableVisitor)
        }
    }

    #[test]
    fn it_can_get_serialized_and_deserialized() {
        let mut offsets = Vec::with_capacity(2);
        offsets.push(1);
        offsets.push(2);

        let mut strings = Vec::with_capacity(2);
        strings.append(&mut "hello".to_string().into_bytes());
        strings.append(&mut "world".to_string().into_bytes());

        let my_struct = MyStruct {
            size: 2,
            offsets,
            strings,
        };

        let my_options = bincode::DefaultOptions::new()
            .with_fixint_encoding()
            .with_big_endian();

        let serialized = my_options.serialize(&my_struct).unwrap();
        println!("serialized = {:?}", serialized);

        let deserialized: MyStruct = my_options.deserialize(&serialized).unwrap();
        println!("deserialized = {:?}", deserialized);
    }
}

Spoiler: the test doesn't work.

I have many questions on this snippet, trying to summarize them here:

  • Is this really the best practice to ser/deser a custom format or should I change direction here? It seems that serde/bincode are very good as long as you don't have format constraints, but when you want to do something more custom they start to seem too restrictive.
  • Deserialization doesn't work, I suspect because I've declared 3 fields but I'm trying to deserialize more elements than that: each next_element call counts as a field, whereas I'm trying to deserialize part of a field. I'm not sure how to solve this.
  • I'm not quite sure why I need to name the fields and the struct since I never use the names. This happens for instance in deserialize_struct, serialize_field, serialize_struct, write_str.
  • One other question is about all the usize to u32 conversions: should I really do all of that?

Sorry for the long post. I hope you can give me a lead on what I'm doing wrong, since I thought this would be easier than what I'm experiencing.
Thanks a lot!

What is "this"? If you are asking whether you should use Serde to implement a custom serialization format, then the answer is probably "yes".


As for the boilerplate: you could just #[derive(Serialize, Deserialize)] the implementations of the two traits on your MyStruct type.

However, I don't see you implementing Serializer and Deserializer (note the "r"!) anywhere, which is required for any format. These are the traits that should define your binary format, not the Serialize and Deserialize implementations of any particular user-defined type.

For an example of how to implement a binary serializer, you can have a look at my neodyn_xc crate, for instance. I specifically wrote this crate with readability in mind: the code is not (yet) full of hard-to-grasp performance tricks like some of the more mature formats in the ecosystem, so it should hopefully be easier to understand than, for instance, bincode.

If you want to use a custom format you can't use bincode. bincode is its own format, like JSON.

Using serde here is probably overkill. Serde is great when you have some full-fledged data format, like JSON or TOML, and you want people to easily convert structs that implement Serialize to said format, and from that format to a struct that implements Deserialize. If I'm understanding correctly, you have a specific struct you want to turn into a specific kind of array of bytes and back. For something like that I would just write methods

impl MyStruct {
    pub fn serialize(self) -> Vec<u8> {
        todo!()
    }

    pub fn deserialize(bytes: &[u8]) -> Self {
        todo!()
    }
}
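A minimal sketch of what those two methods might look like using only the standard library, assuming the layout described in the original post: a big-endian u32 size, then size + 1 big-endian u32 offsets, then the raw string bytes. Two things differ from the stub above and are illustrative choices rather than anything stated in the thread: serialize borrows &self instead of consuming the value, and deserialize returns Option<Self> so that truncated input is reported instead of causing a panic. The read_u32 helper is a hypothetical name.

```rust
#[derive(Debug, PartialEq)]
pub struct MyStruct {
    pub size: u32,
    pub offsets: Vec<u32>,
    pub strings: Vec<u8>,
}

impl MyStruct {
    /// Layout: size, then the offsets, then the raw string bytes.
    /// Every integer is written as 4 big-endian bytes.
    pub fn serialize(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(4 + 4 * self.offsets.len() + self.strings.len());
        out.extend_from_slice(&self.size.to_be_bytes());
        for offset in &self.offsets {
            out.extend_from_slice(&offset.to_be_bytes());
        }
        out.extend_from_slice(&self.strings);
        out
    }

    /// Returns None if the input is shorter than the header it announces.
    /// Assumption: everything after the header is string data.
    pub fn deserialize(bytes: &[u8]) -> Option<Self> {
        let size = read_u32(bytes, 0)?;
        let count = (size as usize).checked_add(1)?;
        let mut offsets = Vec::new();
        for i in 0..count {
            offsets.push(read_u32(bytes, 4 + 4 * i)?);
        }
        let header_len = 4 + 4 * count;
        let strings = bytes.get(header_len..)?.to_vec();
        Some(MyStruct { size, offsets, strings })
    }
}

/// Hypothetical helper: reads a big-endian u32 starting at index `at`.
fn read_u32(bytes: &[u8], at: usize) -> Option<u32> {
    let chunk = bytes.get(at..at + 4)?;
    Some(u32::from_be_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
}
```

With size 2, offsets [16, 19, 22] and the bytes of "foobar", serialize should produce the 22-byte sequence from the original post, and deserialize should turn it back into an equal value.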

Yeah, by "this" I meant whether serde is the right way to go in those circumstances. Thanks for the reference, I'll give it a look!

OK, this is interesting. MyStruct is just one struct that will be used together with other structs, so in short the format will be a bit bigger than just the one in my example. But what you say makes sense to me. So you are suggesting to implement a serde serializer only if there is a really complex format such as JSON, but if I have my own format, used only by my module, to go the "plain" way and just implement de/serialization in custom methods?
I wonder (probably a bit off topic): if I go this way, I will have the bytes for all my structs in memory (consider for instance 1B of strings). Then, when I return from serialize, is there a way to flush to a file/stream as OutputStream does?

If the format is simple and you only use it with your own types (and there aren't that many of them) then I would probably opt for writing the methods without serde. Writing a custom Serializer/Deserializer is a bit more work: Writing a data format · Serde, though you would only have to write it once and then you could just use #[derive(Serialize, Deserialize)] on any of your structs, so it could be worth it if you have a lot of types or you expect them to change often.

If I'm understanding correctly, you could make the methods look something like

fn serialize<W>(self, to: W)
where
    W: Write
{
    todo!()
}

and just write all the bytes into to as they are serialized. In this way, you could for example call the function with a file. Same for deserialize, except using the Read trait. You'd still have all the strings in memory in the struct that you're (de)serializing, though. If you want to avoid that you'd need to think of some other solution that doesn't involve a struct with a strings: Vec<u8> field.
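As a rough sketch of that idea, here is one way the Write/Read variant could look with only the standard library. The method names write_to and read_from, the read_u32 helper, and the io::Result return types are illustrative choices, not something from the thread. The reader also assumes that the offsets are absolute positions, so the string data is last offset minus first offset bytes long.

```rust
use std::io::{self, Read, Write};

#[derive(Debug, PartialEq)]
pub struct MyStruct {
    pub size: u32,
    pub offsets: Vec<u32>,
    pub strings: Vec<u8>,
}

impl MyStruct {
    /// Writes each field straight into `to`, big-endian, without
    /// building an intermediate Vec<u8>.
    pub fn write_to<W: Write>(&self, to: &mut W) -> io::Result<()> {
        to.write_all(&self.size.to_be_bytes())?;
        for offset in &self.offsets {
            to.write_all(&offset.to_be_bytes())?;
        }
        to.write_all(&self.strings)
    }

    /// Reads size, then size + 1 offsets, then the string bytes.
    pub fn read_from<R: Read>(from: &mut R) -> io::Result<Self> {
        let size = read_u32(from)?;
        let mut offsets = Vec::new();
        for _ in 0..=size {
            offsets.push(read_u32(from)?);
        }
        // Assumption: the first offset marks the start of the string data
        // and the last one marks its end.
        let start = offsets[0];
        let end = offsets[offsets.len() - 1];
        let len = end.checked_sub(start).ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidData, "offsets are out of order")
        })? as usize;
        let mut strings = vec![0u8; len];
        from.read_exact(&mut strings)?;
        Ok(MyStruct { size, offsets, strings })
    }
}

/// Hypothetical helper: reads 4 bytes and decodes them as a big-endian u32.
fn read_u32<R: Read>(from: &mut R) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    from.read_exact(&mut buf)?;
    Ok(u32::from_be_bytes(buf))
}
```

Because the functions are generic, the same code should work with a Vec<u8>, a std::fs::File, or a File wrapped in std::io::BufWriter or BufReader to avoid many small system calls.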

Thanks!

I think I can live with the strings in memory as long as the writer buffers to a stream, so I won't have the data duplicated in memory.
But I'm curious whether you have any suggestion for a solution that doesn't involve a struct with strings: Vec<u8>?

That really depends on the rest of your application, it could be that there's no good way to avoid it. For example, if you're reading these strings from somewhere and just want to serialize them to a file you could have a function that takes both a generic R: Read which gives you the strings and a W: Write where you write the serialized output.

My friend, have a look at deku.
It's criminally underrated.

Just as a follow-up, I was experimenting with this path and there is for sure something that I'm missing.
When I use a W: Write it seems I can only write &[u8], but I was expecting I could use something like write_u32 and such. Am I missing something?

byteorder crate provides them.

sorry about the newbie question.. but if I do something like

pub fn serialize<W>(&self, to: &mut W) where W: Write {
       //size is of type of usize
        to.write_u32(u32::try_from(self.size).unwrap());

How can I use byteorder? Should I just import it? It's not clear to me from the docs how to use it.

You can copy the example in the doc.

use byteorder::{BigEndian, WriteBytesExt};

let size: u32 = ...;
to.write_u32::<BigEndian>(size).unwrap();

This assumes you want the number to be stored as big endian, i.e. the number 0x12345678 stored as the byte sequence 0x12, 0x34, 0x56, 0x78, in that order. Use little endian if you want the bytes stored in reverse order.
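To make the difference concrete, here is a small sketch that shows both byte orders for the same number. It uses the standard library's to_be_bytes and to_le_bytes rather than byteorder so that it runs without extra dependencies; byte_layouts is a hypothetical helper name.

```rust
/// Returns the big-endian and little-endian byte layouts of a u32.
fn byte_layouts(n: u32) -> ([u8; 4], [u8; 4]) {
    (n.to_be_bytes(), n.to_le_bytes())
}
```

For 0x12345678 the first array is [0x12, 0x34, 0x56, 0x78] and the second is [0x78, 0x56, 0x34, 0x12].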

Yeah, that's what I thought, but when I try to import it I get a compile error

 use byteorder::{BigEndian, WriteBytesExt};
  |     ^^^^^^^^^ use of undeclared crate or module `byteorder`

isn't this part of the standard lib?

No, it's an external crate, you'll need to stick it in your Cargo.toml manifest. crates.io/crates/byteorder

Ah, I see, that makes sense. Isn't there any "native" crate that I can use to achieve the same?

byteorder is the widely used, effectively standard crate for this.

You can do it with the standard library like this:

to.write_all(&u32::to_be_bytes(size))?;
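Building on that one-liner, here is a sketch of how a usize length could be written without the unwrap from the earlier snippet. The helper name write_len and the choice of io::ErrorKind::InvalidInput are assumptions made for illustration. The idea is to convert with u32::try_from and turn a failed conversion into an io::Error, so the caller can use ? throughout.

```rust
use std::convert::TryFrom;
use std::io::{self, Write};

/// Writes a length as 4 big-endian bytes, returning an error instead of
/// panicking when the value does not fit in a u32.
fn write_len<W: Write>(to: &mut W, len: usize) -> io::Result<()> {
    let len = u32::try_from(len)
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "length exceeds u32::MAX"))?;
    to.write_all(&len.to_be_bytes())
}
```

Calling write_len(&mut out, 2) on an empty Vec<u8> leaves it holding [0, 0, 0, 2]. On a 64-bit target, passing usize::MAX returns an error and writes nothing.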