Choice of format for serialisation

I am having trouble finding a good format for serialising data. I have been using msgpack, but recently realised that it doesn't do a great job with large byte Vecs. As a binary format, I would expect the serialised data to have only a fixed overhead, but that does not seem to be the case.

For example, when serialising, say, a 2.3 MB Vec of bytes, I would expect the result to be 2.3 MB plus a few dozen bytes at most. Instead it seems to be about 3.6 MB.

Is there a better choice?

It seems you're serializing it as a base64 string by accident. It works in my simple test.

Cargo.toml:

[package]
name = "foo"
version = "0.1.0"
edition = "2021"

[dependencies]
rmp-serde = "1.1"
serde = { version = "1", features = ["derive"] }

src/main.rs:

#[derive(Debug, serde::Serialize)]
struct Msg {
    foo: Vec<u8>,
}

fn main() {
    let msg = Msg { foo: vec![42; 2_300_000] };
    let pack = rmp_serde::to_vec(&msg).unwrap();
    println!("size: {}", pack.len());
}

Result:

size: 2300006

Hmm, thanks. I must be missing something. Will investigate what has happened.

Aha. If you change 42 to 250, the output is

size: 4600006

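This is because msgpack encodes numbers differently depending on their value, for efficiency. See uint.rs in the 3Hren/msgpack-rust repository on GitHub. Note that a u8 above 127 takes up 2 bytes this way (a one-byte marker followed by the value), whereas values up to 127 fit in a single byte, which is why the 42 case came out at the expected size.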

Also note that msgpack allows for efficient sending of Vec<u8> (msgpack/spec.md at master · msgpack/msgpack · GitHub) and of other Vecs (msgpack/spec.md at master · msgpack/msgpack · GitHub). I'm just having a hard time figuring out how to do it via rmp-serde. In my own usage of rmp I've been using write_value from rmpv::encode.
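To make the size difference concrete, here is a minimal std-only sketch that hand-encodes a byte slice two ways: as a MessagePack array of integers, and as a single binary blob. I am assuming the efficient Vec<u8> path mentioned above is the spec's bin family; the function names are made up for illustration and this is not how rmp or rmp-serde is implemented internally.

```rust
/// Write a MessagePack array header (fixarray, array 16 or array 32).
fn write_array_header(out: &mut Vec<u8>, len: usize) {
    assert!(len <= u32::MAX as usize, "too long for array 32");
    if len <= 15 {
        out.push(0x90 | len as u8); // fixarray: length lives in the marker
    } else if len <= u16::MAX as usize {
        out.push(0xdc); // array 16
        out.extend_from_slice(&(len as u16).to_be_bytes());
    } else {
        out.push(0xdd); // array 32
        out.extend_from_slice(&(len as u32).to_be_bytes());
    }
}

/// Encode bytes as a generic sequence: one MessagePack integer per element.
fn encode_as_array(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(data.len() + 5);
    write_array_header(&mut out, data.len());
    for &b in data {
        if b <= 0x7f {
            out.push(b); // positive fixint: 1 byte
        } else {
            out.push(0xcc); // uint 8 marker, then the value: 2 bytes
            out.push(b);
        }
    }
    out
}

/// Encode bytes as one blob (bin 8, bin 16 or bin 32): header + raw bytes.
fn encode_as_bin(data: &[u8]) -> Vec<u8> {
    let len = data.len();
    assert!(len <= u32::MAX as usize, "too long for bin 32");
    let mut out = Vec::with_capacity(len + 5);
    if len <= u8::MAX as usize {
        out.push(0xc4); // bin 8
        out.push(len as u8);
    } else if len <= u16::MAX as usize {
        out.push(0xc5); // bin 16
        out.extend_from_slice(&(len as u16).to_be_bytes());
    } else {
        out.push(0xc6); // bin 32
        out.extend_from_slice(&(len as u32).to_be_bytes());
    }
    out.extend_from_slice(data);
    out
}

fn main() {
    let small = vec![42u8; 2_300_000];
    let large = vec![250u8; 2_300_000];
    println!("array of 42s:  {}", encode_as_array(&small).len());
    println!("array of 250s: {}", encode_as_array(&large).len());
    println!("bin of 250s:   {}", encode_as_bin(&large).len());
}
```

The array sizes come out one byte short of the 2300006 and 4600006 reported earlier in the thread; the missing byte is presumably the one-element array header that wraps the struct.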


Oh, in that case the problem is that serde can't specialize Vec<u8> over its generic Vec<T> impl. If you control the serialized types, try containers other than Vec<u8> that are specialized for bytes, like the ones from the bytes crate or BString from the bstr crate.
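A toy model of why the container type matters, as far as I understand it. The traits below (Sink, Encode) and the ByteBuf wrapper are hypothetical stand-ins written for this sketch, not serde's real API. The point is only that a blanket impl for Vec<T> has to go element by element, and without specialization a second impl for Vec<u8> would conflict with it, so a dedicated byte wrapper is what gets to use the bulk path.

```rust
/// Hypothetical stand-in for a format backend.
trait Sink {
    fn put_u8(&mut self, value: u8);
    fn begin_seq(&mut self, len: usize);
    fn put_bytes(&mut self, bytes: &[u8]);
}

/// Hypothetical stand-in for a "serializable" trait.
trait Encode {
    fn encode<S: Sink>(&self, sink: &mut S);
}

impl Encode for u8 {
    fn encode<S: Sink>(&self, sink: &mut S) {
        sink.put_u8(*self);
    }
}

// Blanket impl: every Vec<T>, including Vec<u8>, is a sequence of elements.
// Adding `impl Encode for Vec<u8>` next to this would be rejected as a
// conflicting impl on stable Rust.
impl<T: Encode> Encode for Vec<T> {
    fn encode<S: Sink>(&self, sink: &mut S) {
        sink.begin_seq(self.len());
        for item in self {
            item.encode(sink);
        }
    }
}

/// A distinct type can opt into the bulk path without any conflict.
struct ByteBuf(Vec<u8>);

impl Encode for ByteBuf {
    fn encode<S: Sink>(&self, sink: &mut S) {
        sink.put_bytes(&self.0);
    }
}

/// A sink that only counts how it was driven.
#[derive(Default)]
struct CallCounter {
    element_calls: usize,
    bulk_calls: usize,
}

impl Sink for CallCounter {
    fn put_u8(&mut self, _value: u8) {
        self.element_calls += 1;
    }
    fn begin_seq(&mut self, _len: usize) {}
    fn put_bytes(&mut self, _bytes: &[u8]) {
        self.bulk_calls += 1;
    }
}

fn count_calls<T: Encode>(value: &T) -> CallCounter {
    let mut counter = CallCounter::default();
    value.encode(&mut counter);
    counter
}

fn main() {
    let generic = count_calls(&vec![7u8; 1000]);
    let wrapped = count_calls(&ByteBuf(vec![7u8; 1000]));
    println!("Vec<u8>: {} element calls, {} bulk calls", generic.element_calls, generic.bulk_calls);
    println!("ByteBuf: {} element calls, {} bulk calls", wrapped.element_calls, wrapped.bulk_calls);
}
```

With the generic impl the format only ever sees individual numbers, so it has no chance to pick a blob encoding; the wrapper hands it the whole slice at once.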

Two other possibilities:

  1. Use a format that natively supports binary blobs, for example my Neodyn Exchange crate:

    use anyhow::Result;
    use neodyn_xc::Value;
    
    fn main() -> Result<()> {
        let v: Vec<u8> = (0..=u8::MAX).cycle().take(u16::MAX.into()).collect();
        let vlen = v.len();
        let neodyned = neodyn_xc::to_bytes(&Value::Blob(v))?;
    
        println!("v.len() = {}, neodyned.len() = {}", vlen, neodyned.len());
     
        Ok(())
    }
    
  2. Use a non-self-describing format, such as bincode, which can serialize to a very compact representation because the (static) type information is always provided (and required) when deserializing. The problem with this approach is that dynamically typed deserialization won't work:

    use anyhow::Result;
    use serde_json::Value;
    
    fn main() -> Result<()> {
        let v: Vec<u8> = (0..=u8::MAX).cycle().take(u16::MAX.into()).collect();
        let bincoded = bincode::serialize(&v)?;
        println!("v.len() = {}, bincoded.len() = {}", v.len(), bincoded.len());
        let value: Value = bincode::deserialize(&bincoded)?; // this fails
        Ok(())
    }
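For a feel of what "non-self-describing" means on the wire, here is a std-only sketch that mimics what I believe bincode 1.x does by default for a Vec<u8>: a fixed-width little-endian u64 length followed by the raw bytes. Treat that layout as an assumption; the function names are invented for this example and the bincode crate itself is not used.

```rust
/// Encode as: 8-byte little-endian length, then the raw bytes.
/// There are no type markers, so the reader must already know it is a byte vector.
fn encode_len_prefixed(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + data.len());
    out.extend_from_slice(&(data.len() as u64).to_le_bytes());
    out.extend_from_slice(data);
    out
}

/// Decode the layout above; returns None if the input is malformed.
fn decode_len_prefixed(input: &[u8]) -> Option<Vec<u8>> {
    if input.len() < 8 {
        return None;
    }
    let (head, body) = input.split_at(8);
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(head);
    let len = u64::from_le_bytes(len_bytes);
    if body.len() as u64 != len {
        return None;
    }
    Some(body.to_vec())
}

fn main() {
    // Same test data as in the snippets above: 65535 bytes cycling through 0..=255.
    let v: Vec<u8> = (0..=u8::MAX).cycle().take(u16::MAX as usize).collect();
    let encoded = encode_len_prefixed(&v);
    println!("v.len() = {}, encoded.len() = {}", v.len(), encoded.len());
    assert_eq!(decode_len_prefixed(&encoded), Some(v));
}
```

The overhead is a constant 8 bytes regardless of the byte values, but nothing in the output says "this is a byte vector", which is why decoding into a dynamically typed Value cannot work.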
    

Thanks, I have switched to bincode which seems to be working fine, and is producing serialised values of the expected size.
