Choice of format for serialisation

I am having trouble finding a good format for serialising data. I have been using msgpack, but recently realised that it doesn't do a great job on large byte Vecs. As it is a binary format, I would expect the serialised data to carry only a fixed overhead, but that does not seem to be the case.

For example, when serialising a 2.3 MB Vec of bytes, I would expect the result to be 2.3 MB plus a few dozen bytes at most. Instead it comes out at about 3.6 MB.

Is there a better choice?

It seems you're serializing it as a base64 string by accident. It works in my simple test.

Cargo.toml:

[package]
name = "foo"
version = "0.1.0"
edition = "2021"

[dependencies]
rmp-serde = "1.1"
serde = { version = "1", features = ["derive"] }

src/main.rs:

#[derive(Debug, serde::Serialize)]
struct Msg {
    foo: Vec<u8>,
}

fn main() {
    let msg = Msg { foo: vec![42; 2_300_000] };
    let pack = rmp_serde::to_vec(&msg).unwrap();
    println!("size: {}", pack.len());
}

Result:

size: 2300006

Hmm, thanks. I must be missing something. Will investigate what has happened.

Aha. If you change 42 to 250, the output is

size: 4600006

This is because msgpack encodes integers differently depending on their value, for efficiency. See uint.rs in the 3Hren/msgpack-rust repository on GitHub. Note that a u8 above 127 takes up 2 bytes this way: a marker byte followed by the value.
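To make the arithmetic concrete, here is a minimal std-only sketch of the size calculation. It assumes the MessagePack rules that values 0 to 127 are a one-byte positive fixint and values 128 to 255 are a `uint 8` (marker byte plus value), and that rmp-serde writes the one-field struct as a one-element array costing one extra byte. The function names are made up for illustration and are not part of rmp.

```rust
/// Encoded size of one u8 written as a MessagePack integer.
fn int_len(b: u8) -> usize {
    if b <= 0x7f {
        1 // positive fixint: the value is its own marker
    } else {
        2 // uint 8: marker 0xcc followed by the value
    }
}

/// Size of a MessagePack array header for `len` elements.
fn array_header_len(len: usize) -> usize {
    match len {
        0..=15 => 1,      // fixarray
        16..=65_535 => 3, // array 16
        _ => 5,           // array 32
    }
}

/// Size of a byte slice written as an array of integers.
fn size_as_int_array(data: &[u8]) -> usize {
    array_header_len(data.len()) + data.iter().map(|&b| int_len(b)).sum::<usize>()
}

/// Assumption: the enclosing one-field struct costs one extra byte.
fn msg_size(data: &[u8]) -> usize {
    1 + size_as_int_array(data)
}

fn main() {
    println!("size: {}", msg_size(&vec![42u8; 2_300_000]));
    println!("size: {}", msg_size(&vec![250u8; 2_300_000]));
}
```

Under those assumptions the two lines reproduce the 2300006 and 4600006 reported above. Data with a mix of byte values lands somewhere in between, which would be consistent with roughly 3.6 MB for a 2.3 MB input.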

Also note that msgpack allows for efficient sending of Vec<u8>, and of other Vecs, via formats described in spec.md in the msgpack/msgpack repository on GitHub. I'm just having a hard time figuring out how to do it via rmp-serde. In my own usage of rmp I've been using write_value from rmpv::encode.


Oh, in that case the problem is that serde can't specialize Vec<u8> over its generic Vec<T> impl, so the bytes are serialized as a sequence of integers. If you control the serialized types, try containers other than Vec<u8> that are specialized for bytes, like the ones from the bytes crate or BString from the bstr crate.
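A rough illustration of why a dedicated byte type helps. The `Encode` trait and `ByteBlob` wrapper below are hypothetical stand-ins, not serde's real API: the generic `Vec<T>` impl can only delegate to each element, while the wrapper type gets its own impl and can write a single binary blob. For brevity the sketch always uses the 32-bit headers, which is valid MessagePack but not always the shortest form.

```rust
/// Hypothetical stand-in for a serialisation trait (not serde's API).
trait Encode {
    fn encode(&self, out: &mut Vec<u8>);
}

impl Encode for u8 {
    fn encode(&self, out: &mut Vec<u8>) {
        if *self > 0x7f {
            out.push(0xcc); // uint 8 marker for values above 127
        }
        out.push(*self);
    }
}

/// Generic impl: it knows nothing about T, so it encodes element by element.
impl<T: Encode> Encode for Vec<T> {
    fn encode(&self, out: &mut Vec<u8>) {
        out.push(0xdd); // array 32 marker
        out.extend_from_slice(&(self.len() as u32).to_be_bytes());
        for item in self {
            item.encode(out);
        }
    }
}

/// Hypothetical byte wrapper, playing the role of a bytes-specific container.
struct ByteBlob(Vec<u8>);

impl Encode for ByteBlob {
    fn encode(&self, out: &mut Vec<u8>) {
        out.push(0xc6); // bin 32 marker
        out.extend_from_slice(&(self.0.len() as u32).to_be_bytes());
        out.extend_from_slice(&self.0); // raw bytes, no per-element marker
    }
}

fn encoded_len<T: Encode>(value: &T) -> usize {
    let mut out = Vec::new();
    value.encode(&mut out);
    out.len()
}

fn main() {
    let data = vec![250u8; 1000];
    println!("generic Vec<u8>: {}", encoded_len(&data));
    println!("byte wrapper:    {}", encoded_len(&ByteBlob(data)));
}
```

In this sketch the generic path costs two bytes per element for values above 127, while the wrapper costs a fixed five-byte header plus the raw length.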

Two other possibilities:

  1. Use a format that natively supports binary blobs, for example my Neodyn Exchange crate:

    use anyhow::Result;
    use neodyn_xc::Value;
    
    fn main() -> Result<()> {
        let v: Vec<u8> = (0..=u8::MAX).cycle().take(u16::MAX.into()).collect();
        let vlen = v.len();
        let neodyned = neodyn_xc::to_bytes(&Value::Blob(v))?;
    
        println!("v.len() = {}, neodyned.len() = {}", vlen, neodyned.len());

        Ok(())
    }
    
  2. Use a non-self-describing format, such as bincode, which can serialize to the most compact representation possible, as the (static) type information will always be provided (and required). The problem with this approach is that dynamically typed deserialization won't work, as this Playground example shows:

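    use anyhow::Result;
    use serde_json::Value;
    
    fn main() -> Result<()> {
        let v: Vec<u8> = (0..=u8::MAX).cycle().take(u16::MAX.into()).collect();
        // Statically typed serialization: length prefix plus the raw bytes.
        let bincoded = bincode::serialize(&v)?;
        println!("v.len() = {}, bincoded.len() = {}", v.len(), bincoded.len());
        // Dynamically typed deserialization: the format carries no type
        // information, so this returns an error at runtime.
        let _value: Value = bincode::deserialize(&bincoded)?; // this fails
        Ok(())
    }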
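For intuition on what "non-self-describing" means here, a minimal std-only sketch of such a layout follows. It assumes the scheme I believe bincode 1.x uses by default for a Vec<u8>, namely an 8-byte little-endian length followed by the raw bytes; the `encode_bytes` and `decode_bytes` helpers are hypothetical and not bincode's API.

```rust
/// Length-prefixed encoding: 8-byte little-endian length, then the raw bytes.
/// (Assumed to mirror bincode 1.x defaults; not bincode's actual code.)
fn encode_bytes(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + data.len());
    out.extend_from_slice(&(data.len() as u64).to_le_bytes());
    out.extend_from_slice(data);
    out
}

/// Decoding only works because the caller already knows the payload is a
/// byte vector; nothing in the encoded data says so.
fn decode_bytes(input: &[u8]) -> Option<Vec<u8>> {
    if input.len() < 8 {
        return None; // not even a complete length prefix
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&input[..8]);
    let len = u64::from_le_bytes(len_bytes) as usize;
    // Returns None if the declared length exceeds the remaining input.
    input[8..].get(..len).map(|payload| payload.to_vec())
}

fn main() {
    let v: Vec<u8> = (0..=u8::MAX).cycle().take(u16::MAX as usize).collect();
    let encoded = encode_bytes(&v);
    println!("v.len() = {}, encoded.len() = {}", v.len(), encoded.len());
    assert_eq!(decode_bytes(&encoded), Some(v));
}
```

The overhead is a constant eight bytes regardless of the byte values, but a reader that does not know the expected type has no marker to tell a byte vector apart from anything else.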
    

Thanks, I have switched to bincode which seems to be working fine, and is producing serialised values of the expected size.