Compact Binary Serialization with Serde?


#1

I want to serialize a vector using a u32 or smaller for its length prefix. Serde uses u64 by default, and since there may be a lot of small vectors, a lot of space would be wasted.

Is this possible with serde, and does it make sense to switch to serde? In the docs, the method for sequences always takes the length, and therefore its type, as a parameter (https://docs.serde.rs/serde/trait.Serializer.html#tymethod.serialize_seq), which doesn’t seem to help here.


#2

The binary representation of sequences depends on the data format you are using, not on Serde. For example, MessagePack stores array lengths under 16 inside the one-byte type marker, lengths under 65536 in 2 bytes after the marker, and larger lengths in 4 bytes after the marker (implementation), so even your u32 length would seem wasteful in comparison.
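To make the length encoding concrete, here is a minimal sketch of how a MessagePack array header can be produced, following the array format family in the MessagePack spec (fixarray, array 16, array 32). The function name `msgpack_array_header` is made up for illustration and is not taken from any particular crate.

```rust
/// Build the MessagePack header that precedes the elements of an array
/// with `len` entries. (Hypothetical helper, not a real crate API.)
fn msgpack_array_header(len: u32) -> Vec<u8> {
    if len < 16 {
        // fixarray: the length lives in the low 4 bits of the marker byte.
        vec![0x90 | len as u8]
    } else if len <= u32::from(u16::MAX) {
        // array 16: marker 0xdc followed by a big-endian u16 length.
        let mut out = vec![0xdc];
        out.extend_from_slice(&(len as u16).to_be_bytes());
        out
    } else {
        // array 32: marker 0xdd followed by a big-endian u32 length.
        let mut out = vec![0xdd];
        out.extend_from_slice(&len.to_be_bytes());
        out
    }
}
```

So a 3-element array costs a single header byte, while a 16-element array costs three (marker plus two length bytes).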

That said, if you care about compactness you are going to see better results from using a real compression algorithm applied to the entire serialized data. At that point it will hardly matter what underlying format you use. Try chaining together bincode::serialize_into with a DeflateEncoder or similar from the flate2 crate.


#3

Thanks, I was looking for a real example. Currently I do it almost exactly the same way as MessagePack, but since I serialize Vec<u32>, I apply the variable-length encoding to every element of the vector, not just to its length. Since the numbers are mostly small values, this already compresses better and is faster than a general-purpose compression algorithm (Vint Snappy Comparison).
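For readers who want to see the general idea, here is a minimal sketch of a variable-length integer encoding applied to both the length and the elements of a Vec<u32>. It is independent of the code linked in this post: it assumes a LEB128-style layout (7 payload bits per byte, high bit as continuation flag), and all function names are hypothetical.

```rust
use std::io::{self, Read, Write};

/// Write `v` using 7 bits per byte; the high bit marks "more bytes follow".
fn write_varint<W: Write>(w: &mut W, mut v: u32) -> io::Result<()> {
    while v >= 0x80 {
        w.write_all(&[((v & 0x7f) as u8) | 0x80])?;
        v >>= 7;
    }
    w.write_all(&[v as u8])
}

/// Read a value written by `write_varint`, rejecting encodings that overflow u32.
fn read_varint<R: Read>(r: &mut R) -> io::Result<u32> {
    let mut result = 0u32;
    let mut shift = 0u32;
    loop {
        let mut byte = [0u8; 1];
        r.read_exact(&mut byte)?;
        let b = byte[0];
        // A fifth byte may only carry the top 4 bits and must be the last one.
        if shift >= 32 || (shift == 28 && b > 0x0f) {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "varint overflows u32"));
        }
        result |= u32::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            return Ok(result);
        }
        shift += 7;
    }
}

/// Encode the length and then every element as a varint.
fn encode_vec<W: Write>(w: &mut W, values: &[u32]) -> io::Result<()> {
    if values.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "too many elements"));
    }
    write_varint(w, values.len() as u32)?;
    for &v in values {
        write_varint(w, v)?;
    }
    Ok(())
}

/// Decode a vector written by `encode_vec`.
fn decode_vec<R: Read>(r: &mut R) -> io::Result<Vec<u32>> {
    let len = read_varint(r)?;
    // Grow on demand instead of trusting the length for a large pre-allocation.
    let mut out = Vec::new();
    for _ in 0..len {
        out.push(read_varint(r)?);
    }
    Ok(out)
}
```

With this layout, any value below 128 (including a short vector’s length) takes a single byte.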

The linked code does the actual conversion, which is called from the serde serializer here.
I browsed a lot of implementations, and most of them have something like this: a struct containing a writer that the data is written to.

impl<'a, W> serde::Serializer for &'a mut Serializer<W>
where
    W: Write

Is this best practice? It’s a little confusing, since I didn’t find this pattern in the serde documentation.
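One way to read this pattern: the methods of serde::Serializer take `self` by value, so implementing the trait for `&mut Serializer<W>` lets each call consume a short-lived mutable reborrow while the struct and its writer stay usable afterwards. The sketch below illustrates only that mechanism. `MiniSerializer` is a hypothetical stand-in for serde::Serializer so the example builds without external crates, and the fixed-width byte layout is an arbitrary assumption.

```rust
use std::io::{self, Write};

/// Hypothetical, heavily reduced stand-in for `serde::Serializer`.
/// As in serde, every method consumes `self`.
trait MiniSerializer: Sized {
    fn serialize_u32(self, v: u32) -> io::Result<()>;
    fn serialize_bool(self, v: bool) -> io::Result<()>;
}

/// The struct owns the writer that all output goes to.
struct Serializer<W> {
    writer: W,
}

/// Implemented for a mutable reference, not for the struct itself.
impl<'a, W: Write> MiniSerializer for &'a mut Serializer<W> {
    fn serialize_u32(self, v: u32) -> io::Result<()> {
        self.writer.write_all(&v.to_le_bytes())
    }

    fn serialize_bool(self, v: bool) -> io::Result<()> {
        self.writer.write_all(&[v as u8])
    }
}

/// Serialize two values in a row with the same serializer.
fn write_pair<W: Write>(ser: &mut Serializer<W>, a: u32, b: bool) -> io::Result<()> {
    // Each call consumes a fresh reborrow `&mut *ser`, not the serializer.
    (&mut *ser).serialize_u32(a)?;
    (&mut *ser).serialize_bool(b)
}
```

Had the trait been implemented for `Serializer<W>` directly, the first by-value call would have moved the serializer and the second call would not compile.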


#4

FWIW my favorite serialization format for resource-constrained environments is CBOR. I’ve used serde_cbor in multiple projects with excellent results. Highly recommended over any DIY solution.


#5

I previously tested rust_cbor, but it is quite slow at only 50 MB/s, compared to 1 GB/s for my solution. I don’t know if this is due to rustc_serialize being used.

serde_cbor is much faster at around 350 MB/s, but its output is already larger than bincode’s for ranges like (199_990..200_000).
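A rough size calculation shows how that can happen. This is only a sketch under stated assumptions: bincode’s default fixed-width layout (8-byte length, 4 bytes per u32), CBOR integers in their shortest form as described in the CBOR spec, and a LEB128-style varint for comparison. The function names are made up and no serializer is actually invoked.

```rust
/// Bytes needed for a CBOR unsigned integer or array header in shortest form.
fn cbor_head_len(v: u64) -> usize {
    match v {
        0..=23 => 1,                  // value fits in the initial byte
        24..=0xff => 2,               // initial byte + 1 byte
        0x100..=0xffff => 3,          // initial byte + 2 bytes
        0x1_0000..=0xffff_ffff => 5,  // initial byte + 4 bytes
        _ => 9,                       // initial byte + 8 bytes
    }
}

/// Estimated CBOR size of an array of unsigned integers.
fn cbor_size(values: &[u32]) -> usize {
    cbor_head_len(values.len() as u64)
        + values.iter().map(|&v| cbor_head_len(u64::from(v))).sum::<usize>()
}

/// Estimated size with a fixed u64 length prefix and fixed 4-byte elements.
fn fixed_width_size(values: &[u32]) -> usize {
    8 + 4 * values.len()
}

/// Bytes needed for a LEB128-style varint (7 payload bits per byte).
fn varint_len(mut v: u64) -> usize {
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

/// Estimated size with a varint length prefix and varint elements.
fn varint_size(values: &[u32]) -> usize {
    varint_len(values.len() as u64)
        + values.iter().map(|&v| varint_len(u64::from(v))).sum::<usize>()
}
```

For the ten values in 199_990..200_000 this gives 48 bytes for the fixed-width layout, 51 bytes for CBOR (every value is above 65535 and needs 5 bytes), and 31 bytes for the varint layout.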

I would rather use something existing, but compression ratio and speed are crucial in this component.


#6

That’s a good point. It is worth mentioning that CBOR is designed as a JSON-like replacement for resource-constrained environments, not for raw performance.


#7

FWIW, a while ago I forked bincode to implement variable-length encoding of integers (leb128) and several options for packing floats (e.g. as f16). It also supports bit vectors etc.
You can’t encode the data any smaller than this without extracting patterns into a lookup table, as a compression algorithm does:

All integer types use variable length encoding, taking only the necessary number of bytes. This includes e.g. enum tags, Vec lengths and the elements of Vecs. Tuples and structs are encoded by encoding their fields one-by-one, and enums are encoded by first writing out the tag representing the variant and then the contents. Floats can be encoded in their original precision, half precision (f16), always f32 or at half of their original precision.
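As a rough illustration of the layout described above (this is not the fork’s actual code), the sketch below writes an enum as a varint tag followed by its fields, one by one, with Vec lengths and elements also written as varints. The `Event` type, the tag numbers, and the function names are hypothetical.

```rust
/// Append `v` as an unsigned LEB128 varint.
fn put_varint(out: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        out.push(((v & 0x7f) as u8) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Hypothetical message type used only for this example.
enum Event {
    Ping,
    Move { x: u32, y: u32 },
    Scores(Vec<u16>),
}

/// Write the variant tag first, then the variant's contents field by field.
fn encode_event(out: &mut Vec<u8>, event: &Event) {
    match event {
        Event::Ping => put_varint(out, 0),
        Event::Move { x, y } => {
            put_varint(out, 1);
            put_varint(out, u64::from(*x));
            put_varint(out, u64::from(*y));
        }
        Event::Scores(scores) => {
            put_varint(out, 2);
            // Vec: varint length, then each element as a varint.
            put_varint(out, scores.len() as u64);
            for &s in scores {
                put_varint(out, u64::from(s));
            }
        }
    }
}
```

Here `Event::Move { x: 5, y: 300 }` takes 4 bytes (tag, one byte for 5, two bytes for 300), where fixed-width fields plus a 4-byte tag would take 12.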

I did this for a fast-paced multiplayer game project (using Enet) that I stopped working on, so I haven’t updated it since. It works well and all the tests pass, but I haven’t really needed it, and serde changed its API a lot after that (this was pre-1.0 serde).

I’ll gladly accept PRs that bring it up to date with recent serde/bincode versions :)