Overwhelmed by the vast variety of serialization formats. Which to use when?

jbe · January 30, 2023, 3:28pm

In a couple of scenarios, I need to exchange data over a network. In some of those scenarios, the format is pretty rigid, in other cases it's more dynamic, i.e. with an evolving format where fields might be added over time.

I guess using serde is a good way to go, but which serialization format to use in which scenario?

I'm generally fond of simple and easy to understand formats, but I also need to be 8-bit clean (as in transmitting opaque 8-bit data as part of the messages), which almost rules out JSON, I guess (as base64 encoding or representing 8-bit data as an array of numbers doesn't seem to be very efficient).
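To put rough numbers on that efficiency concern, here is a minimal sketch that compares the size of raw bytes with their base64 form and with a compact JSON array of numbers. The helper names and the 256-byte payload are made up for illustration; only the standard library is used.

```rust
/// Length of the standard padded base64 encoding of `n` bytes:
/// every 3 input bytes (rounded up) become 4 output characters.
fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}

/// Length of a compact JSON array of numbers, e.g. `[0,17,255]`.
fn json_array_len(data: &[u8]) -> usize {
    if data.is_empty() {
        return 2; // "[]"
    }
    let digits: usize = data.iter().map(|b| b.to_string().len()).sum();
    let commas = data.len() - 1;
    digits + commas + 2 // plus the two brackets
}

fn main() {
    // Hypothetical payload: every possible byte value exactly once.
    let payload: Vec<u8> = (0..=255).collect();
    println!("raw bytes:         {}", payload.len());
    // A JSON string additionally needs two quote characters.
    println!("base64 in JSON:    {}", base64_len(payload.len()) + 2);
    println!("JSON number array: {}", json_array_len(&payload));
}
```

For this payload, base64 costs roughly a third more than the raw bytes, while the number array is about 3.5 times as large.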

As far as I understood, serialization formats fall more or less into one of two categories:

- self-describing formats (such as JSON, CBOR, MessagePack, Pot)
- non-self-describing formats (such as Postcard)

Please correct me if I'm wrong.

However, I feel like the boundary between those is somewhat blurry. That is because the type systems are not consistent across all serialization formats. I cannot really "describe" a timestamp in JSON (i.e. a UNIX timestamp would be encoded the same way as an integer, or encoded as a string), but I can tag a value as a timestamp in CBOR or MessagePack; whereas in Postcard, I can't even distinguish between numbers and strings in the binary message (and I must know whether to expect a number or a string).
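The Postcard point can be illustrated with a small hand-rolled sketch. It does not use the postcard crate; it only mimics the idea of an untagged encoding (a string as a length prefix plus UTF-8 bytes, a u8 as a single byte), and both function names are hypothetical. The same two bytes are valid under either reading, so the reader has to know the expected type in advance.

```rust
/// Read the bytes as a length-prefixed UTF-8 string
/// (single-byte length prefix, for brevity).
fn read_as_string(bytes: &[u8]) -> Option<String> {
    let (&len, rest) = bytes.split_first()?;
    let text = rest.get(..len as usize)?;
    String::from_utf8(text.to_vec()).ok()
}

/// Read the very same bytes as two plain u8 numbers.
fn read_as_pair(bytes: &[u8]) -> Option<(u8, u8)> {
    match bytes {
        [a, b] => Some((*a, *b)),
        _ => None,
    }
}

fn main() {
    let bytes = [1u8, 65];
    println!("{:?}", read_as_string(&bytes)); // Some("A")
    println!("{:?}", read_as_pair(&bytes)); // Some((1, 65))
}
```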

On the other hand, I can perfectly encode a JSON document in the Postcard format (the following code assumes that the enum discriminants of serde_json::Value are stable):

fn main() {
    // Parse a JSON document into a dynamic value, then re-encode it with Postcard.
    let json: serde_json::Value = serde_json::from_str(r#"{"A": true}"#).unwrap();
    let bytes = postcard::to_stdvec(&json).unwrap();
    println!("{bytes:?}");
}

Output:

[1, 1, 65, 1]

I read here that CBOR was inspired by MessagePack. What does CBOR do differently in regard to MessagePack? And how about "Pot" or other formats?

What would be your general advice when choosing a serialization format? Which formats are well-established? Which ones are known to cause trouble in certain scenarios?

Heliozoa · January 30, 2023, 4:20pm

When using serde, in my opinion, the distinction between self-describing or not isn't that important. Either way you probably have a struct or enum that completely describes the format you're using. I would say the more important distinction is whether the format is human readable or not.

If I don't need the messages to be human readable, I default to bincode: https://docs.rs/bincode/

If I do, then I go for JSON with serde_json. As much as I don't like JSON, I've been bitten by weird YAML edge cases too many times, and TOML becomes very confusing when your data isn't very flat (tables within tables...).

I think it's just a confusing name, and the actual difference is whether an arbitrary message of said format can be parsed. You can turn any JSON into a serde_json::Value, but the same is not true for postcard; as you said there's no way to know the meaning of a given byte without knowing the format.

jofas · January 30, 2023, 4:34pm

I find the question of what you are trying to develop interesting for choosing the right protocols. Most formats arise from certain industries/environments and are therefore more common in that given field (e.g. JSON in web development, pickle in the Python world, protobufs in gRPC, binary formats where message size is important, like on embedded devices, and so on). It would be weird to me if you were to choose TOML for a REST service, for example.

Also, I very much enjoy working with serde. So given the choice, I'd probably pick a protocol that is supported by serde. Be aware, though, that there is serialization support beyond serde in Rust. An example would be prost as a Protocol Buffers implementation.

jbe · January 30, 2023, 5:47pm

The final definition of the struct/enum is yet to be decided upon, and it will be passed to sandboxed script interpreters as well as being transmitted over the network. Depending on the interchange format, I might make design decisions in regard to that struct/enum. If I want interoperability with JSON, it could simply end up being serde_json::Value (where I have trouble with 8-bit data), but I'm not sure yet.

That seems pretty similar to Postcard, just differently flavored. What I liked about Postcard is the simple specification. Bincode, in contrast, has different ways/variants to encode integers. If I understand it right, then the only real difference between Postcard and bincode is how numbers are encoded:

- Postcard uses LEB128 (always little-endian) with ZigZag encoding for signed integers
- bincode must/can be configured in regard to endianness (big, little, or native endianness) and whether integers are stored
  - with bincode's VarintEncoding
  - or as fixed-width integers
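The two integer styles in the list above can be sketched with the standard library alone. This is a generic illustration of unsigned LEB128 and ZigZag next to fixed-width little/big-endian bytes, not code taken from either crate; the function names are my own.

```rust
/// ZigZag maps signed integers to unsigned ones so that small magnitudes
/// stay small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
fn zigzag(n: i32) -> u32 {
    ((n << 1) ^ (n >> 31)) as u32
}

/// Unsigned LEB128: 7 payload bits per byte, least significant group first,
/// with the high bit set on every byte except the last one.
fn leb128(mut value: u32) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}

fn main() {
    // Variable-length: small numbers take fewer bytes.
    println!("{:?}", leb128(300)); // [172, 2]
    println!("{:?}", leb128(zigzag(-1))); // [1]
    // Fixed-width: always 4 bytes for a u32, and byte order matters.
    println!("{:?}", 300u32.to_le_bytes()); // [44, 1, 0, 0]
    println!("{:?}", 300u32.to_be_bytes()); // [0, 0, 1, 44]
}
```

The variable-length form saves space for small values but can need five bytes for a large u32, which is the trade-off behind making it configurable.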

So a "self-describing format" is a format for which there exists some type (e.g. serde_json::Value) which every valid document can be converted into. But the usefulness of such a dynamic value can vary a lot, depending on the nature of that type. If I understand it right, then a non-self-describing format doesn't have such a type into which that format can always be parsed.^[1]

I guess I should generally take a look at Protocol Buffers too (wire format here), to see if it's suitable for my needs.

[1] Though formally there exists the trivial case of "parsing" the document into a Vec<u8> by performing a no-op. This also demonstrates that the usefulness of such a type may vary drastically.

semicoleon · January 30, 2023, 5:59pm

A self-describing format can be decoded into the format's data model^[1] without requiring knowledge of the structure you encoded. Such formats are generally slightly larger since they have to encode type information somehow, but they are also more forgiving of slightly mismatched versions of your data structures being sent.
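As a toy illustration of that idea, the sketch below defines a made-up tagged format (one tag byte per value, invented just for this example and not CBOR or MessagePack) and decodes arbitrary messages into a dynamic `Value` without knowing what struct produced them. The tag bytes are exactly the size overhead mentioned above.

```rust
#[derive(Debug, PartialEq)]
enum Value {
    Bool(bool),
    Number(u8),
    Text(String),
    List(Vec<Value>),
}

/// Decode one value from the front of `input`; return it and the unread rest.
/// Hypothetical tags: 0 = bool, 1 = number, 2 = text, 3 = list.
fn decode(input: &[u8]) -> Option<(Value, &[u8])> {
    let (&tag, rest) = input.split_first()?;
    match tag {
        0 => {
            let (&b, rest) = rest.split_first()?;
            Some((Value::Bool(b != 0), rest))
        }
        1 => {
            let (&n, rest) = rest.split_first()?;
            Some((Value::Number(n), rest))
        }
        2 => {
            let (&len, rest) = rest.split_first()?;
            let len = len as usize;
            if rest.len() < len {
                return None;
            }
            let (text, rest) = rest.split_at(len);
            Some((Value::Text(String::from_utf8(text.to_vec()).ok()?), rest))
        }
        3 => {
            let (&count, mut rest) = rest.split_first()?;
            let mut items = Vec::new();
            for _ in 0..count {
                let (item, remaining) = decode(rest)?;
                items.push(item);
                rest = remaining;
            }
            Some((Value::List(items), rest))
        }
        _ => None, // unknown tag
    }
}

/// Decode a whole message, rejecting trailing bytes.
fn decode_all(input: &[u8]) -> Option<Value> {
    match decode(input)? {
        (value, []) => Some(value),
        _ => None,
    }
}

fn main() {
    // A list of two items: the text "A" and the boolean true.
    let message = [3u8, 2, 2, 1, 65, 0, 1];
    println!("{:?}", decode_all(&message));
}
```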

Personally I generally reach for CBOR these days if I don't have a hard requirement on JSON, since it can encode binary data but otherwise generally matches the JSON data model which makes it easy to understand and work with for most people.

[1] That doesn't necessarily mean the decoded data will exactly match the types that were encoded. As you note, for things like dates there are generally several types in the data model that could be chosen. Decoding that into a date type requires some additional knowledge, but the format itself is still self-describing.

Heliozoa · January 30, 2023, 6:24pm

Looking at https://github.com/djkoloski/rust_serialization_benchmark it seems like postcard is faster/smaller in most cases, so it might be the better choice between the two if that's the kind of format you end up choosing. When I started using bincode, postcard was still very early in development or not even released, and I haven't really thought about it since. I'll be sure to give it a try in the future.

ZiCog · January 30, 2023, 6:28pm

Because I'm lazy, this recently happened:

We have a remote system collecting data over serial line from an attached device and forwarding those raw bytes over NATS messages to a cloud server for processing.

Then came a requirement to add a timestamp to those messages. So, I just defined a struct with a timestamp field and a Vec of bytes. serde converts that to JSON for transmission.

Ahhhgg... you say, that is terrible, all that processing to create and parse JSON, all that wasted bandwidth as each original raw byte is now up to three digits and comma and a space in JSON. And it's not human readable anyway. I thought so too.

Turns out it works just fine. Performance is not noticeably different, CPU usage is not noticeably different. Everyone is happy.

CAD97 · January 30, 2023, 6:52pm

FYI, if you're using serde to manage de/serialization, everything goes through the serde data model. There's no^[1] way to pass data through except via said model, so even in formats with richer types on offer, you're limited to the common denominator provided by serde.

[1] There are ways to construct secondary communication channels, like the one used for the arbitrary precision feature of serde_json. That's generally quite fragile, though, and there's no standard on how to go about doing so.

jbe · January 30, 2023, 7:00pm

I'm not yet so familiar with serde, but I thought the newtype_struct, for example, could be used to pass additional type information through that model.

See Serializer::serialize_newtype_struct. It gets a type name (same as most other trait methods of Serializer also do). Couldn't this be used to encode type information that goes beyond serde's type model?

I think there are some caveats. In particular, I can't provide an implementation of Serialize for 3rd party or std data types such as std::time::Instant (see Playground). Also, for Serializers and Serialize to play well together, there would need to be some common ground on how certain types are named. That name: &'static str argument to the serialize methods is usable only as some sort of opaque name, I guess.

Not sure if serde's model is very useful then. (I also was confused by it in the past).

I will try to look into Protocol Buffers and see if the specification and available Rust implementations are usable for my case. With my limited overview, it currently appears most promising (and reasonable) to me.

jbe · January 31, 2023, 11:41am

Apparently Protocol Buffers are capable of storing dynamic structs by using google.protobuf.Struct. If I don't need the ability to do a 1:1 mapping to JSON, I could use something similar but include a bytes scalar instead of a string scalar, or even both.

So Protocol Buffers seem to cover a wide range of scenarios, both where little is known about the transmitted data, as well as cases where the data is pretty rigid (in which case you don't need to transmit field names).
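The reason field names need not be transmitted is that each field is prefixed with a varint key computed as (field_number << 3) | wire_type. Below is a minimal hand-written encoder sketch for two of the wire types (0 = varint, 2 = length-delimited). The message layout (field 1 = timestamp, field 2 = payload) and all function names are hypothetical, and a real project would use a crate such as prost rather than this.

```rust
const WIRE_VARINT: u64 = 0;
const WIRE_LEN: u64 = 2;

/// Base-128 varint, least significant 7-bit group first.
fn put_varint(out: &mut Vec<u8>, mut value: u64) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// The key carries the field number and wire type instead of a field name.
fn put_key(out: &mut Vec<u8>, field_number: u64, wire_type: u64) {
    put_varint(out, (field_number << 3) | wire_type);
}

fn put_uint(out: &mut Vec<u8>, field_number: u64, value: u64) {
    put_key(out, field_number, WIRE_VARINT);
    put_varint(out, value);
}

fn put_bytes(out: &mut Vec<u8>, field_number: u64, data: &[u8]) {
    put_key(out, field_number, WIRE_LEN);
    put_varint(out, data.len() as u64);
    out.extend_from_slice(data); // opaque 8-bit data, copied as-is
}

/// Hypothetical message: field 1 = timestamp (uint64), field 2 = payload (bytes).
fn encode_sample(timestamp: u64, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    put_uint(&mut out, 1, timestamp);
    put_bytes(&mut out, 2, payload);
    out
}

fn main() {
    // Prints [8, 150, 1, 18, 2, 255, 0]
    println!("{:?}", encode_sample(150, &[0xff, 0x00]));
}
```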

I see there are several Rust crates which address Protocol Buffers:

protobuf (currently is looking for a new maintainer)
quick-protobuf
prost (as already mentioned by @jofas)
serde-protobuf (Maybe still in development? Says it doesn't support serialization.)

Any advice/comments on the first three crates listed? Edit: I opened a new thread regarding Protocol Buffers.

jbe · February 1, 2023, 3:32pm

For now, I'll settle on the following:

- Using JSON where I either
  - need interoperability with JSON, or
  - want a wire format that is human readable (i.e. human readable without prior conversion).
- Using Postcard where the underlying data structure doesn't change and where I don't need forward/backward compatibility.
- Using Protocol Buffers where I need forward and/or backward compatibility, i.e. where fields can be missing and/or will be added later.
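The forward/backward compatibility in the last item comes from the wire format itself: a reader can skip fields it does not know, and absent fields fall back to defaults. The following is a minimal hand-written reader sketch under the assumption of a hypothetical message whose "version 1" only knows field 1 (a uint64 timestamp). It handles wire types 0 and 2 only, and it is not how prost is implemented.

```rust
/// Read one varint from the front of `input` and advance past it.
fn take_varint(input: &mut &[u8]) -> Option<u64> {
    let mut value = 0u64;
    for shift in (0..64).step_by(7) {
        let (&byte, rest) = input.split_first()?;
        *input = rest;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
    }
    None // more than 10 bytes: not a valid varint
}

/// Extract field 1 and skip every field this reader does not recognize.
fn read_timestamp(mut input: &[u8]) -> Option<u64> {
    let mut timestamp = 0; // proto3-style default for an absent number
    while !input.is_empty() {
        let key = take_varint(&mut input)?;
        let (field_number, wire_type) = (key >> 3, key & 7);
        match wire_type {
            0 => {
                let value = take_varint(&mut input)?;
                if field_number == 1 {
                    timestamp = value;
                }
            }
            2 => {
                let len = take_varint(&mut input)? as usize;
                input = input.get(len..)?; // skip the length-delimited body
            }
            _ => return None, // other wire types omitted for brevity
        }
    }
    Some(timestamp)
}

fn main() {
    // Written by a newer sender that added fields 2 (bytes) and 3 (varint).
    let newer = [0x08, 0x96, 0x01, 0x12, 0x02, 0xff, 0x00, 0x18, 0x01];
    println!("{:?}", read_timestamp(&newer)); // Some(150)
    // Written by a sender that omitted the field entirely.
    println!("{:?}", read_timestamp(&[])); // Some(0)
}
```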

For the Rust side in regard to Protocol Buffers, I'll likely use prost. As pointed out by the developers, it seems to be tricky to integrate that with serde (see prost FAQ).

Using the protoc command line tool, I can convert any Protocol Buffer binary message into a human-readable format (see Protobuf Text Format) and vice versa, as long as I have the corresponding *.proto file which defines the data structure.

maraist · February 5, 2023, 5:33pm

If you ever want to debug, JSON and CBOR are great. You can dump debug output to a file and easily make it human readable (cbor.io for CBOR).

Beyond that, both formats are heavy for arrays of data. Each subfield gets VERY heavy. Taking an array of 3D points would be horrible in either of these formats for more than a few thousand entries.
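A rough, dependency-free sketch of that overhead: the same points written as JSON objects with named fields versus a packed run of little-endian f32 values. The `Point` type and both helpers are hypothetical, and the JSON is produced by hand rather than through serde. A CBOR map with text keys would be smaller than the JSON text, but as far as I know it still repeats the key names for every element.

```rust
/// Hypothetical point type, used only for this size comparison.
struct Point {
    x: f32,
    y: f32,
    z: f32,
}

/// JSON with named fields, written by hand to keep the sketch dependency-free.
fn to_json(points: &[Point]) -> String {
    let items: Vec<String> = points
        .iter()
        .map(|p| format!("{{\"x\":{},\"y\":{},\"z\":{}}}", p.x, p.y, p.z))
        .collect();
    format!("[{}]", items.join(","))
}

/// Packed layout: three little-endian f32 values per point, no names, no tags.
fn to_packed(points: &[Point]) -> Vec<u8> {
    let mut out = Vec::with_capacity(points.len() * 12);
    for p in points {
        for v in [p.x, p.y, p.z] {
            out.extend_from_slice(&v.to_le_bytes());
        }
    }
    out
}

fn main() {
    let points: Vec<Point> = (0..1000)
        .map(|i| Point { x: i as f32 + 0.5, y: 2.5, z: 3.5 })
        .collect();
    println!("JSON:   {} bytes", to_json(&points).len());
    println!("packed: {} bytes", to_packed(&points).len());
}
```

The packed form is always 12 bytes per point, while the JSON size depends on how many digits each coordinate needs on top of the repeated field names.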

I've been dabbling with ZeroVec with serde to allow JSON output in expensive mode (CSVs with named x, y, z headers), while CBOR will compact the ZeroVec to a contiguous fixed-length array of primitives. So 100 million elements is a single CBOR blob read/write. It's the best of both worlds, but becomes intrusive to your code.

A similar option is rkyv. It deviates completely from serde but basically allows memory mapping of an entire Rust struct tree to disk (similar to Cap'n Proto). It has no human readable output and won't work with anything but Rust.

A more portable version of rkyv is FlatBuffers (by Google). This has C/C++/JavaScript bindings to an IDL (like protobuf) but can be memory mapped like rkyv to allow multi-million-record vectors without per-element transcoding.

MessagePack and CBOR are very similar, but I think MessagePack DEFAULTS to removing field names (both CBOR and MessagePack can toggle that on or off).

I'd stick with a serde format. Postcard with ZeroVec is very fast; with JSON/CBOR it's very portable.

system · May 6, 2023, 5:34pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.
