Single CBOR serde crate


#1

I would love to have a single CBOR crate in Rust that supports everything already offered by @BurntSushi’s rust-cbor, @pyfisch’s serde_cbor, Toralf Wittner’s cbor-codec and possibly others. Right now, CBOR in Rust seems to be fractured across these crates, each offering different tradeoffs:

  • rust-cbor supports encoding and decoding of generic cbor::Cbor values (with CBOR tags) and anything implementing rustc_serialize::De/Encodable. Also, it implements ToJson and ToCbor for the generic values cbor::Cbor. However, rustc_serialize has been deprecated for some time now.

  • serde_cbor encodes and decodes anything with serde::De/Serialize, has a generic serde_cbor::Value (without CBOR tag support) and can serialize into serde_cbor::Value (in addition to a Writer). It is the closest to what I want, but it is no longer maintained by the author.

  • cbor-codec encodes and decodes to/from its generic cbor::value::Value type. Also has “direct” en/decoding, but only for primitive types. Has a crate name collision with cbor above. EDIT: Realized that it does not support rustc_serialize.

As I wrote, I would love to have a single polished crate, either called cbor (@BurntSushi offered to include it into his crate with a version bump) or serde_cbor, using SerDe with most of the features of the current crates, and I would be happy to pour some time and care into it. Most or all of the code is already out there anyway. However, I would like to see clearly where we are and what others (for example @BurntSushi, @pyfisch, @oli_obk, @erickt and @dtolnay) think or plan first. Let me know!

Namely, I would like to have a generic cbor::Value type (with tags and possibly rich key types), de/serialize types based on SerDe to/from binary and cbor::Value, and convert cbor::Value to/from JSON. For custom types and CBOR tags see below, but I would also be happy to make do with limited tag support for now (e.g. only having them in cbor::Value). Another nice feature would be to encode structs without field names (i.e. as tuples) to save bytes and time at the cost of portability, but that is just another feature.
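To make the wish list a bit more concrete, here is a minimal sketch of what such a generic value type could look like. It is purely hypothetical: the variant names, the use of i128 for integers, and the two helper methods are my assumptions and do not match any of the existing crates. Floats are left out so that Ord can be derived, which is what lets a map use arbitrary values as keys.

```rust
use std::collections::BTreeMap;

/// Hypothetical generic CBOR value (not the type of any existing crate).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum Value {
    Null,
    Bool(bool),
    /// CBOR integers span -2^64..=2^64-1, which does not fit in i64 or u64.
    Integer(i128),
    Bytes(Vec<u8>),
    Text(String),
    Array(Vec<Value>),
    /// "Rich" keys: any value may be a key, hence the Ord requirement.
    Map(BTreeMap<Value, Value>),
    /// A tag number wrapping an inner value, e.g. tag 1 for epoch time.
    Tagged(u64, Box<Value>),
}

impl Value {
    /// Returns the outermost tag number, if the value is tagged.
    pub fn tag(&self) -> Option<u64> {
        match self {
            Value::Tagged(tag, _) => Some(*tag),
            _ => None,
        }
    }

    /// Strips all outer tags, as a lossy conversion to JSON would have to.
    pub fn untagged(&self) -> &Value {
        let mut current = self;
        while let Value::Tagged(_, inner) = current {
            current = &**inner;
        }
        current
    }
}
```

With a type like this, tags survive a round trip through the generic value even if typed de/serialization has no tag support yet.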

CBOR tags

A bit of discussion on CBOR tags: Not the main point of the thread but I feel like we should not neglect it.

There has been some discussion on how to encode specific types with specific encoders. For example, a timestamp may be encoded just as an i64 (as in JSON and elsewhere) or it may be explicitly tagged with tag 1, as in section 2.4.1 of the CBOR RFC (RFC 7049).
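To show what the two options look like on the wire, here is a small hand-rolled sketch that encodes an epoch timestamp both ways. The function names are made up for illustration and are not part of any of the crates above; only non-negative integer timestamps are handled.

```rust
/// Appends a CBOR item head: a 3-bit major type plus an unsigned argument,
/// using the shortest of the encodings defined by the spec.
pub fn write_head(out: &mut Vec<u8>, major: u8, value: u64) {
    let m = major << 5;
    if value < 24 {
        out.push(m | value as u8);
    } else if value <= u64::from(u8::MAX) {
        out.push(m | 24);
        out.push(value as u8);
    } else if value <= u64::from(u16::MAX) {
        out.push(m | 25);
        out.extend_from_slice(&(value as u16).to_be_bytes());
    } else if value <= u64::from(u32::MAX) {
        out.push(m | 26);
        out.extend_from_slice(&(value as u32).to_be_bytes());
    } else {
        out.push(m | 27);
        out.extend_from_slice(&value.to_be_bytes());
    }
}

/// Timestamp as a bare unsigned integer (major type 0).
pub fn encode_epoch_plain(secs: u64) -> Vec<u8> {
    let mut out = Vec::new();
    write_head(&mut out, 0, secs);
    out
}

/// Timestamp wrapped in tag 1 (major type 6), i.e. "epoch-based date/time".
pub fn encode_epoch_tagged(secs: u64) -> Vec<u8> {
    let mut out = Vec::new();
    write_head(&mut out, 6, 1);
    write_head(&mut out, 0, secs);
    out
}
```

For 1363896240 the tagged form is c1 1a 51 4b 67 b0, matching the example in the RFC's appendix; the plain form is the same bytes without the leading c1. So the tag costs a single byte, and the question is only how a type gets to ask for it.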

In general, we would need a mechanism in serde to specialize the type de/encoding implementation for certain formats; this was already discussed here and here. As I understand it, once we have specialization in rustc, one way to do it would be to have serde use a parametric variant of Serialize, something along the lines of (a sketch):

trait SpecSerialize<S> where S: Serializer {
    fn spec_serialize(&self, serializer: S) -> Result<S::Ok, S::Error>;
}
// default SpecSerialize for any Serializer
impl<T, S> SpecSerialize<S> for T where T: Serialize, S: Serializer {
    default fn spec_serialize(&self, serializer: S) -> Result<S::Ok, S::Error> {
        self.serialize(serializer)
    }
}

Then you could write a specialization e.g. for SystemTime and CBOR (again, a sketch):

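impl SpecSerialize<CborSerializer> for std::time::SystemTime {
    fn spec_serialize(&self, serializer: CborSerializer) -> Result<<CborSerializer as Serializer>::Ok, <CborSerializer as Serializer>::Error> {
        // duration_since returns a Result: it fails for times before the epoch
        let since_epoch = self.duration_since(UNIX_EPOCH).map_err(serde::ser::Error::custom)?;
        // write_tagged_u64 is a hypothetical method of the CBOR serializer
        serializer.write_tagged_u64(1 /*tag number*/, since_epoch.as_secs())
    }
}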

Another way was proposed here, but I am not sure it is ideal (how about having different tags in different formats, or not having a fixed set of tags/types per format? Do we need to assume that all format variations behave as some form of “tags”, even for complex types?). However, I am not sure what the current consensus in serde is and would like to hear it.

Note that, as of 2017-12, specialization is still not fully working on nightly for type parameters (see e.g. #38516).

I would argue that more work on CBOR support, even without tags, is already really useful and worth a bit of polish. And quite likely, it should be possible to add tags once serde supports specialization.


#2

cc @sfackler (He has improved the serde_cbor crate a lot.)


#3

I basically have no immediate plans to maintain or improve the cbor crate. I’d be happy to relinquish the name assuming the next owner does a semver bump.


#4

I was curious so I made a very naive comparison of the libraries, just for an estimate of current and possible optimizations, using a simple benchmark. I got (on my Thinkpad X260, Core i5):

test decode_cbor             ... bench:   8,426,832 ns/iter (+/- 810,953)
test decode_cbor_value       ... bench:   6,759,760 ns/iter (+/- 569,747)
test decode_serde_cbor       ... bench:   1,510,933 ns/iter (+/- 120,446)
test decode_serde_cbor_value ... bench:   4,869,009 ns/iter (+/- 221,384)
test decode_serde_json       ... bench:   2,659,761 ns/iter (+/- 211,536)
# cbor: 12 MB/s, cbor generic: 16 MB/s, serde: 69 MB/s, serde generic: 22 MB/s

test encode_cbor             ... cbor len: 108930 bench:     380,511 ns/iter (+/- 84,328)
test encode_cbor_value       ... cbor len: 108930 bench:   1,996,772 ns/iter (+/- 1,309,161)
test encode_serde_cbor       ... cbor len: 108930 bench:     278,534 ns/iter (+/- 88,064)
test encode_serde_cbor_value ... cbor len: 108930 bench:     683,648 ns/iter (+/- 230,185)
test encode_serde_json       ... json len: 185601 bench:     981,522 ns/iter (+/- 174,770)
# cbor: 273 MB/s, cbor generic: 52 MB/s, serde: 371 MB/s, serde generic: 152 MB/s

So it seems that serde_cbor is generally faster (at least on this joke of a test with ~100kB of CBOR per item).


#5

Specialization is probably the best long-term way of doing this kind of thing, but I think we can take advantage of a trick used in the toml crate to get tag support earlier. The idea is that we have a magical struct TaggedValue<T> { tag: Option<u64>, value: T } type. It implements Deserialize and Serialize, but the struct name it provides to Deserializer::deserialize_struct and Serializer::serialize_struct is something special like __cbor_magic_tagged_type. The serializer/deserializer can then go into the special mode that passes the tag along rather than discarding it. I think this idea is due to @dtolnay, so he might have ideas as well.
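A toy model of the trick, to make the mechanics explicit. This deliberately does not use serde: ToySerializer is a stand-in trait I made up, the value is fixed to u64 instead of a generic T, and the output is a readable trace rather than real CBOR bytes. Only the magic-name check is the point.

```rust
/// Reserved struct name that a tag-aware format looks for.
pub const MAGIC_NAME: &str = "__cbor_magic_tagged_type";

/// Toy stand-in for a serde-like serializer (NOT serde's real trait).
pub trait ToySerializer {
    fn serialize_u64(&mut self, v: u64);
    /// Receives the struct name and its (field name, value) pairs.
    fn serialize_struct(&mut self, name: &'static str, fields: &[(&'static str, u64)]);
}

pub struct TaggedValue {
    pub tag: Option<u64>,
    pub value: u64,
}

impl TaggedValue {
    pub fn serialize<S: ToySerializer>(&self, s: &mut S) {
        match self.tag {
            // Announce ourselves under the magic name instead of a normal one.
            Some(tag) => s.serialize_struct(MAGIC_NAME, &[("tag", tag), ("value", self.value)]),
            None => s.serialize_u64(self.value),
        }
    }
}

fn write_map(out: &mut String, fields: &[(&'static str, u64)]) {
    let parts: Vec<String> = fields.iter().map(|(k, v)| format!("\"{}\": {}", k, v)).collect();
    out.push_str(&format!("{{{}}}", parts.join(", ")));
}

/// Format that recognizes the magic name and switches to "tag mode".
pub struct TagAware { pub out: String }

impl ToySerializer for TagAware {
    fn serialize_u64(&mut self, v: u64) { self.out.push_str(&v.to_string()); }
    fn serialize_struct(&mut self, name: &'static str, fields: &[(&'static str, u64)]) {
        if name == MAGIC_NAME && fields.len() == 2 {
            // Emit tag(value), in the style of CBOR diagnostic notation.
            self.out.push_str(&format!("{}({})", fields[0].1, fields[1].1));
        } else {
            write_map(&mut self.out, fields);
        }
    }
}

/// Format that has never heard of the magic name.
pub struct TagUnaware { pub out: String }

impl ToySerializer for TagUnaware {
    fn serialize_u64(&mut self, v: u64) { self.out.push_str(&v.to_string()); }
    fn serialize_struct(&mut self, _name: &'static str, fields: &[(&'static str, u64)]) {
        write_map(&mut self.out, fields);
    }
}

pub fn to_tag_aware(v: &TaggedValue) -> String {
    let mut s = TagAware { out: String::new() };
    v.serialize(&mut s);
    s.out
}

pub fn to_tag_unaware(v: &TaggedValue) -> String {
    let mut s = TagUnaware { out: String::new() };
    v.serialize(&mut s);
    s.out
}
```

The tag-aware format turns the wrapper into a tagged item, while any format that does not know the name sees an ordinary two-field struct and writes it out as a map.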


#6

That’s also what I resorted to in rust-cbor :laughing: https://github.com/BurntSushi/rust-cbor/blob/34af4f22a5482d633e2fef91fd8104467d1ccb33/src/encoder.rs#L317-L332


#7

@BurntSushi I have seen your neat hack and assumed that it was specific to rustc-serialize, but now I see that it probably isn’t.

Looking at it again, I am under the impression that the MyDataStructure as defined in your docs would not work with, say, a JSON encoder (it would compile and run, but would produce a JSON dict with two __cbor_tag_encode_* keys, since CborTagEncode uses the derived RustcEncodable). Did I get that right, or is there something I am missing?


#8

That sounds plausible? I haven’t looked at that code in ages so I don’t really have the context to answer unfortunately. If you’ve thought through it, then you’re probably right. :slight_smile:


#9

Yeah you’d end up with weird dicts like that.


#10

You can skip dict generation by checking is_human_readable, but this means you will still get a dict in any other binary codec.

Another good question is how to make a similar thing work for serde_yaml (which reports is_human_readable).
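A toy decision table for the above, again without serde itself. The two enums and both functions are hypothetical; they only model which side decides what. The value's Serialize impl sees nothing but the human-readable flag, and the format then handles whatever it is handed.

```rust
/// What the value's Serialize impl asks the serializer to write.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Request { BareValue, MagicStruct }

/// What ends up in the output.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Emitted { NativeTag, BareValue, WrapperMap }

/// Value side: the only thing it can check is the human-readable flag.
pub fn value_side(human_readable: bool) -> Request {
    if human_readable { Request::BareValue } else { Request::MagicStruct }
}

/// Format side: only a format that knows the magic name can emit a tag.
pub fn format_side(req: Request, knows_magic_name: bool) -> Emitted {
    match (req, knows_magic_name) {
        (Request::BareValue, _) => Emitted::BareValue,
        (Request::MagicStruct, true) => Emitted::NativeTag,
        (Request::MagicStruct, false) => Emitted::WrapperMap,
    }
}

pub fn emitted(human_readable: bool, knows_magic_name: bool) -> Emitted {
    format_side(value_side(human_readable), knows_magic_name)
}
```

In this model a JSON-like format gets the bare value, a CBOR-like format gets its tag, and any other binary codec still gets the wrapper map. A YAML-like format is human-readable, so it gets the bare value even if it knew the magic name, which is exactly the open question.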