I'm trying to understand how Serde works. When reading about the Serde data model, I noticed that the model has a wide variety of different types, of which some seem overlapping, such as:
newtype_struct vs the corresponding unwrapped type
byte array vs a seq of u8
There may be a good reason why these should be treated differently (but I don't understand enough yet to reason about it).
However, I noticed that there is nothing that covers unordered collections (such as a "set"). I guess you could use a map with unit as the value, but given the above distinctions, that doesn't seem like the right thing to do. The Serde docs (see link to data model above) suggest that a HashSet simply uses seq.
Can someone explain to me why no set is needed? Note that some data formats, such as YAML, have a "set" type.
It's because when you serialize a collection (or indeed a value of any type), whether it is ordered or not, over the wire it must necessarily be ordered. So in the process you may as well get rid of some potential bugs and actually define the order, which in this case is being done by just serializing it as a seq.
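To illustrate the point about defining the order, here is a minimal std-only sketch. It does not use Serde; `encode_sorted` is a hypothetical helper, and sorting is just one possible way to pin the order down. (As far as I know, Serde's own impl for HashSet simply emits the elements in iteration order as a seq.)

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of Serde): writes a set as a
/// bracketed, comma-separated list with the elements in sorted order.
fn encode_sorted(set: &HashSet<u32>) -> String {
    // A HashSet has no meaningful order, so we pick one explicitly.
    let mut items: Vec<&u32> = set.iter().collect();
    items.sort();
    let parts: Vec<String> = items.iter().map(|n| n.to_string()).collect();
    format!("[{}]", parts.join(","))
}

fn main() {
    // Same set, built in a different insertion order.
    let a: HashSet<u32> = vec![3, 1, 2].into_iter().collect();
    let b: HashSet<u32> = vec![2, 3, 1].into_iter().collect();

    // Equal as sets, and equal on the wire once the order is defined.
    assert_eq!(a, b);
    assert_eq!(encode_sorted(&a), encode_sorted(&b));
    println!("{}", encode_sorted(&a));
}
```

Either way, whatever ends up in the output is a sequence; the only question is whether its order is defined or accidental.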
So it's because of the "serial" nature of serialization? Hmmm… if the data model resembles real-world representation over the serial channel, then I don't understand why unit_struct is needed. Both unit and unit_struct are represented with 0 bits, right? Yet there's a difference being made.
So ron makes a difference between unit and unit_struct, but I guess YAML would (want to) make a difference between seq and set? So I still feel like it's a bit of an odd (or arbitrary?) choice. Perhaps I'm not fully understanding the practical problems of non-ordered serialized output yet, though.
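A toy sketch of why a data model might keep unit and unit_struct apart even though both can take 0 bits: a self-describing text format can emit the struct's name, while a compact binary format can ignore it. Everything here (`Node`, `write_named`, `write_compact`) is hypothetical and only loosely modelled on how ron-style and binary formats behave; it is not Serde's API.

```rust
/// Hypothetical two-variant slice of a data model.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Node {
    Unit,
    UnitStruct(&'static str), // carries the type's name
}

/// A self-describing text writer can make use of the name.
fn write_named(node: Node) -> String {
    match node {
        Node::Unit => "()".to_string(),
        Node::UnitStruct(name) => name.to_string(),
    }
}

/// A compact binary writer has nothing to write in either case.
fn write_compact(_node: Node) -> Vec<u8> {
    Vec::new()
}

fn main() {
    let unit = Node::Unit;
    let marker = Node::UnitStruct("Marker");

    // The text format distinguishes the two...
    assert_ne!(write_named(unit), write_named(marker));
    // ...while the compact format writes zero bytes for both.
    assert_eq!(write_compact(unit), write_compact(marker));
}
```

So the distinction costs nothing for formats that don't care, and gives the ones that do something to work with. The same argument could be made for a set, which is the point being debated here.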
To me, unordered sets are one of the fundamental data types in computing (at least more relevant than custom unit types). Consider, for example, Python's set and frozenset.
I know that JSON doesn't know anything about sets and I think it's most idiomatic to use arrays for that (instead of objects with true values)? And the RON specification also doesn't include a set. But YAML does.
Well, YAML sets are arrays anyway (since the serialized data is fundamentally ordered), just with an overridden notion of "equality". I'd say this is a peculiarity of that specific format, not something that is widely applicable. Serde developers might have another opinion, of course.
If I understand it right, the advantage of supporting unordered sets in a serialization format is that when you deserialize into a dynamic type (such as serde_json's Value, assuming JSON supported unordered sets), you get proper behavior regarding equality. Not having a set forces the deserializer to add ordering information that doesn't exist in the original data.
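A small sketch of the equality argument, using a hypothetical dynamic `Value` type with an extra `Set` variant (serde_json's real Value has no such variant; this enum is made up for illustration):

```rust
use std::collections::BTreeSet;

/// Hypothetical dynamic value with both a seq and a set variant.
#[derive(Debug, PartialEq)]
enum Value {
    Seq(Vec<i64>),
    Set(BTreeSet<i64>),
}

fn seq(items: &[i64]) -> Value {
    Value::Seq(items.to_vec())
}

fn set(items: &[i64]) -> Value {
    Value::Set(items.iter().copied().collect())
}

fn main() {
    // As sequences, order matters for equality.
    assert_ne!(seq(&[1, 2, 3]), seq(&[3, 2, 1]));

    // As sets, neither order nor duplicates matter.
    assert_eq!(set(&[1, 2, 3]), set(&[3, 2, 1, 1]));
}
```

Without the `Set` variant, the two inputs in the last assertion could only be compared as sequences and would come out unequal.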
While I do think this is a logically consistent argument, I don't really see the real-life value of it. Sure, you'd do basically everything dynamically in Python, but not in Rust.
When using Serde, deserializing into a dynamic Value is usually either a last-resort escape hatch, or it comes up in applications that don't know/care about the structure and contents of the data (eg. transcoding), so they won't know/care about duplicates and equality comparisons either.
I guess I could say the same about unit_struct, but perhaps I'm just missing a good use-case here.
Regarding unit_struct vs set, one could argue that unit structs are rooted deeper in the Rust language than HashSets (which are part of std, and not the core language). And Serde is a Rust library after all. Still, I believe that serialization and deserialization should keep a variety of languages and concepts in mind because they are often used to exchange data between entirely different platforms, languages, and ecosystems.
Hence I feel like unit_struct has a lower significance than set.
Furthermore, if I understand it right, then serde_yaml isn't round-trip safe because of that choice (edit: but I didn't test that).
To me, things that can't be enforced in the encoding are better handled by the deserialization impls for particular types rather than by encoding them differently.
At serialization time, what matters is that it's a sequence of items. To me, it's a good thing that I can deserialize a HashSet as a Vec if I want, to see the actual order of things in the encoded data. Or if someone else sent me a Vec of things, I can read it as a HashSet if I don't want duplicates. Needing that distinction in the byte stream itself seems counterproductive.
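Roughly what I mean, as a std-only sketch: the same encoded sequence can be collected into whichever collection the reader wants. `decode_items` is a hypothetical stand-in for a format's seq decoding, not a Serde function.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for decoding a seq: yields the items in
/// exactly the order they appear in the encoded input.
fn decode_items(input: &str) -> impl Iterator<Item = u32> + '_ {
    input.split(',').filter_map(|s| s.trim().parse().ok())
}

fn main() {
    let wire = "3, 1, 3, 2";

    // Read it as a Vec to see the actual order (and duplicates).
    let as_vec: Vec<u32> = decode_items(wire).collect();
    assert_eq!(as_vec, vec![3, 1, 3, 2]);

    // Read the very same data as a HashSet to drop duplicates.
    let as_set: HashSet<u32> = decode_items(wire).collect();
    assert_eq!(as_set.len(), 3);
}
```

The choice between "ordered with duplicates" and "unordered without" is made by the target type, not by the encoding.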
That may be true, but it's a choice of the serialization format, and not the choice of the library or its data model. Different serialization formats might decide to distinguish or not distinguish here. Those that distinguish can't be round-trip safe with Serde (unless I miss something), even if they just use the most basic types such as collections and primitive types.
I don't want to argue that Serde should support set. What I want to say is that if Serde supports unit_struct, then it could also support set, as set might arguably be more relevant than unit_struct.
FWIW, I have at least started working on a parser for an (ugly little) language loosely inspired by ron... crimson, which does have sets... It doesn't have serde support, and I haven't really explored that, partially because of the limitations described in this thread...
Alas, crimson has hit some limitations of my yacc lsp, for which it has been the primary example project, because the yacc lsp can't currently lex things such as Rust format strings. So it likely needs to outgrow that for crimson to really proceed and actually be a useful project in its own right...
But it is a start... I am more than happy to have others weigh in on these sorts of issues over on the issue tracker if there is interest...