Why does Serde not have an unordered sequence (aka set)?

I'm trying to understand how Serde works. When reading about the Serde data model, I noticed that the model has a wide variety of types, some of which seem to overlap, such as:

  • unit vs unit_struct
  • newtype_struct vs the corresponding unwrapped type
  • byte array vs a seq of u8

There may be a good reason why these should be treated differently (but I don't understand enough yet to reason about it).

However, I noticed that there is nothing that covers unordered sequences (such as "set"). I guess you could use a map with unit as the value, but given the above distinctions, that doesn't seem like the right thing to do. The Serde docs (see link to data model above) suggest that a HashSet simply uses seq.

Can someone explain to me why there is no set needed? Note that some data formats, such as YAML have a "set" type.

The hint is here:

It's because when you serialize a collection (or indeed a value of any type), whether it is ordered or not, over the wire it must necessarily be ordered. So in the process you may as well get rid of some potential bugs and actually define the order, which in this case is being done by just serializing it as a seq.
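To illustrate the point, here is a minimal std-only sketch. The helper `canonical_seq` is a made-up name, and sorting is an assumption of this sketch: as far as I know, Serde simply emits a HashSet's elements in iteration order. Either way, the set ends up as an ordered sequence once it is written out.

```rust
use std::collections::HashSet;

/// Turn an unordered set into a sequence with a well-defined order.
/// Sorting is an assumption made for this sketch, not what Serde does.
fn canonical_seq(set: &HashSet<u32>) -> Vec<u32> {
    let mut items: Vec<u32> = set.iter().copied().collect();
    items.sort_unstable();
    items
}

fn main() {
    let a: HashSet<u32> = [1, 2, 3].iter().copied().collect();
    let b: HashSet<u32> = [3, 1, 2].iter().copied().collect();

    // The sets are equal, although their iteration order is unspecified.
    assert_eq!(a, b);

    // Once written out, both become the same ordered sequence.
    assert_eq!(canonical_seq(&a), vec![1, 2, 3]);
    assert_eq!(canonical_seq(&a), canonical_seq(&b));
}
```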

So it's because of the "serial" nature of serialization? Hmmm… :thinking: if the data model resembles real-world representation over the serial channel, then I don't understand why unit_struct is needed. Both unit and unit_struct are represented with 0 bits, right? Yet there's a difference being made.

For example ron makes a difference when the struct_names flag is enabled:

unit will write () while unit_struct will write StructName.

Hmmm, I see. Thank you for the example.

So ron makes a difference between unit and unit_struct, but I guess YAML would (want to) make a difference between seq and set? So I still feel like it's a bit of an odd (or arbitrary?) choice. Perhaps I'm not fully understanding the practical problems of non-ordered serialized output yet, though.

To me, unordered sets are one of the fundamental data types in computing (at least more relevant than custom unit types). Consider, for example, Python's set and frozenset.

I know that JSON doesn't know anything about sets and I think it's most idiomatic to use arrays for that (instead of objects with true values)? And the RON specification also doesn't include a set. But YAML does. :man_shrugging:

Well, YAML sets are arrays anyway (since the serialized data is fundamentally ordered), just with an overridden notion of "equality". I'd say this is a peculiarity of that specific format, not something that could be widely used. Serde developers might have another opinion, of course.

If I understand it right, the advantage of supporting unordered sets in a serialization format is that when you deserialize into a dynamic type (such as serde_json's Value, assuming JSON supported unordered sets), you get proper behavior regarding equality. Not having set forces adding (ordering) information where that information doesn't exist.

To give an example in Python, using PyYAML:

#!/usr/bin/env python3
from yaml import dump, load, Loader
print(load(dump([1,2,3]), Loader=Loader) == load(dump([2,3,1]), Loader=Loader))
print(load(dump({1,2,3}), Loader=Loader) == load(dump({2,3,1}), Loader=Loader))

gives:

False
True

How could this work with Serde?

Note that YAML is one of the supported serialization formats for Serde. I just checked serde_yaml::Value:

pub enum Value {
    Null,
    Bool(bool),
    Number(Number),
    String(String),
    Sequence(Sequence),
    Mapping(Mapping),
}

Set is simply missing here. :frowning_face:
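For comparison, here is a rough sketch of what a dynamic value type with a set variant could look like. `DynValue` and its helpers are hypothetical names for illustration only; this is not serde_yaml's Value, and I'm not claiming Serde could map onto it.

```rust
use std::collections::BTreeSet;

/// Hypothetical dynamic value with an extra `Set` variant.
/// This is NOT serde_yaml's `Value`; it only illustrates the idea.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum DynValue {
    Number(i64),
    Sequence(Vec<DynValue>),
    Set(BTreeSet<DynValue>),
}

/// Wrap plain integers as dynamic numbers.
fn numbers(items: &[i64]) -> Vec<DynValue> {
    items.iter().map(|&n| DynValue::Number(n)).collect()
}

/// Build an ordered sequence value.
fn seq(items: &[i64]) -> DynValue {
    DynValue::Sequence(numbers(items))
}

/// Build a set value; insertion order is discarded.
fn set(items: &[i64]) -> DynValue {
    DynValue::Set(numbers(items).into_iter().collect())
}

fn main() {
    // Sequences compare element by element, so order matters.
    assert_ne!(seq(&[1, 2, 3]), seq(&[2, 3, 1]));
    // Sets compare by membership, so order does not matter.
    assert_eq!(set(&[1, 2, 3]), set(&[2, 3, 1]));
}
```

This mirrors the False/True result of the PyYAML example, but only because the value type itself knows about sets.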

While I do think this is a logically consistent argument, I don't really see the real-life value of it. Sure, you'd do basically everything dynamically in Python, but not in Rust.

When using Serde, deserializing into a dynamic Value is usually either a last-resort escape hatch, or it comes up in applications that don't know/care about the structure and contents of the data (eg. transcoding), so they won't know/care about duplicates and equality comparisons either.

I guess I could say the same about unit_struct, but perhaps I'm just missing a good use-case here.

Regarding unit_struct vs set, one could argue that unit structs are rooted deeper in the Rust language than HashSets (which are part of std, and not the core language). And Serde is a Rust library after all. Still, I believe that serialization and deserialization should keep a variety of languages and concepts in mind because they are often used to exchange data between entirely different platforms, languages, and ecosystems.

Hence I feel like unit_struct has a lower significance than set.

Furthermore, if I understand it right, then serde_yaml isn't round-trip safe because of that choice (edit: but I didn't test that).

I just tested it:

Consider the following Python program:

#!/usr/bin/env python3
from yaml import dump
print(dump({"numbers": {1,2,3}}), end="")

gives:

numbers: !!set
  1: null
  2: null
  3: null

Let's feed that into and out of serde_yaml:

use serde_yaml as yaml;

fn main() {
    let input = "numbers: !!set\n  1: null\n  2: null\n  3: null\n";
    let doc: yaml::Value = yaml::from_str(input).unwrap();
    let output: String = yaml::to_string(&doc).unwrap();
    print!("Input:\n{input}\nDoc:\n{doc:?}\n\nOutput:\n{output}\n");
}

Output:

Input:
numbers: !!set
  1: null
  2: null
  3: null

Doc:
Mapping(Mapping { map: {String("numbers"): Mapping(Mapping { map: {Number(PosInt(1)): Null, Number(PosInt(2)): Null, Number(PosInt(3)): Null} })} })

Output:
---
numbers:
  1: ~
  2: ~
  3: ~


Here, the !!set marker is lost.

Let's try to re-import both the original input and the re-exported output in Python:

#!/usr/bin/env python3
from yaml import load, Loader
print("Original:")
print(
    load(
        """\
numbers: !!set
  1: null
  2: null
  3: null""",
        Loader=Loader
    )
)
print()
print("Re-exported by Rust's serde_yaml library:")
print(
    load(
        """\
---
numbers:
  1: ~
  2: ~
  3: ~""",
        Loader=Loader
    )
)

And this is what we get:

Original:
{'numbers': {1, 2, 3}}

Re-exported by Rust's serde_yaml library:
{'numbers': {1: None, 2: None, 3: None}}

:frowning_face:

I conclude that Serde's data model isn't sufficient to support YAML in a round-trip safe fashion, or am I missing something?

To me, things that can't be enforced in the encoding are better handled by the deserialization impls for particular types rather than by encoding them differently.

At serialization time, what matters is that it's a sequence of items. To me, it's a good thing that I can deserialize a HashSet as a Vec if I want, to see the actual order of things in the encoded data. Or if someone else sent me a Vec of things, I can read it as a HashSet if I don't want duplicates. Needing that distinction in the byte stream itself seems counterproductive.
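A small std-only sketch of that idea, with a plain slice standing in for the decoded seq (no real wire format or Serde call is involved, and the function names are made up). With Serde, the equivalent choice is made by picking Vec or HashSet as the target type when deserializing.

```rust
use std::collections::HashSet;

/// Read the items as a Vec: order and duplicates are kept.
fn read_as_vec(wire: &[u32]) -> Vec<u32> {
    wire.to_vec()
}

/// Read the same items as a HashSet: duplicates and order are dropped.
fn read_as_set(wire: &[u32]) -> HashSet<u32> {
    wire.iter().copied().collect()
}

fn main() {
    // Stand-in for a decoded `seq`.
    let wire = [3, 1, 3, 2];

    // The reader decides which view of the same data it wants.
    assert_eq!(read_as_vec(&wire), vec![3, 1, 3, 2]);
    assert_eq!(read_as_set(&wire).len(), 3);
}
```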

Protobuf (https://developers.google.com/protocol-buffers/docs/encoding#structure) does this well. The byte stream doesn't care whether it's a u32 or a u64, for example, because that doesn't matter to it.
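As a rough illustration of the Protobuf point, here is a hand-written base-128 varint encoder (a sketch, not the protobuf library's code; `encode_varint` is a name I made up). The same value produces the same bytes whether it started out as a u32 or a u64.

```rust
/// Encode an unsigned integer as a base-128 varint,
/// least significant 7-bit group first.
fn encode_varint(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            // Last group: continuation bit stays clear.
            out.push(byte);
            return out;
        }
        // More groups follow: set the continuation bit.
        out.push(byte | 0x80);
    }
}

fn main() {
    let small: u32 = 300;
    let wide: u64 = 300;

    // The width of the source type leaves no trace in the encoding.
    assert_eq!(encode_varint(u64::from(small)), encode_varint(wide));
    assert_eq!(encode_varint(wide), vec![0xAC, 0x02]);
}
```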

But many formats will not support unit_struct either:

#[derive(serde::Serialize)]
struct Unit;

fn main() {
    let doc1 = serde_json::to_string(&()).unwrap();
    let doc2 = serde_json::to_string(&Unit).unwrap();
    println!("{doc1}");
    println!("{doc2}");
}

Output:

null
null

That may be true, but it's a choice of the serialization format, and not a choice of the library or its data model. Different serialization formats might decide to distinguish or not distinguish here. Those that distinguish can't be round-trip safe with Serde (unless I'm missing something), even if they just use the most basic types such as collections and primitive types.

I don't want to argue that Serde should support set. What I want to say is that if Serde supports unit_struct, then it could also support set, as set might arguably be more relevant than unit_struct.

In indexmap, we ran into the flip side of this problem, that serde map won't necessarily preserve order in the serialization format:

https://github.com/bluss/indexmap/issues/156

I ended up adding a helper module to let people opt into strictly-ordered serialization for IndexMap, using #[serde(with = "indexmap::serde_seq")].
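The idea behind serializing as a seq can be sketched without indexmap or Serde. In this std-only example, BTreeMap is only a stand-in (my assumption) for a format or map type that does not keep insertion order, and `via_map` / `via_seq` are made-up helper names.

```rust
use std::collections::BTreeMap;

/// Round-trip entries through a map that does not keep insertion order.
/// `BTreeMap` stands in for such a map here: it re-sorts by key.
fn via_map(entries: &[(&str, i32)]) -> Vec<(String, i32)> {
    let map: BTreeMap<String, i32> = entries
        .iter()
        .map(|&(k, v)| (k.to_string(), v))
        .collect();
    map.into_iter().collect()
}

/// Round-trip entries as a sequence of (key, value) pairs; order survives.
fn via_seq(entries: &[(&str, i32)]) -> Vec<(String, i32)> {
    entries.iter().map(|&(k, v)| (k.to_string(), v)).collect()
}

fn main() {
    let entries = [("b", 1), ("a", 2)];

    // Through the map, the original insertion order is gone.
    assert_eq!(
        via_map(&entries),
        vec![("a".to_string(), 2), ("b".to_string(), 1)]
    );
    // Through the sequence of pairs, it is preserved.
    assert_eq!(
        via_seq(&entries),
        vec![("b".to_string(), 1), ("a".to_string(), 2)]
    );
}
```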

FWIW, I have at least started working on a parser for an (ugly little) language loosely inspired by ron, called crimson, which does have sets. It doesn't have Serde support, and I haven't really explored adding it, partly because of the limitations described in this thread.

There is a commented-out syntax for order-preserving trees, which I have been on the fence about including: Make a decison on order preserving key value collection · Issue #1 · ratmice/crimson · GitHub

Alas, crimson has hit some limitations of my yacc lsp, for which it has been the primary example project, because the yacc lsp can't currently lex things such as Rust format strings. So crimson likely needs to outgrow that to really proceed and become a useful project in its own right.

But it is a start... I am more than happy to have others weigh in on issues such as these sorts over on the issue tracker if there is interest...

How do I enable this struct_names flag? I can't find anything in the documentation.

With default options, the following just runs fine:

#[derive(serde::Serialize, serde::Deserialize, PartialEq, Eq, Debug)]
struct Unit;

fn main() {
    {
        let s: Unit = serde_json::from_str("null").unwrap();
        let u: () = serde_json::from_str("null").unwrap();
        assert_eq!(s, Unit);
        assert_eq!(u, ());
    }
    {
        let s: Unit = ron::from_str("()").unwrap();
        let u: () = ron::from_str("()").unwrap();
        assert_eq!(s, Unit);
        assert_eq!(u, ());
    }
}

struct_names is the last argument of the Serializer constructor. So you would do something like

let buf = Vec::new();
let mut s = Serializer::new(buf, None, /*struct_names*/ true)?;
value.serialize(&mut s)?;
Ok(String::from_utf8(s.output).expect("Ron should be utf-8"))

(adapted from the ron::ser::to_string function)
