Why are there 2 types for deserializing in serde?

I was trying to implement a Deserializer, but I am super confused by the serde way of deserializing things.
All deserialize_* methods take a value visitor, and each visitor is tied to one concrete type. Why are there deserialize methods if visitors are doing the work? Why doesn't deserialize_str simply try to deserialize the thing and return a result? What is the visitor doing there? What am I supposed to pass there?

First of all, serde's objective is to generate serialize/deserialize code that is no slower than hand-written, type- and format-specific code, while completely decoupling types from formats and providing a convenient #[derive(Serialize, Deserialize)] interface. The complexity of implementing a format is less of a concern because the number of formats is relatively small: the hard work of a few people can make the entire ecosystem fast.

A Visitor implementation is a type-side thing, not a format-side one. It doesn't know anything about formats, and the same visitor is used whether the data is JSON, YAML, TOML or even urlencoded. It's the Deserializer impl that performs the actual parsing and feeds the visitor the value it found. For a self-describing format, most of the deserialize_* methods can simply forward to deserialize_any (serde provides the forward_to_deserialize_any! macro for this), so they are more like optimization hints. For example, a JSON deserializer can fail fast if deserialize_seq is called but the data does not start with a [ character.
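
To make the "hint" idea concrete, here is a rough std-only sketch. It is a toy model, not serde's real API: the traits and the ToyJson and VecVisitor types are made up for illustration, and the real visit_seq receives a SeqAccess rather than a finished Vec. The toy deserialize_seq rejects input that does not start with [ before doing any other work.

```rust
// Toy model only: these names mirror serde's but are NOT serde's real traits.

/// Type-side: receives an already-parsed value, knows nothing about syntax.
trait Visitor {
    type Value;
    fn visit_i64(self, v: i64) -> Result<Self::Value, String>;
    fn visit_seq(self, items: Vec<i64>) -> Result<Self::Value, String>;
}

/// Format-side: a tiny JSON-like parser for inputs such as `42` or `[1,2,3]`.
struct ToyJson<'a> {
    input: &'a str,
}

impl<'a> ToyJson<'a> {
    /// No hint: inspect the input to decide which visit_* method to call.
    fn deserialize_any<V: Visitor>(self, visitor: V) -> Result<V::Value, String> {
        let s = self.input.trim();
        if s.starts_with('[') {
            self.deserialize_seq(visitor)
        } else {
            let n = s.parse::<i64>().map_err(|e| e.to_string())?;
            visitor.visit_i64(n)
        }
    }

    /// Hint: the caller expects a sequence, so anything else is rejected at once.
    fn deserialize_seq<V: Visitor>(self, visitor: V) -> Result<V::Value, String> {
        let s = self.input.trim();
        let inner = s
            .strip_prefix('[')
            .and_then(|rest| rest.strip_suffix(']'))
            .ok_or_else(|| format!("expected a sequence, found `{}`", s))?;
        let mut items = Vec::new();
        for part in inner.split(',').filter(|p| !p.trim().is_empty()) {
            items.push(part.trim().parse::<i64>().map_err(|e| e.to_string())?);
        }
        visitor.visit_seq(items)
    }
}

/// Builds a Vec<i64> and refuses anything that is not a sequence.
struct VecVisitor;

impl Visitor for VecVisitor {
    type Value = Vec<i64>;

    fn visit_i64(self, v: i64) -> Result<Vec<i64>, String> {
        Err(format!("expected a sequence, found integer {}", v))
    }

    fn visit_seq(self, items: Vec<i64>) -> Result<Vec<i64>, String> {
        Ok(items)
    }
}
```

With the hint, the format itself reports the mismatch. Without it, deserialize_any parses whatever is there and leaves it to the visitor to complain.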

It still makes no sense to me how one would ever deserialize a struct, since there's no visit_struct function. And again, since there's no visit_struct, what visitor do I pass to deserialize_struct?

Did you read the serde book? I think the custom serialization section will be helpful, or maybe the data format section.

Yes, I did read it, but it didn't make much sense to me. In my case I am trying to simplify a response structure from an API which has a lot of unnecessary nested items.
In the end I ended up creating a ton of local structs and operating on already deserialized data instead of doing something with the deserializer. It just seems close to impossible to do something with such complex data. All the examples use a single-level struct containing primitive values.

I just don't get why there are 2 different structures for deserializing something. If the visitor does the job, why do the deserialize methods even exist?

The point of this pattern is to make incorrect implementations nearly impossible, while preserving performance and extensibility.

You will have to try very hard to write an implementation of any of serde's traits that does something subtly wrong.

Can you please explain your statement? I'm failing to see how wrapping things inside each other helps to make something mistake-free.
If one is trying to decode a string field using an int decoder, I think that can be caught without wrapping the thing inside a visitor.
I would understand if there were something called deserialize_custom which required a custom visitor, but why does deserialize_int require a visitor? The input is simply a character stream, not a Rust type, so why this complication?

To expand a bit on @RustyYato's notes about the serde data model: serde has sort of a two-phase process to deserialization. One part deals with generic data formats, and the other deals with specific data structures.

The job of the Deserializer is to deal with the "general-purpose" data format. It parses input and converts it into the basic serde data model. It's kind of like if you took all of your inputs in whatever format (TOML, CBOR, etc.) and converted them all to JSON so that the rest of your code only had to know about JSON. In this case, though, instead of JSON, you have the serde data model, which is basically just a custom in-memory data format that happens to look kind of like a cross between JSON and basic Rust types. The whole point of this phase is to handle everything that's specific to any given data format and compartmentalize it so no other code has to care about it. Note that I'm talking about generic data formats like JSON or XML, not specific schemas for those formats (such as a specific JSON-based RPC format that defines specific object types that you'll need to handle). The schema is handled on the "other" side of serde.

Once the Deserializer translates the input data to serde's data model, it passes that to the visitor. (Sort of. As I understand it, for formats that aren't self-describing, the "phase 2" code has to tell the Deserializer what it's supposed to expect before it does the conversion, so the "phase 2" code drives the Deserializer and actively asks it for the "next value" instead of just passively receiving a bunch of converted data. But that's not very important when it comes to understanding the basic concept.) The Deserialize trait then pulls the genericized data out of the visitor and tries to map it to a Rust data structure. It only cares about the serde data model, so it doesn't have to worry (for the most part) about format-specific details such as how to escape special characters in strings or whether the format stores integers as text or binary. The Deserializer has already handled that part. This is the phase that basically performs the task of your hypothetical deserialize_struct method.

This model does have its flaws. It's more complex than deserializing in one step, and some formats have features that don't map well into the serde data model. However, it turns out that the majority of general-purpose data interchange formats make use of the same concepts, and the two-phase model makes it possible to deserialize the same data structure from a wide variety of underlying formats.

It's not just wrapping things inside each other, and it's not just about avoiding mistakes.

There's really no simpler way you could possibly perform deserialization generically. The thing is, generic (de)serialization depends on both the data format and the concrete type being deserialized. But the documentation of Serde already explains this.

Types implementing the Deserializer trait deal only with the data format. They read and parse it and map it to the primitive types of the Serde data model, such as integers and strings. They have no knowledge whatsoever about the concrete, high-level, strong types (e.g. structs) being deserialized.

This is good because then the programmer who invents or implements a data format can write a completely generic deserializer, without having to worry about all the concrete higher-level types it might ever be used for. Accounting for every such type would be not only inconvenient, I'd say it's outright impossible. So we do need this generic approach and this level of indirection. The point is that a deserializer defines its format in terms of primitive values only, and thus it can account for encountering any of them, in advance. So, a Deserializer won't ever care about creating a MyCustomType, it will only look at its own format and call visit_str() or visit_u8() or visit_f32() on any visitor it is passed whatsoever.

This is why a deserializer passes these primitive types to Visitors. The Visitor types are used for mapping these low-level types to the user-defined, high-level types. Here's the symmetry: a Visitor has no idea which data format is driving it. A particular Visitor is only ever concerned with mapping an already-existing primitive type (wherever it comes from) to the one specific high-level type it is required to build. For example, an OptionVisitor (which is defined inside Serde) accepts a null or any other value, and builds an Option::None or an Option::Some, respectively. It has no idea if the null comes from JSON or CBOR or XML or MyAwesomeFormat. And it won't ever produce a String or a Vec<i32> or a HashSet<MyCustomType>, only an Option.
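
That symmetry can be sketched with std alone. This is a toy model, not serde's real traits: the Visitor and Deserializer traits here are simplified stand-ins, and TextFormat, BinaryFormat and ToyOptionVisitor are hypothetical names. One visitor builds an Option<i64> and is driven unchanged by two unrelated formats.

```rust
// Toy model only: simplified stand-ins, NOT serde's real traits.

/// Type-side: turns primitive events into one specific Rust type.
trait Visitor {
    type Value;
    fn visit_none(self) -> Result<Self::Value, String>;
    fn visit_i64(self, v: i64) -> Result<Self::Value, String>;
}

/// Format-side: parses input and reports what it found to any visitor.
trait Deserializer {
    fn deserialize_any<V: Visitor>(self, visitor: V) -> Result<V::Value, String>;
}

/// Text format: `null` or a decimal integer.
struct TextFormat<'a>(&'a str);

impl<'a> Deserializer for TextFormat<'a> {
    fn deserialize_any<V: Visitor>(self, visitor: V) -> Result<V::Value, String> {
        match self.0.trim() {
            "null" => visitor.visit_none(),
            other => {
                let n = other
                    .parse::<i64>()
                    .map_err(|e| format!("bad integer: {}", e))?;
                visitor.visit_i64(n)
            }
        }
    }
}

/// Binary format: tag byte 0 = none, tag byte 1 + 8 little-endian bytes = integer.
struct BinaryFormat<'a>(&'a [u8]);

impl<'a> Deserializer for BinaryFormat<'a> {
    fn deserialize_any<V: Visitor>(self, visitor: V) -> Result<V::Value, String> {
        match self.0 {
            [0] => visitor.visit_none(),
            [1, rest @ ..] => {
                if rest.len() != 8 {
                    return Err("expected 8 bytes after the tag".to_string());
                }
                let mut bytes = [0u8; 8];
                bytes.copy_from_slice(rest);
                visitor.visit_i64(i64::from_le_bytes(bytes))
            }
            _ => Err("unknown tag".to_string()),
        }
    }
}

/// Builds an Option<i64> without knowing which format is driving it.
struct ToyOptionVisitor;

impl Visitor for ToyOptionVisitor {
    type Value = Option<i64>;

    fn visit_none(self) -> Result<Option<i64>, String> {
        Ok(None)
    }

    fn visit_i64(self, v: i64) -> Result<Option<i64>, String> {
        Ok(Some(v))
    }
}
```

Adding a third format means writing one more Deserializer impl; the visitor does not change. Adding a new target type means writing one more Visitor; the formats do not change.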

You are saying the programmer is supposed to have no knowledge of the structure, but then why am I, as a programmer, supposed to implement visitors on JSON in this case? And if I am manually parsing JSON from just a string, then what good does serde do on its own? If only there were some examples of how to parse something real, it would make sense. Basically, what I understand is that I need to implement the entire serde_json myself to get a small custom deserialization done. Is that correct?

Certainly not, that's exactly one of the advantages of decoupling the data format from the data structures to be deserialized.

I don't understand what you mean by "implement visitors on JSON". You don't implement visitors on a data format. You implement a visitor that produces a specific high-level type. Your visitor will have no knowledge of where the data comes from. If you are using a serde-compatible JSON library, e.g. serde_json, that means that it will do the JSON parsing to a tree of primitive types, and then you can map those primitive types to your custom type.

It is also unclear what you mean by "manually parsing JSON". To me, that would mean writing a parser that eats the data character-by-character and takes care of the low-level syntactical details such as unescaping strings and matching parentheses. I don't think you need to do that because that's what serde_json does.

In itself, serde neither parses any particular data format nor does it perform any particular mapping of primitives to your custom types (it does provide mappings for built-in primitive and standard library types, though). It merely provides the glue abstraction between data formats and data structures.

Furthermore, the serde_derive proc-macro also helps you make your custom types implement the mapping to and from primitives by allowing you to write #[derive(Serialize, Deserialize)], but you can implement those manually if you wish, as I already explained. Then you could mix and match your types to any supported data format by using the appropriate library for the format, in the case of JSON this would be serde_json.


Please, study the documentation and the book more – you have already been given plenty of different and quite thorough pieces of explanation here. If you aren't comfortable with handling several levels of abstraction, maybe you should get some more experience with programming and/or software engineering concepts in general. This forum is not an adequate place for that purpose, however, you'll have to do that learning and research yourself.

Maybe you can share what you are wanting to do? I don't know what "a small custom deserialization" means, and I expect you'll get more satisfactory answers if we know what you're trying to accomplish.

I am implementing some private API which has nested fields in places, and I need those fields to be top-level. What I want to do is basically the same thing as Json.Decode.Decoder.and_then of the Elm language, which takes a Decoder a and a function fn (a:T)->Decoder B and returns a Decoder B. This is exactly what I want to achieve.

Am I right in understanding then that what you want to decode is valid JSON?

Sounds like you might just need to derive Deserialize and use the flatten attribute. Or you could write your own Deserialize impl that calls another type's deserialize and applies your function to the output. Result might even have an and_then you could use?

edit: I mistook and_then for map, but I'm posting the suggestion anyway.

Ok, so let's take a look at the Implementing Deserialize section in the book... and it's indeed not helpful. It tells you how to write a visitor, but it doesn't tell you that for simple cases you don't have to do that. Let's take a look at the deserialize function signature:

impl<'de> Deserialize<'de> for B {
    fn deserialize<D>(deserializer: D) -> Result<B, D::Error>
    where
        D: Deserializer<'de>,
    { ... }
}

What's important here is that you:

  • need to return a Result containing a B
  • have access to a deserializer.

So, if you have some type A that's already deserializable (e.g. a struct that more closely resembles the original data format, including the nested fields you want to get rid of), you can just call its deserialize function:

impl<'de> Deserialize<'de> for B {
    fn deserialize<D>(deserializer: D) -> Result<B, D::Error>
    where
        D: Deserializer<'de>,
    {
        let a = A::deserialize(deserializer)?;
        Ok(convert_a_to_b(a))
    }
}
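
The conversion step itself is plain Rust. A minimal sketch, assuming a made-up response shape: the Meta and Wrapper structs and all field names are hypothetical, and A would carry #[derive(Deserialize)] in real code (left out here so the snippet compiles with std alone).

```rust
// Hypothetical shapes: `A` mirrors the nested wire format, `B` is the flat
// type the application actually wants. With serde, `A`, `Wrapper` and `Meta`
// would derive Deserialize; the derive is omitted to keep this std-only.
#[derive(Debug)]
struct Meta {
    id: u64,
}

#[derive(Debug)]
struct Wrapper {
    meta: Meta,
    name: String,
}

#[derive(Debug)]
struct A {
    data: Wrapper,
}

#[derive(Debug, PartialEq)]
struct B {
    id: u64,
    name: String,
}

impl From<A> for B {
    /// Pull the nested fields up to the top level.
    fn from(a: A) -> B {
        B {
            id: a.data.meta.id,
            name: a.data.name,
        }
    }
}

/// The helper called from the manual Deserialize impl above.
fn convert_a_to_b(a: A) -> B {
    B::from(a)
}
```

If the conversion is written as a From impl like this, serde's #[serde(from = "A")] container attribute on B should let the derive generate the same two-step deserialize, so the manual impl may not be needed at all.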

Using a visitor is helpful when you don't want to create this intermediate A representation and want to parse in a more "streaming" mode.


Ok, now I've looked more into Elm's function signature and turns out I've misunderstood your case and my suggestion is not really what you want. It seems that you indeed have to dive into visitors. Also, not sure if it's easy to reuse a visitor from a generated deserialize method. You can always use a temporary json, though (but that's an ugly hack):

impl<'de> Deserialize<'de> for B {
    fn deserialize<D>(deserializer: D) -> Result<B, D::Error>
    where
        D: Deserializer<'de>,
    {
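        use serde::de::Error; // brings D::Error::custom into scope

        // Buffer the whole input as a generic value tree.
        let json = serde_json::Value::deserialize(deserializer)?;
        // serde_json's error type is not D::Error, so convert it explicitly.
        let a: A = serde_json::from_value(json.clone()).map_err(D::Error::custom)?;
        // some logic based on a, perhaps transform the json,
        // and then deserialize using some other deserializer.
        // `BRepr` is a hypothetical derived helper type with a From<BRepr> for B
        // impl: decoding `B` itself here would call this impl again and recurse.
        let repr: BRepr = serde_json::from_value(json).map_err(D::Error::custom)?;
        Ok(B::from(repr))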
    }
}

That is exactly what I needed to do: I created an intermediate struct, deserialized the data into that, and then returned my actual Self from deserialize. As you mention, it seems close to impossible to do this in "streaming mode". What I don't like about my approach is that it is highly coupled to serde_json, which means that if the input is not JSON it will simply fail.

Considering Elm's andThen, I would suggest deserializing into serde_json::Value and transforming that using "normal" application logic instead of implementing Deserialize, as this is exactly what Elm does:

Parse the given string into a JSON value and then run the Decoder on it.

which is obviously limited to JSON as well. This way there is no implicit coupling and you do not need to consider the intricacies of Serde at all.

Also consider that this cannot work in "streaming mode" in general, as andThen is used to select a decoder based on a previously decoded value. As I understand your problem, this would mean selecting a decoder for the other fields based on the value of some field F, but for JSON and many other formats there is no ordering for the fields and hence no guarantee that the value of F is parsed before the other fields whose structure depends on F. This is also why it seems so hard to implement this using visitors: the visitor is driven by the deserializer and, for a struct, cannot make any assumptions about the order in which the fields will be handed to it, so a construct like andThen is not just hard but actually impossible without keeping an intermediate representation around.
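
A small std-only sketch of why the buffering is needed. Everything here is hypothetical: the Value enum is a toy stand-in for a generic tree such as serde_json::Value, and the kind and payload field names are invented. Fields arrive in whatever order the input had them, so they are collected first and only then does the value of kind pick the decoder for payload.

```rust
use std::collections::HashMap;

/// Toy stand-in for a generic value tree such as serde_json::Value.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
}

/// The final type; which variant is built depends on the `kind` field.
#[derive(Debug, PartialEq)]
enum Event {
    Count(i64),
    Message(String),
}

/// Decode fields that may arrive in any order. The hypothetical `kind` field
/// decides how `payload` is read, so every field is buffered first.
fn decode_event(fields: Vec<(String, Value)>) -> Result<Event, String> {
    // Intermediate representation: nothing is interpreted until all fields are in.
    let mut buffered: HashMap<String, Value> = fields.into_iter().collect();

    let kind = match buffered.remove("kind") {
        Some(Value::Text(k)) => k,
        _ => return Err("missing or non-text `kind`".to_string()),
    };
    let payload = buffered
        .remove("payload")
        .ok_or_else(|| "missing `payload`".to_string())?;

    // The andThen step: choose the decoder based on an already decoded value.
    match (kind.as_str(), payload) {
        ("count", Value::Int(n)) => Ok(Event::Count(n)),
        ("message", Value::Text(s)) => Ok(Event::Message(s)),
        (k, p) => Err(format!("payload {:?} does not match kind `{}`", p, k)),
    }
}
```

Because of the buffering, the result is the same whether payload comes before or after kind. A purely streaming visitor that saw payload first would have to decode it before knowing what it is.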