Basic custom serde logic for serialize and deserialize

Hi all, I've been reading about serde... I want to be able to write custom serialize/deserialize functions, actually for two things: network packets and binary files.

I have read the serde docs, but I'm still confused and can't quite understand how it is implemented and how it works... maybe I'm also conceptually confused about what serde does and what it does not do (what needs to be implemented elsewhere).

For example, I don't fully understand the concepts of visitors and parsing, and how they relate to serde.

Network packets and binary files are similar in several respects. For instance, one u8 can be split into several sections (bools, numbers, etc.). Reading two u8 could even yield 4 bools, 1 u8 and 4 bools, which makes 16 bits of data, where the u8 value is assembled from half of each byte.
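To make that layout concrete, here is a minimal std-only sketch of unpacking two bytes into 4 flags, one straddling 8-bit value and 4 more flags. The `Packed` struct, its field names and the most-significant-bit-first ordering are assumptions made up for illustration; a real format would specify its own bit order.

```rust
/// Hypothetical layout for two packed bytes (most significant bit first):
/// 4 flag bits | 8-bit value straddling both bytes | 4 flag bits.
#[derive(Debug, PartialEq)]
struct Packed {
    head_flags: [bool; 4],
    value: u8,
    tail_flags: [bool; 4],
}

/// Returns bit `n` of `nibble` as a bool (0 = least significant bit).
fn bit(nibble: u8, n: u8) -> bool {
    (nibble >> n) & 1 == 1
}

fn unpack(bytes: [u8; 2]) -> Packed {
    let head = bytes[0] >> 4; // upper nibble of the first byte
    let tail = bytes[1] & 0x0F; // lower nibble of the second byte
    // The lower nibble of byte 0 becomes the upper half of the value,
    // the upper nibble of byte 1 becomes the lower half.
    let value = ((bytes[0] & 0x0F) << 4) | (bytes[1] >> 4);
    Packed {
        head_flags: [bit(head, 3), bit(head, 2), bit(head, 1), bit(head, 0)],
        value,
        tail_flags: [bit(tail, 3), bit(tail, 2), bit(tail, 1), bit(tail, 0)],
    }
}
```

Nothing here involves serde yet: this is the "parsing" part that a data format (or a hand-written decoder) has to own.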

Or, in some network packets, the length of the data is written in the packet itself, so you can't read the data without first extracting the length.
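A length-prefixed packet of that kind can be sketched in a few lines of std-only Rust. The framing below (a single length byte followed by the payload) and the name `read_packet` are assumptions chosen only for illustration, not any particular protocol.

```rust
/// Reads one packet whose first byte is the payload length (a hypothetical
/// framing). Returns the payload and the unread remainder of the input,
/// or `None` if the input is shorter than the length byte claims.
fn read_packet(input: &[u8]) -> Option<(&[u8], &[u8])> {
    // The length has to be extracted before the payload can be located.
    let (&len, rest) = input.split_first()?;
    let len = usize::from(len);
    if rest.len() < len {
        return None;
    }
    Some(rest.split_at(len))
}
```

For example, `read_packet(&[3, 10, 20, 30, 99])` yields the payload `[10, 20, 30]` and leaves `[99]` for the next read.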

Right now I would appreciate it if someone could explain the parts conceptually and how they are related, ideally with a basic example that describes each part clearly.

Thx!

Serde has plenty of example code related to implementing both a data format and serializable custom types. Specifically what do you not understand?

I have read the docs and the examples, but I don't understand the concept of how it works.

Why does serde not parse (as written in the docs)? So what does serde do?
What is a visitor?
How does serde find out how to serialize or deserialize?
How are visitors connected with serialize/deserialize?
If there is a Deserialize trait, why do the docs say visitors deserialize the data?
How do I deserialize complex data like the network packets I described above?

These are just some of my questions. As you may notice, the main issue is that each concept is not clear, at least to me, even after reading the docs.

Most of the examples work with JSON (which is already implemented) or use functions that already exist... it's hard to find a simpler case that explains things.

Take this: Implementing Deserialize · Serde

If we follow it, the function deserialize_i32 does not exist there, but visit_i32 does. The visitor seems to use the naming visit_inputformat, while deserialize seems to use deserialize_outputformat.

But is it really called that way, or is there another way? Naming like that could cause problems if, for example, there are crates with similar names. How would you call them?

Take the two examples above: reading two u8 into 4 bools + 1 u8 + 4 bools, or needing already-deserialized info in order to read the rest of the data. How can that be handled in serde? Conceptually, should it be done in deserialize? In the visitor? Maybe in something related to parsing?

To some extent the docs explain some parts, but there are a lot of things I can't find. You can find the visitor definition, but making sense of it in the whole picture and in other cases has been pretty hard for me with the docs.

Sorry, but I think you'll need to focus on a more narrowly scoped question and try to articulate it a bit more clearly. You're asking a lot of questions, and for all of them it's a bit unclear exactly what you mean.

I know you said you've read the docs, but in case you've missed it, the serde book is going to be good for a top-level overview of the concepts, and taking a look at the serde-json crate and bincode crate will give you some illustrative examples of what serde can do.

Serde basically acts as an M:N compatibility layer between (de)serializable Rust types (various structs, enums, etc.) and serialization formats (JSON, raw bytes of various sorts, etc.). To do this it establishes an abstract model of serialized data that's intended to abstract over most serialization formats. However, rather than ever constructing this intermediate form explicitly, it takes the form of a generic system for establishing callbacks through visitors and such. This is very performant, but also conceptually complicated, so it makes sense that you would be confused by it if you tried to dive in without familiarizing yourself with the basics.

If you investigate more and ask a question asking for an explanation of some more specific aspect of it you'll probably have more luck with people explaining it to you.

I will say, though, that if you are trying to deserialize some pre-existing, highly specific binary format, e.g. JPEG files or raw TCP packets, serde is probably not going to work very well for that, and that's not really what serde is for. However, for things like "I need to load this JSON file" or "I have these Rust structs and I need to send them over the network in binary", serde works well.

If you want a deeper understanding of how Serde works, I recommend watching this video from Jon Gjengset: https://www.youtube.com/watch?v=BI_bHCGRgMY.

thx @moy2010 I'll watch the video :3 it's long, so it will have to be after work.

@gretchenfrage thx for the clarifications. As you say, diving in without the basics is hard. Still, following the text, it's not very clear what serde does, and at the same time what the basics are doesn't seem to be well defined. I suppose some of the basics are outside the serde book, or at least it assumes some previous knowledge, which is not a bad thing, but it's hard if we don't know which knowledge.

Why would JPEG/TCP packets not work well while JSON would? What is the crucial difference? Also, why is reading a binary packet outside serde's scope, while going from Rust to binary would work? What sets the limit?

As a note, I have used serde with no problems on the usability side, but understanding it deeply enough to write a custom serialize/deserialize has been the challenge.

Hm that's a good question. Well, serde tries to work both for formats that are self-describing and for formats that are not self-describing. So for example, we can think about 4 possible combinations:

  1. Format is JSON (self-describing), the JSON data is 1, and the Rust data type is i8 (statically knows what JSON type it needs to deserialize):

    • serde_json::from_str::<i8> calls <i8 as serde::Deserialize>::deserialize with an implementation of Deserializer from serde-json which wraps around the input string plus some additional context
      • i8 calls deserializer.deserialize_i8 with an implementation of Visitor that it defines
        • serde-json's deserializer calls a visit method on that visitor with the integer 1 because that is what's in the JSON (conceptually visitor.visit_i8; as far as I know, serde_json actually reports a non-negative integer through visit_u64 and lets the visitor narrow it)
          • i8's visitor returns the same 1 value that was passed into it, and this gets returned all the way back to the beginning of the call stack

    Now, in this case, if you tried to call serde_json::from_str with the Rust type i8 but the JSON string true, either the deserializer or the visitor would return an Err, because i8 needs some integer type whereas true is a JSON bool.

  2. Now let's say the format is still JSON and the JSON data is 1, but the Rust data type is serde_json::Value, which is designed to be able to represent any JSON value. It would happen similarly to above except:

    • serde_json::Value would call deserializer.deserialize_any, because it doesn't know what data type it's supposed to be, so it's hoping the deserializer knows
      • Since JSON is self-describing, it does know that 1 is an integer, so serde-json's deserializer calls visitor.visit_i64 perhaps
        • serde_json::Value's visitor returns serde_json::Value::Number(1)

    Now, if you called this with the Rust type serde_json::Value but the JSON string true, instead of erroring this time, JSON's deserializer would call visitor.visit_bool, and the visitor would return serde_json::Value::Bool(true)

  3. Now, let's say the format is bincode (which is a very unsurprising, non-self-describing binary format that basically just puts out the raw bytes). If you called bincode::deserialize::<i8>(&[0x01]), it would happen basically like the first JSON scenario: i8 calls deserialize_i8 because it knows to expect an i8, which tells bincode to interpret the 0x01 byte as an i8.

    On the other hand, if you called bincode::deserialize with the same binary data (a single 0x01 byte) but with the Rust data type bool instead of i8, it would call deserializer.deserialize_bool and bincode would interpret the 0x01 byte as the boolean true. Since bincode is not self-describing, the same data can be interpreted in multiple different ways, so the Rust data type must hint to the deserializer what data type it has.

  4. Finally, let's say you called bincode::deserialize with the Rust data type of serde_json::Value. serde_json::Value would call deserialize_any on bincode's deserializer, and bincode would return an Err saying basically, "bincode is not self-describing so you can't call deserialize_any; you need to tell bincode a specific type or it will error".

So serde works if either the data format is self-describing or the Rust type statically knows what it expects (or both, as in the first scenario). It only fails when neither is true, as in the fourth.
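The four scenarios can be reproduced with a std-only toy model. To be clear, these are not serde's real traits: `Text`, `Raw`, `Any` and the `Res` alias are invented here, the traits only borrow serde's names, and real serde has many more methods plus lifetimes and a proper error trait.

```rust
type Res<T> = Result<T, String>;

/// Callback interface: the format reports what it actually found.
trait Visitor: Sized {
    type Value;
    fn visit_i8(self, _v: i8) -> Res<Self::Value> { Err("unexpected integer".into()) }
    fn visit_bool(self, _v: bool) -> Res<Self::Value> { Err("unexpected bool".into()) }
}

/// Hint interface: the Rust type tells the format what it expects.
trait Deserializer: Sized {
    fn deserialize_any<V: Visitor>(self, v: V) -> Res<V::Value>;
    fn deserialize_i8<V: Visitor>(self, v: V) -> Res<V::Value>;
    fn deserialize_bool<V: Visitor>(self, v: V) -> Res<V::Value>;
}

/// Implemented by every type that can be built from any format.
trait Deserialize: Sized {
    fn deserialize<D: Deserializer>(d: D) -> Res<Self>;
}

/// Toy self-describing text format: "true", "false", or an integer.
struct Text<'a>(&'a str);

impl Deserializer for Text<'_> {
    fn deserialize_any<V: Visitor>(self, v: V) -> Res<V::Value> {
        match self.0 {
            "true" => v.visit_bool(true),
            "false" => v.visit_bool(false),
            s => v.visit_i8(s.parse::<i8>().map_err(|e| e.to_string())?),
        }
    }
    // The input says what it is, so the hint can be ignored.
    fn deserialize_i8<V: Visitor>(self, v: V) -> Res<V::Value> { self.deserialize_any(v) }
    fn deserialize_bool<V: Visitor>(self, v: V) -> Res<V::Value> { self.deserialize_any(v) }
}

/// Toy non-self-describing binary format: one raw byte, no type tag.
struct Raw(u8);

impl Deserializer for Raw {
    fn deserialize_any<V: Visitor>(self, _v: V) -> Res<V::Value> {
        Err("Raw is not self-describing; a type hint is required".into())
    }
    // The hint decides how the same byte is interpreted.
    fn deserialize_i8<V: Visitor>(self, v: V) -> Res<V::Value> { v.visit_i8(self.0 as i8) }
    fn deserialize_bool<V: Visitor>(self, v: V) -> Res<V::Value> { v.visit_bool(self.0 != 0) }
}

impl Deserialize for i8 {
    fn deserialize<D: Deserializer>(d: D) -> Res<Self> {
        struct I8Visitor;
        impl Visitor for I8Visitor {
            type Value = i8;
            fn visit_i8(self, v: i8) -> Res<i8> { Ok(v) }
        }
        d.deserialize_i8(I8Visitor) // hint: "I expect an i8"
    }
}

impl Deserialize for bool {
    fn deserialize<D: Deserializer>(d: D) -> Res<Self> {
        struct BoolVisitor;
        impl Visitor for BoolVisitor {
            type Value = bool;
            fn visit_bool(self, v: bool) -> Res<bool> { Ok(v) }
        }
        d.deserialize_bool(BoolVisitor) // hint: "I expect a bool"
    }
}

/// Stand-in for a dynamically typed value such as `serde_json::Value`.
#[derive(Debug, PartialEq)]
enum Any { Int(i8), Bool(bool) }

impl Deserialize for Any {
    fn deserialize<D: Deserializer>(d: D) -> Res<Self> {
        struct AnyVisitor;
        impl Visitor for AnyVisitor {
            type Value = Any;
            fn visit_i8(self, v: i8) -> Res<Any> { Ok(Any::Int(v)) }
            fn visit_bool(self, v: bool) -> Res<Any> { Ok(Any::Bool(v)) }
        }
        d.deserialize_any(AnyVisitor) // no hint: "tell me what you see"
    }
}
```

With this model, `<i8 as Deserialize>::deserialize(Text("1"))` and `<Any as Deserialize>::deserialize(Text("true"))` both succeed, `Raw(1)` can be read as either `i8` or `bool`, and `<Any as Deserialize>::deserialize(Raw(1))` is the one combination that returns an error.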

I know that's not super directly an answer to your question but hopefully it makes some things more clear overall?

Hi! thx for answering! Things are starting to get better :3

Here are some questions about the examples:

Imagine you have two crates with the same struct name, A::S and B::S, and you pass a u8 to a visitor. Following that logic, if we want A::S or B::S the names will overlap as visitor_S, so how can we create the visitors for A::S and B::S?

For now I assume (correct me if I'm wrong) that deserialize works on the whole object, and visitors then handle the specific input/output conversions. That would mean that to deserialize a u8 into 8 bools, we would need to write a deserialize function that reads each bit and passes each one to visit_bool. Is this right? And which deserialize should we implement, and how?

Following a similar example, if we want to deserialize a u8 into 8 bools, we could instead produce a Vec<bool>. This opens two questions. First, how do deserialize and visitors work for this vector? Is there a special deserialize/visitor for it? Naming also seems harder there. Second, who specifies how to disassemble the u8? It depends on the data: a bool could be the full u8, or one bit, or a group of 4 bits, so something must say how to treat it. I imagine the visitor only takes the extracted bits and converts them.

Thx!

Your questions have three main answers:

  • The final values created depend entirely on the data structure's implementation of Deserialize (normally by using derive configured by attributes); there's no "global" state anywhere
  • How containers work is logically the same as for any primitive type. I cover this a bit below, but it mostly just uses fancier callbacks to allow recursing back into the data format.
  • How bits and bytes are treated is entirely defined by the data format, serde doesn't require that there even are bytes or characters, only something you can get data from. The most obvious example of this is serde_json::from_value(), which can map a generic Value into your structures.

To rephrase earlier replies a bit, serde has an abstract data model used to communicate between data structures (De/Serialize implementors) and data formats (De/Serializer implementations), though they exist in the code only as methods on the serde traits.

The deserialize_{type} methods on Deserializer are the data structure telling the data format which data model type it expects (or any if it doesn't know), and the Visitor is the callback interface that lets the data format tell the data structure which type was actually present in the data. Data model container values like map, tuple and seq further use Access traits like MapAccess to give the structure some control about how it receives the values.
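The container case can be sketched with a std-only toy as well. This is not serde's API: `Bytes` is an invented length-prefixed format with inherent methods instead of a `Deserializer` trait, and `SeqAccess` here only imitates the role of serde's trait of the same name, letting the visitor pull elements one at a time.

```rust
use std::marker::PhantomData;

/// Toy format: a cursor over raw bytes; sequences are prefixed by a length byte.
struct Bytes<'a> { input: &'a [u8] }

/// Handed to the visitor so it can pull elements one at a time.
struct SeqAccess<'s, 'a> { de: &'s mut Bytes<'a>, remaining: usize }

trait Visitor: Sized {
    type Value;
    fn visit_u8(self, _v: u8) -> Result<Self::Value, String> { Err("unexpected u8".into()) }
    fn visit_seq(self, _s: SeqAccess<'_, '_>) -> Result<Self::Value, String> {
        Err("unexpected sequence".into())
    }
}

trait Deserialize: Sized {
    fn deserialize(de: &mut Bytes<'_>) -> Result<Self, String>;
}

impl<'a> Bytes<'a> {
    fn next_byte(&mut self) -> Result<u8, String> {
        let (&b, rest) = self.input.split_first().ok_or("unexpected end of input")?;
        self.input = rest;
        Ok(b)
    }
    fn deserialize_u8<V: Visitor>(&mut self, v: V) -> Result<V::Value, String> {
        let b = self.next_byte()?;
        v.visit_u8(b)
    }
    fn deserialize_seq<V: Visitor>(&mut self, v: V) -> Result<V::Value, String> {
        // The format, not the visitor, knows how a sequence is delimited.
        let remaining = usize::from(self.next_byte()?);
        v.visit_seq(SeqAccess { de: self, remaining })
    }
}

impl SeqAccess<'_, '_> {
    /// Recurses back into the format for each element type `T`.
    fn next_element<T: Deserialize>(&mut self) -> Result<Option<T>, String> {
        if self.remaining == 0 {
            return Ok(None);
        }
        self.remaining -= 1;
        T::deserialize(self.de).map(Some)
    }
}

impl Deserialize for u8 {
    fn deserialize(de: &mut Bytes<'_>) -> Result<Self, String> {
        struct U8Visitor;
        impl Visitor for U8Visitor {
            type Value = u8;
            fn visit_u8(self, v: u8) -> Result<u8, String> { Ok(v) }
        }
        de.deserialize_u8(U8Visitor)
    }
}

impl<T: Deserialize> Deserialize for Vec<T> {
    fn deserialize(de: &mut Bytes<'_>) -> Result<Self, String> {
        struct VecVisitor<T>(PhantomData<T>);
        impl<T: Deserialize> Visitor for VecVisitor<T> {
            type Value = Vec<T>;
            fn visit_seq(self, mut seq: SeqAccess<'_, '_>) -> Result<Vec<T>, String> {
                let mut out: Vec<T> = Vec::new();
                while let Some(item) = seq.next_element()? {
                    out.push(item);
                }
                Ok(out)
            }
        }
        de.deserialize_seq(VecVisitor(PhantomData))
    }
}
```

Because the `Vec<T>` impl only asks for "the next element", the same code handles `Vec<u8>` and `Vec<Vec<u8>>`; how bytes are split up stays entirely inside `Bytes`.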

(This is from memory, I might have messed up a detail or two, sorry!)

Serde is generic glue code that mediates between:

  • the actual parser for a data format (such as JSON) that doesn't know anything about your types, and
  • your types, that don't know anything about any particular data format.

It does so by translating from/into a generic "value tree" representation.

Hmm, so the workflow would be like:

  • deserialize will have the parser
  • the parser will split the data into raw sections
  • each section will be sent to the right visitor to be translated to the right type

Is this right?

And, sorry, it was not clear to me how to write deserialize and visitors for two structs with the same name in different crates/mods. deserialize_{type} seems to conflict with that.

All this would work fine with basic types, but what about more complex structures? For example, the output is Vec<S> while the input is an even number of u8 values (an odd number throws an error). Each S is constructed from 12 bits, so the parser could read the input 12 bits at a time, and the visitor would transform each chunk into the right type.

... But at the same time, if we look at

we can see the visitor does not keep the naming conventions... There is also another point about that example: if Payload has more than the vector, say a new field x, the parser has to decide how much of the input data goes to values and how much to x. But where and how is that defined?

First of all, I'm not at all sure I understand your questions completely. In technical matters, precision is paramount, so I'd strongly advise you to find someone who speaks fluent English and your native language to help you translate your exact questions into English.

That looks like a method name and not like a visitor's type name, so I don't know what this means.

The shape of the serialized data must match the shape of the statically-typed structure it is being deserialized into. A struct will tell the deserializer that it expects a map of key-value pairs, a Vec will expect an array/list, a primitive integer will expect a single integer. Containers then apply this to their elements recursively.

The parser for a data format can then either be "self-describing" or schemaless, e.g. JSON, where there are explicit delimiters around lists, maps, and strings (etc.), or a length before such dynamically-sized types, so arbitrary dynamic data can be recognized on its own.

Alternatively, it can be schemaful, e.g. Bincode, where the data is just a bunch of bytes, and you always need the static types to make sense of the serialized bytes.

Hi, in very, very general terms, yes, but how do I apply everything described there? Who picks which function? It's hard to write the right questions when the overall map is still not well described. What I notice is lacking right now in the serde docs is functional definitions and relations.

And sorry about my way of writing; sometimes I pack too many things into a single sentence.

deserialize_{type} also implies visit_{type}; in theory one should call the other.

In this case, instead of asking multiple questions, let's start with a single one and an example.

Deserialize and serialize [120_u8, 200_u8, 8_u8, 16_u8]

The rules are simple: the input is a vector, the struct has two fields, and one struct is filled from every 16 bits, which means every two values. The first field is taken from the first 6 bits, the second from the remaining 10 bits.

So, where would the parser be? What would deserialize do? What would the visitor do?

struct Sector {
    x: u8, // first 6 bits
    y: u8, // remaining 10 bits (a u16 would be needed to hold every 10-bit value)
}

// Deserialize to Vec<Sector>

The idea here is not exactly to "get a solution"; it would be useful if the solution shows how deserialize, deserialize_{type}, visit_{type} and the parser work and interact with each other.
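For reference, the bit layout itself can be decoded and re-encoded in plain Rust before any serde machinery is involved. This is only a sketch under assumptions that the post does not state: each pair of bytes is read big-endian with the 6-bit field in the most significant bits, and `y` is widened to `u16` so that every 10-bit value fits. The names `parse_sectors` and `write_sectors` are made up.

```rust
#[derive(Debug, PartialEq)]
struct Sector {
    x: u8,  // 6 bits
    y: u16, // 10 bits (widened from the u8 in the question, an assumption)
}

/// Decodes every two input bytes into one `Sector`.
fn parse_sectors(input: &[u8]) -> Result<Vec<Sector>, String> {
    if input.len() % 2 != 0 {
        return Err(format!("expected an even number of bytes, got {}", input.len()));
    }
    Ok(input
        .chunks_exact(2)
        .map(|pair| {
            // Assumed byte order: the first byte holds the high bits.
            let word = u16::from_be_bytes([pair[0], pair[1]]);
            Sector {
                x: (word >> 10) as u8, // top 6 bits
                y: word & 0x03FF,      // low 10 bits
            }
        })
        .collect())
}

/// Inverse of `parse_sectors`; bits outside the 6/10-bit ranges are dropped.
fn write_sectors(sectors: &[Sector]) -> Vec<u8> {
    sectors
        .iter()
        .flat_map(|s| {
            let word = (u16::from(s.x & 0x3F) << 10) | (s.y & 0x03FF);
            word.to_be_bytes()
        })
        .collect()
}
```

Under those assumptions, `[120, 200, 8, 16]` decodes to `Sector { x: 30, y: 200 }` and `Sector { x: 2, y: 16 }`. In serde terms this logic is "data format" work: it would live in a custom Deserializer, or be done up front as here, rather than in a visitor.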

I think the thing you're missing is that every use of deserialize is given the type it's deserializing to. Deserialize isn't "registering" a name anywhere; it's the interface that is used to create a value of the implementing type.


Concretely, note the type parameter T and its bound in

It's hard but not impossible to work with bit-level formats in Serde. For the particular example format you presented, I'd write it like this.

As for this question:

Visitors may be implemented for any number of custom types. The deserialize_{type} methods are on the Deserializer trait, which means they are a fixed set and you can't add your own method to Deserializer-the-trait. (And you don't need to. You are probably misunderstanding something else, too.)

Incidentally, I think your confusion may be related to this detail. Why are you so focused on the name of the visitor? It simply doesn't matter. You don't need to name your visitors in any particular way (although calling them SomethingVisitor is customary to some extent).

The Deserialize impl of a type instantiates and passes a Visitor to the Deserializer. You can construct an instance of a type independent of its name. Often, code generated en masse by a macro will just call the visitor type a Visitor (and hide it deep in a nested module or even in the body of the Deserialize::deserialize() method itself). It has no significance at all.
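A std-only toy model can show why the visitor's name is irrelevant, even for two types that are both called `S`. These are simplified stand-ins, not serde's real traits; `OneByte`, `from_byte` and the two `S` types are invented for illustration.

```rust
trait Visitor: Sized {
    type Value;
    fn visit_u8(self, v: u8) -> Result<Self::Value, String>;
}

trait Deserializer: Sized {
    fn deserialize_u8<V: Visitor>(self, visitor: V) -> Result<V::Value, String>;
}

trait Deserialize: Sized {
    fn deserialize<D: Deserializer>(d: D) -> Result<Self, String>;
}

/// Toy format holding a single byte.
struct OneByte(u8);

impl Deserializer for OneByte {
    fn deserialize_u8<V: Visitor>(self, visitor: V) -> Result<V::Value, String> {
        visitor.visit_u8(self.0)
    }
}

mod a {
    use super::{Deserialize, Deserializer};

    #[derive(Debug, PartialEq)]
    pub struct S(pub u8);

    impl Deserialize for S {
        fn deserialize<D: Deserializer>(d: D) -> Result<Self, String> {
            // Private to this function body, so the name cannot clash.
            struct Visitor;
            impl super::Visitor for Visitor {
                type Value = S;
                fn visit_u8(self, v: u8) -> Result<S, String> {
                    Ok(S(v))
                }
            }
            d.deserialize_u8(Visitor)
        }
    }
}

mod b {
    use super::{Deserialize, Deserializer};

    #[derive(Debug, PartialEq)]
    pub struct S {
        pub doubled: u16,
    }

    impl Deserialize for S {
        fn deserialize<D: Deserializer>(d: D) -> Result<Self, String> {
            // Same visitor name as in `a`, still no conflict.
            struct Visitor;
            impl super::Visitor for Visitor {
                type Value = S;
                fn visit_u8(self, v: u8) -> Result<S, String> {
                    Ok(S { doubled: u16::from(v) * 2 })
                }
            }
            d.deserialize_u8(Visitor)
        }
    }
}

/// Generic entry point: the caller's type parameter selects the impl.
fn from_byte<T: Deserialize>(byte: u8) -> Result<T, String> {
    T::deserialize(OneByte(byte))
}
```

Calling `from_byte::<a::S>(7)` and `from_byte::<b::S>(7)` runs two different Deserialize impls on the same input. The format only ever sees "a visitor that expects a u8" and never learns the name of either struct.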
