How to learn serde internals?

How can we learn more about serde's Deserialize(r) and Serialize(r) stuff? How can we learn how serde uses them and how other things use them and what we can do with them?

We wanna write very cursed code using Deserialize(r) and Serialize(r). But we need to understand them in-depth first.

(Also, somewhat unrelated, has anyone made an async equivalent of serde?)

Have you gone through the examples on the serde website? They're pretty good jumping-off points for getting started on writing custom serialization code as well as custom data format code.

https://serde.rs/custom-serialization.html

Yes, but we're not sure how to jump from that to wrapping/MITMing Serialize(r)'s and Deserialize(r)'s and Visitors and SerializeWhatevers and WhateversAccess and whatnot.

I think this is an XY problem: instead of asking how to understand serde internals, perhaps you can tell us what you want to do once you understand them.

I wrote something like poor-mans-serde from scratch. There was not any async involved. It was just one giant hideous procedural macro.

some things are too complex to ask for help directly.

tho being able to run multiple serdes in the same thread would be useful for cursed hacks.

Why would async help serde? Does serde now make filesystem / network IO calls?

so you can concurrently drive multiple serdes (from the same input). it's a weird use-case but it'd be useful for some stuff.

What does serde block on that makes async useful?

let's say you have a Deserializer, and two complex Deserialize. you want both Deserialize to get the exact same Deserializer and one of them drives the Deserializer while the other has to just accept it somehow.

you just can't really do this without async.

Can you give a concrete example? What does this even mean? How does async serde help with this?

it lets you yield one Deserialize and go drive the other Deserialize.

think Lua coroutines.

the main thing is being able to seamlessly "shift" the deserialization to the other Deserialize when meeting an IgnoredAny.

e.g. let's say you're using datafu and you wanna combine predicates, and the predicates are actually just Deserialize. so you have some :$str:$url stuff going on. so what you wanna do is call Deserialize on String while also calling Deserialize on Url. and whatever you get from String you feed into Url, unless you get an IgnoredAny - in which case you ask Url what it wants from the underlying Deserializer, instead of asking the underlying deserializer for an IgnoredAny.

What you get from this is that it only nests as deep as you need it to, instead of allocating a bunch of space for potentially malicious input. It means the Deserialize's drive the deserialization, instead of the Deserializer.

If all Deserialize's ask for IgnoredAny, then you do drive the underlying Deserializer with an IgnoredAny, and everyone is happy and you're not blowing out the native call stack with exponentially many wrappers, because you can properly switch between them.

How does async help with this? It sounds like you need something that can 'invert control logic' of the deserialization process, and give you two continuations, one for continuing / one for early abort. Are you trying to abuse async as a continuation?

Async is a continuation.

A real continuation would not require constant stack depth.
Async requires constant stack depth.

Therefore, to be able to use 'async serde deserialize' in this way, someone would have already had to rewrite 'serde deserialize' in a way that only uses constant stack depth. At that point, you could just use that API directly, and it's once again not clear what async buys you.

you can write it straightforwardly, like you'd write sync serde, but then the code goes to do something else at runtime, completely transparently to you. well, you're aware it's happening because of all the .await but you also don't need to do anything special to care because it's written in just the same way as any other async fn.
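to make the "yield one and go drive the other" idea a bit more concrete, here's a rough std-only sketch. it is a toy model, not serde and not an existing async serde: Entry, Slot, NextEntry, want_two and drive_both are all made-up names, and map entries are just pairs of strings. each consumer is written as a plain async fn that awaits entries, and a tiny hand-rolled driver feeds the same entry to both consumers in turn.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Toy stand-in for one parsed (key, value) map entry; not a serde type.
type Entry = (&'static str, &'static str);
// One-entry mailbox that the driver fills before polling a consumer.
type Slot = Rc<RefCell<Option<Entry>>>;
type Pair = (String, String);

// Future that resolves once the driver has put an entry into the slot.
struct NextEntry(Slot);

impl Future for NextEntry {
    type Output = Entry;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Entry> {
        let entry = self.0.borrow_mut().take();
        match entry {
            Some(entry) => Poll::Ready(entry),
            None => Poll::Pending, // nothing yet: yield back to the driver
        }
    }
}

// A consumer written like ordinary sequential code: it keeps asking for
// entries until it has seen both keys it cares about.
async fn want_two(slot: Slot, a: &'static str, b: &'static str) -> Pair {
    let (mut x, mut y) = (None, None);
    loop {
        let (k, v) = NextEntry(slot.clone()).await;
        if k == a {
            x = Some(v.to_string());
        } else if k == b {
            y = Some(v.to_string());
        }
        if let (Some(x), Some(y)) = (&x, &y) {
            return (x.clone(), y.clone());
        }
    }
}

// Waker that does nothing; the driver below polls explicitly instead.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Feed every entry of a single-pass input to both consumers, alternating.
fn drive_both(input: &[Entry]) -> (Option<Pair>, Option<Pair>) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let (slot_a, slot_b) = (Slot::default(), Slot::default());
    let mut fut_a = Box::pin(want_two(slot_a.clone(), "foo", "bar"));
    let mut fut_b = Box::pin(want_two(slot_b.clone(), "bar", "baz"));
    let (mut out_a, mut out_b) = (None, None);
    for &entry in input {
        // Only poll consumers that have not finished yet.
        if out_a.is_none() {
            *slot_a.borrow_mut() = Some(entry);
            if let Poll::Ready(v) = fut_a.as_mut().poll(&mut cx) {
                out_a = Some(v);
            }
        }
        if out_b.is_none() {
            *slot_b.borrow_mut() = Some(entry);
            if let Poll::Ready(v) = fut_b.as_mut().poll(&mut cx) {
                out_b = Some(v);
            }
        }
    }
    (out_a, out_b)
}
```

calling drive_both on the entries foo, bar, baz lets both consumers finish from one pass over the input. the real thing would have to do this for nested values and for every serde entry point, which this sketch doesn't attempt.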

anyway do you see the whole "this is more complex than a help thread should handle" thing?

consider 2 structs, Foo and Bar, like so:

struct Foo {
  foo: String,
  bar: String,
}
struct Bar {
  bar: String,
  baz: String,
}

and, by default, unknown fields are allowed by serde. anyway, one thing that's important to note here is that serde has the IgnoredAny optimization, so we can't just deserialize Foo and then deserialize Bar, because that eats data.
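a tiny std-only model of why "deserialize Foo, then deserialize Bar" eats data. this is not serde code: the single-pass iterator stands in for a streaming deserializer, and read_two and naive are hypothetical helpers that behave roughly like a derived visitor (keep the known keys, throw away the rest).

```rust
// Toy consumer: walks the whole single-pass map, keeps the two keys it
// knows about and discards everything else (the "IgnoredAny" case).
fn read_two<'a>(
    entries: &mut impl Iterator<Item = (&'a str, &'a str)>,
    a: &str,
    b: &str,
) -> Option<(String, String)> {
    let (mut x, mut y) = (None, None);
    for (k, v) in entries {
        if k == a {
            x = Some(v.to_string());
        } else if k == b {
            y = Some(v.to_string());
        }
        // any other key is dropped here and cannot be recovered later
    }
    Some((x?, y?))
}

// Run the Foo-like consumer, then the Bar-like consumer, on the same stream.
fn naive() -> (Option<(String, String)>, Option<(String, String)>) {
    let mut stream = vec![("foo", "1"), ("bar", "2"), ("baz", "3")].into_iter();
    let foo = read_two(&mut stream, "foo", "bar");
    // The stream is already exhausted, so this one finds nothing.
    let bar = read_two(&mut stream, "bar", "baz");
    (foo, bar)
}
```

the first consumer gets its two fields, the second one gets None, because "baz" was thrown away and "bar" was already consumed.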

let's say we want

struct FooBar {
  foo: String,
  bar: String,
  baz: String,
}

and let's say we have {"foo": "hello", "bar": "hello", "baz": "hello"}. let's trace (simplified) what happens if we try to do it the naive way, but checking for the types Foo and Bar instead of FooBar (this isn't how you'd do it in practice, but there are other use-cases that depend on this feature):

1.  let x: FooBar = datafu::Pattern::compile(
        "(:$Foo:$Bar->'foo')(->'bar')(->'baz')", ...
    ).deserialize(json)?;
2.  FooBar::deserialize(datafu::Deserializer)
3.  datafu::Deserializer::deserialize_struct (this just starts the whole thing)
4.  Predicate (Foo)
5.  Foo::deserialize(datafu::Deserializer)
6.  datafu::Deserializer::deserialize_struct(FooVisitor)
7.  json::Deserializer::deserialize_struct(datafu::Visitor)
8.  datafu::Visitor::visit_map(json::MapAccess)
9.  FooVisitor::visit_map(datafu::MapAccess)
10. now here's where it gets exciting because MapAccess is pull-based.
    so our only option is to deserialize this one struct and store its results.
    we can't go back to "another" MapAccess to yield from it.

one "obvious" solution is to only allow one predicate, but then you lose a bunch of flexibility you'd otherwise have. for example, maybe you're using a URL library which supports multiple representations for deserializing URLs, but you want to enforce strings? you could instead do :$str:$url and it'd require it to be a string, instead of another representation. this ability to combine predicates is very much a must-have.

in the case of JSON, luckily we can just intercept IgnoredAny/deserialize_ignored_any and forward it to deserialize_any, and get all the results that way, but we want this to be as general as we can make it.
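the interception idea, sketched as a std-only toy. none of this is serde's or datafu's actual API: Source is a made-up pull-based trait loosely playing the role of MapAccess, skip_value is the analogue of asking for an IgnoredAny, and Capture is a hypothetical wrapper that turns every skip into a real read and records what went past.

```rust
// Toy pull-based map source; a hypothetical stand-in for a MapAccess.
trait Source {
    fn next_key(&mut self) -> Option<String>;
    // Materialize the value belonging to the key just returned.
    fn read_value(&mut self) -> String;
    // Discard the value without building it (the IgnoredAny analogue).
    fn skip_value(&mut self);
}

// Single-pass source backed by a vector of entries.
struct Stream {
    entries: std::vec::IntoIter<(&'static str, &'static str)>,
    pending: Option<&'static str>,
}

impl Source for Stream {
    fn next_key(&mut self) -> Option<String> {
        let (k, v) = self.entries.next()?;
        self.pending = Some(v);
        Some(k.to_string())
    }
    fn read_value(&mut self) -> String {
        self.pending.take().expect("read_value before next_key").to_string()
    }
    fn skip_value(&mut self) {
        self.pending = None;
    }
}

// Wrapper that records every entry and never lets the inner source skip.
struct Capture<S> {
    inner: S,
    key: Option<String>,
    seen: Vec<(String, String)>,
}

impl<S: Source> Source for Capture<S> {
    fn next_key(&mut self) -> Option<String> {
        self.key = self.inner.next_key();
        self.key.clone()
    }
    fn read_value(&mut self) -> String {
        let v = self.inner.read_value();
        if let Some(k) = self.key.take() {
            self.seen.push((k, v.clone()));
        }
        v
    }
    fn skip_value(&mut self) {
        // Intercept the skip: read (and record) the value anyway.
        self.read_value();
    }
}

// Consumer that wants two keys and skips the rest, like a derived visitor.
fn read_two(src: &mut impl Source, a: &str, b: &str) -> Option<(String, String)> {
    let (mut x, mut y) = (None, None);
    while let Some(k) = src.next_key() {
        if k == a {
            x = Some(src.read_value());
        } else if k == b {
            y = Some(src.read_value());
        } else {
            src.skip_value();
        }
    }
    Some((x?, y?))
}

// Run the Foo-like consumer through the wrapper and return what was captured.
fn demo() -> (Option<(String, String)>, Vec<(String, String)>) {
    let entries = vec![("foo", "1"), ("bar", "2"), ("baz", "3")];
    let stream = Stream { entries: entries.into_iter(), pending: None };
    let mut cap = Capture { inner: stream, key: None, seen: Vec::new() };
    let foo = read_two(&mut cap, "foo", "bar");
    (foo, cap.seen)
}
```

after the Foo-like consumer runs, cap.seen still holds the "baz" entry it asked to skip, so a second consumer could be fed from the recorded entries. the cost is that everything gets materialized, which is exactly the allocation the coroutine-style approach is trying to avoid.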

okay we figured out what we want. it'll be available Soon™.