> Why is it necessary to check for duplicate fields?
"Necessary" is a strong word. It is my choice as the author of Foo
if I want to forbid duplicate fields just as it's my choice if I want to require the x
to be a u32
. There is a difference between invalid JSON/TOML/etc. and valid JSON/TOML/etc. but an invalid "map" based on my "schema".
A downside of a `Deserializer` enforcing that duplicate "keys" don't exist is performance. Such an implementation must internally store all of the keys in some collection, which adds overhead; so, as long as it's documented, I wouldn't be surprised if there exist libraries that parse a certain format without enforcing the uniqueness of each "key" even if <insert RFC/spec> states "keys" MUST be unique. At the very least, it's reasonable for such a library to offer an alternative `Deserializer` that doesn't enforce uniqueness in the interest of performance and simply state that it's on the `Deserialize` impl to enforce it.
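To illustrate the overhead, here is a generic sketch (not any particular library's internals): detecting a repeated key requires remembering every key seen so far.

```rust
use std::collections::HashSet;

/// A sketch of the bookkeeping a uniqueness-enforcing parser must do:
/// every key encountered so far has to be kept around just to detect a
/// repeat, even if the caller never needed that guarantee.
fn check_unique<'a>(keys: impl IntoIterator<Item = &'a str>) -> Result<(), String> {
    let mut seen = HashSet::new();
    for key in keys {
        // `insert` returns `false` if the key was already present.
        if !seen.insert(key) {
            return Err(format!("duplicate key: {key}"));
        }
    }
    Ok(())
}

fn main() {
    assert!(check_unique(["x", "y"]).is_ok());
    assert!(check_unique(["x", "x"]).is_err());
}
```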
Fortunately, `derive(Deserialize)` automatically forbids duplicate fields. If you don't want that, then you have to implement `Deserialize` manually yourself and "hope" that calling code uses a format that doesn't forbid duplicate fields, uses a library that doesn't error when duplicates do exist (if allowed by the spec), and that the library either stores all instances (allowing your code to choose the "correct" one) or makes the choice you want.
For example, JSON allows duplicate keys in JSON objects; so without checking for duplicates myself, I would have to decide "arbitrarily" which one to keep (or perhaps store all instances of the key). As the author of `Foo`, I don't want that though, so I ensure there are no duplicates even if the format says it's OK. Additionally, there is likely a performance benefit for calling code to use JSON since the parser shouldn't have to store much state. Of course, I'm assuming the JSON parser is implemented in a performant way, which includes not storing each key (e.g., `serde_json`). Obviously nothing stops one from implementing their own JSON parser in a non-performant way that either enforces uniqueness of each key (despite the spec allowing duplicates) or allows duplicates but arbitrarily picks which one or even stores all of them. Such implementations will obviously have additional overhead, spatially and temporally, compared to something like `serde_json`. Also, implementations that arbitrarily choose which key to use actually prevent me from erroring when a duplicate key exists since those parsers are "lossy" and don't expose such info.
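To see both halves of that concretely, here is a small sketch (assuming `serde_json` and `serde` with its `derive` feature): `serde_json` itself leaves the duplicate-key policy to the `Deserialize` impl, while the impl that `derive(Deserialize)` generates rejects duplicates.

```rust
use std::collections::HashMap;

fn main() {
    let json = r#"{"x": 1, "x": 2, "y": true}"#;

    // `serde_json` does not reject the duplicate "x" on its own: deserializing
    // into a plain map silently keeps the last occurrence.
    let map: HashMap<String, serde_json::Value> = serde_json::from_str(json).unwrap();
    assert_eq!(map["x"], serde_json::Value::from(2));

    // The impl generated by `derive(Deserialize)`, however, errors.
    #[derive(serde::Deserialize)]
    #[allow(dead_code)]
    struct Foo {
        x: u32,
        y: bool,
    }
    let err = serde_json::from_str::<Foo>(json).unwrap_err();
    assert!(err.to_string().contains("duplicate field `x`"));
}
```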
If TOML is used, however, then duplicates are forbidden by the spec; so my checks are unnecessary since an error will occur anyway, so long as the library (e.g., `toml`) actually enforces it. The error will be "different" in that it will be an actual TOML error and not just a "data"/"schema" error. Calling code that decides to use TOML will have to accept the performance hit of the `Deserializer` storing each and every key internally.
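For instance (a tiny sketch assuming the `toml` crate), a duplicate never even reaches a `Deserialize` impl:

```rust
fn main() {
    // The `toml` crate enforces the spec's uniqueness rule at parse time,
    // so the duplicate below is reported as a TOML error, not a "schema" error.
    let err = toml::from_str::<toml::Table>("x = 1\nx = 2").unwrap_err();
    // The exact message is the crate's business; it points at the duplicate key.
    println!("{err}");
}
```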
Since `Deserialize` impls are format-agnostic, I can't "assume" the `Deserializer` will enforce unique keys, just like I can't "assume" it'll forbid/allow "unknown" keys. In the special case where I know the format and the library/parser for said format, then of course I could specialize my implementation around that. For example, if `Foo` implements `FromStr` such that it parses the passed `&str` as TOML where internally it uses `toml`, then `Foo` could certainly implement `Deserialize` such that it doesn't check for duplicates since we know `toml` already does that. Of course, it would be better to create a private newtype around `Foo` that does this implementation, since you want to "force" TOML; otherwise calling code could use `Foo::deserialize` for any format instead of being "forced" to use `Foo::from_str`.
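Something along these lines, as a sketch (`TomlFoo` is my own name for the private newtype, and the field handling is filled in minimally):

```rust
use core::{
    fmt::{self, Formatter},
    str::FromStr,
};
use serde::de::{Deserialize, Deserializer, Error, MapAccess, Visitor};

pub struct Foo {
    pub x: u32,
    pub y: bool,
}

// Private newtype: only `Foo::from_str` can reach this impl, so the input is
// guaranteed to be TOML, and the duplicate-key checks can be omitted since
// `toml` has already rejected duplicates before the visitor ever runs.
struct TomlFoo(Foo);

impl<'de> Deserialize<'de> for TomlFoo {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        struct TomlFooVisitor;
        impl<'de> Visitor<'de> for TomlFooVisitor {
            type Value = TomlFoo;
            fn expecting(&self, f: &mut Formatter<'_>) -> fmt::Result {
                f.write_str("a table with keys `x` and `y`")
            }
            fn visit_map<A>(self, mut map: A) -> Result<TomlFoo, A::Error>
            where
                A: MapAccess<'de>,
            {
                let (mut x, mut y) = (None, None);
                while let Some(key) = map.next_key::<String>()? {
                    match key.as_str() {
                        // No "was it already `Some`?" checks here, on purpose.
                        "x" => x = Some(map.next_value()?),
                        "y" => y = Some(map.next_value()?),
                        _ => return Err(Error::unknown_field(&key, FIELDS)),
                    }
                }
                Ok(TomlFoo(Foo {
                    x: x.ok_or_else(|| Error::missing_field("x"))?,
                    y: y.ok_or_else(|| Error::missing_field("y"))?,
                }))
            }
        }
        const FIELDS: &[&str] = &["x", "y"];
        deserializer.deserialize_struct("TomlFoo", FIELDS, TomlFooVisitor)
    }
}

impl FromStr for Foo {
    type Err = toml::de::Error;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        toml::from_str::<TomlFoo>(s).map(|wrapper| wrapper.0)
    }
}
```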
Addendum
Because one may incorrectly assume "map" implies "key" uniqueness, it stands to reason one may also assume a missing "key" is the same as a "key" with a "null" value for those formats that have a notion of "null" (e.g., JSON). Unsurprisingly, that's not the case.
In particular, the current impl will produce a "type" error if `x` is "null" and a "missing field" error if `x` doesn't exist. The below impls would be used for different behavior:
- Allow a missing `x` but forbid "null":

  ```rust
  pub struct Foo {
      pub x: Option<u32>,
      pub y: bool,
  }

  impl<'de> Deserialize<'de> for Foo {
      // ⋮
      y.ok_or_else(|| Error::missing_field(Y))
          .map(|y| Foo { x, y })
      // ⋮
  }
  ```

- Allow "null" but forbid a missing `x`:

  ```rust
  pub struct Foo {
      pub x: Option<u32>,
      pub y: bool,
  }

  impl<'de> Deserialize<'de> for Foo {
      // ⋮
      x = map.next_value::<Option<_>>().map(Some)?;
      // ⋮
      x.ok_or_else(|| Error::missing_field(X)).and_then(|x| {
          y.ok_or_else(|| Error::missing_field(Y))
              .map(|y| Foo { x: x.flatten(), y })
      })
      // ⋮
  }
  ```

- Allow both a missing `x` and "null" and treat them the same:

  ```rust
  pub struct Foo {
      pub x: Option<u32>,
      pub y: bool,
  }

  impl<'de> Deserialize<'de> for Foo {
      // ⋮
      y.ok_or_else(|| Error::missing_field(Y))
          .map(|y| Foo { x: x.flatten(), y })
      // ⋮
  }
  ```

- Allow both a missing `x` and "null" but distinguish between them (specifically `None` iff `x` does not exist, `Some(None)` iff `x` exists and is "null", and `Some(Some(_))` iff `x` exists and is not "null"):

  ```rust
  pub struct Foo {
      pub x: Option<Option<u32>>,
      pub y: bool,
  }

  impl<'de> Deserialize<'de> for Foo {
      // ⋮
      y.ok_or_else(|| Error::missing_field(Y))
          .map(|y| Foo { x, y })
      // ⋮
  }
  ```
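To see the last variant end to end, here is a self-contained rendering of it (my own filling-in of the elided parts, with string literals in place of the `X`/`Y` constants), assuming `serde_json`:

```rust
use core::fmt::{self, Formatter};
use serde::de::{Deserialize, Deserializer, Error, MapAccess, Visitor};

pub struct Foo {
    pub x: Option<Option<u32>>,
    pub y: bool,
}

impl<'de> Deserialize<'de> for Foo {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        struct FooVisitor;
        impl<'de> Visitor<'de> for FooVisitor {
            type Value = Foo;
            fn expecting(&self, formatter: &mut Formatter<'_>) -> fmt::Result {
                formatter.write_str("a map with keys `x` and `y`")
            }
            fn visit_map<A>(self, mut map: A) -> Result<Foo, A::Error>
            where
                A: MapAccess<'de>,
            {
                let (mut x, mut y) = (None, None);
                while let Some(key) = map.next_key::<String>()? {
                    match key.as_str() {
                        "x" => {
                            if x.is_some() {
                                return Err(Error::duplicate_field("x"));
                            }
                            // `next_value::<Option<u32>>()` maps "null" to
                            // `None`, so wrapping the result in `Some`
                            // distinguishes `"x": null` (`Some(None)`) from
                            // a missing `x` (`None`).
                            x = Some(map.next_value::<Option<u32>>()?);
                        }
                        "y" => {
                            if y.is_some() {
                                return Err(Error::duplicate_field("y"));
                            }
                            y = Some(map.next_value()?);
                        }
                        _ => return Err(Error::unknown_field(&key, FIELDS)),
                    }
                }
                y.ok_or_else(|| Error::missing_field("y"))
                    .map(|y| Foo { x, y })
            }
        }
        const FIELDS: &[&str] = &["x", "y"];
        deserializer.deserialize_struct("Foo", FIELDS, FooVisitor)
    }
}

fn main() {
    for (json, expected) in [
        (r#"{"y": true}"#, None),                  // `x` missing
        (r#"{"x": null, "y": true}"#, Some(None)), // `x` is "null"
        (r#"{"x": 2, "y": true}"#, Some(Some(2))), // `x` is not "null"
    ] {
        let foo: Foo = serde_json::from_str(json).unwrap();
        assert_eq!(foo.x, expected);
    }
}
```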
Addendum 2
I should also warn that there are some "gotchas" that probably should be documented. For example, `Visitor::visit_str` warns:

> It is never correct to implement `visit_string` without implementing `visit_str`. Implement neither, both, or just `visit_str`.
However, there is no similar warning for `visit_borrowed_str` even though implementing only it will cause many parsers to always error (e.g., `toml`). Even more problematic are the `visit_*` methods for "integers", which probably should have a warning, or even a requirement, that one must implement `visit_i64`/`visit_u64` any time any of the other "integer" methods are implemented. I've been bitten by this a few times after enough time has passed between encountering it. For example, if you have a type `Even(u32)` that requires the contained `u32` to be even and you define an `EvenVisitor` that only implements `Visitor::visit_u32`, many parsers will error since many will call `EvenVisitor::visit_u64` (whose default implementation returns an "invalid type" error) even if you hint for the `Deserializer` to call `Visitor::visit_u32` via `Deserializer::deserialize_u32`. For example:
```rust
use core::fmt::{self, Formatter};
use serde::de::{Deserialize, Deserializer, Error, Unexpected, Visitor};

fn main() {
    let json = b"2";
    // This will be an "invalid type" error; however, if I uncomment
    // `EvenVisitor::visit_u64` below, then this will succeed.
    match serde_json::from_slice::<Even>(json.as_slice()) {
        Ok(Even(v)) => println!("parsed even value: {v}"),
        Err(err) => println!("error: {err}"),
    }
}

pub struct Even(u32);

impl<'de> Deserialize<'de> for Even {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        struct EvenVisitor;
        impl Visitor<'_> for EvenVisitor {
            type Value = Even;
            fn expecting(&self, formatter: &mut Formatter<'_>) -> fmt::Result {
                formatter.write_str("even 32-bit unsigned integer")
            }
            fn visit_u32<E>(self, v: u32) -> Result<Self::Value, E>
            where
                E: Error,
            {
                if v & 1 == 1 {
                    Err(E::invalid_value(
                        Unexpected::Unsigned(u64::from(v)),
                        &"even 32-bit unsigned integer",
                    ))
                } else {
                    Ok(Even(v))
                }
            }
            // I must implement this in addition to, or instead of, `Visitor::visit_u32`.
            // fn visit_u64<E>(self, v: u64) -> Result<Self::Value, E>
            // where
            //     E: Error,
            // {
            //     u32::try_from(v)
            //         .map_err(E::custom)
            //         .and_then(|val| self.visit_u32(val))
            // }
        }
        deserializer.deserialize_u32(EvenVisitor)
    }
}
```
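For what it's worth, the reason the hint doesn't help here is that `serde_json` (at least without its `arbitrary_precision` feature) parses the token `2` into a `u64` and calls `Visitor::visit_u64` regardless of which integer `deserialize_*` method was used, and `visit_u64`'s default implementation returns an "invalid type" error.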