Serde: How to create a Deserializer adapter to ignore duplicate fields in JSON data

I am researching how to create what dtolnay calls a "Deserializer adapter" for duplicate keys/fields in this issue comment on GitHub:

This should be implemented outside of Serde as a Deserializer adapter -- similar to how serde-ignored or serde-stacker wrap an existing Deserializer into their own Deserializer with extra behavior. Possible usage:

let mut j = serde_json::Deserializer::from_str(input);
let de = WhenDuplicateField::keep_first(&mut j);
let t = T::deserialize(de)?;

Link to comment

My problem is that I use a service that I can't control, that sometimes outputs duplicate fields in JSON output.

{
    "key": "abc",
    "data": 5,
    "key": "abc"
}

My first idea was that someone else must have had the same problem before, and published an adapter on crates.io already, but I could not find one. If one exists, I would happily use it.

My second idea was to create the adapter myself, with the idea that I need to keep track of the JSON field keys the wrapping Deserializer sees, and then just skip delegating to the serde_json::Deserializer in those cases.

I took a look at serde-ignored. There is an impl for MapAccess that has next_key_seed. In that one it's possible to keep track of the encountered keys, giving CaptureKey an &mut HashSet.

What I think I need help with is this: at what layer can you skip or reject certain keys?

Thanks.

Edit: To be clear, I want to deserialize into a struct, not a BTreeMap. Something like this:

#[derive(Deserialize)]
struct S {
    key: String,
    data: u8,
}

This happens at the Visitor layer. More specifically, when deserializing structs it happens in visit_map: Manually deserialize struct · Serde.

Thanks, but I'm still at a loss for how to handle this generically for any struct.

If we take the code from serde-ignored and modify it with your tip:

/// Forwarding impl to preserve context.
impl<'a, 'b, 'de, X, F> Visitor<'de> for Wrap<'a, 'b, X, F>
where
    X: Visitor<'de>,
    F: FnMut(Path),
{
    type Value = X::Value;

    // ... lots of methods

    fn visit_map<V>(self, mut visitor: V) -> Result<Self::Value, V::Error>
    where
        V: de::MapAccess<'de>,
    {
        while let Some(key) = visitor.next_key::<&str>()? {
            // What type is v? How do I deserialize it generically?
            let v = visitor.next_value()?;
        }
        // This was here before
        // self.delegate
        //     .visit_map(MapAccess::new(visitor, self.callback, self.path))
    }
}

I can't figure out how to continue generic deserialization when using next_key and next_value.

Any pointers would be appreciated!

You are visiting a map-like struct where the keys are strings and the values are anything that implements Deserialize<'de>. So, the next_key will give you the next key from the struct (i.e. key from your S struct), and the call to next_value would give you the corresponding value (String, following the given example).

If you provide a playground, I can help you poke into it.

The easiest, laziest solution is to deserialize keys as Strings and values as a dynamic value tree (e.g. serde_value), perform the filtering, put them all back in a dynamic map data structure, and turn that into a deserializer.

Thank you. What I've done is to clean up the serde-ignored code (removed parameters that are not needed in this case) and to add a new seen_keys: HashSet<String> field in MapAccess. There is also a println! in almost every method.

Thanks, but I'd rather not do this if I can avoid it.

Sorry, I gave it a try for a few hours... and failed miserably. I think that stateful deserialization is necessary in this case, but I couldn't find a way to do it.

Thank you for your effort! I'm surprised by how hard this is to do.

In the case of duplicated keys, do you want a specific behaviour such as keeping the first one, keeping the second one, or being able to customise it, or do you simply want it not to fail?

First or last, doesn't matter to me. The properties I want are:

  • Keep one of them
  • Don't fail
  • Works with all #[derive(Deserialize)] types.

The irony is that serde is internally generating stateful Deserialize code that keeps track of which fields have already been seen. It's actually easier and faster to generate the duplicate-ignoring code (though not significantly so, as you still need to handle arbitrary field order).

You might want to look into generating Deserialize yourself with a macro if you want something "cleaner".

I'm not 100% sure what you mean, but I think you're saying that this could (should?) have been implemented easily by serde itself, and that one way forward is to fork serde_derive and implement the fix at that layer instead?

Not serde, because then you have different traits and you'd also have to fork serde_json; but you could do this by forking serde_derive, the crate that serde re-exports when you enable the derive feature.

Once you look at what it's doing, you might find you don't need an entire proc-macro crate, but it should be pretty straightforward either way if I remember the code it generates.

So I can't fork only serde_derive, because it lives in the serde git repository?
