Serde: untagged enum ruins precise errors

Hello everyone,

I have a very nice piece of code using serde that works great. However, it has an issue that is causing more and more trouble as time goes on: terrible error messages.

The root of the problem is that the specification I'm trying to parse uses several alternative formats. Here is the type hierarchy I implemented to do this, from deepest to shallowest:

#[derive(Copy, Clone, Debug, PartialEq, PartialOrd, Deserialize)]
#[serde(try_from = "String")] // calls to a custom TryFrom<String> impl
pub struct TimeRange(f32, f32);

#[derive(Debug, Deserialize)]
pub(crate) struct FullInfo {
    /* list of fields, including "key" and "time range" */
}

#[derive(Debug, Deserialize)]
#[serde(untagged)]
pub(crate) enum ClipInfo {
    /// The full format: a bunch of specified options, all in the struct fields.
    Full(FullInfo),
    /// The compact format: (key, timerange) pairs with all other options defaulted.
    Compact(HashMap<String, TimeRange>),
}

/// The top-level file format. A single map of (filename, info) pairs.
#[derive(Debug)]
/* manual Deserialize impl as well, to e.g. verify path existence */
pub(crate) struct EditDecisionFile(HashMap<String, Vec<ClipInfo>>);

As you can see, the format of the "compact" and "full" is not the same. Seems perfect for an untagged enum, right?

It is, except error messages are ruined. That union means if anything anywhere inside has a parsing error -- a bad time range, a non-string, an accidental sequence instead of a map, a formatting mistake -- I only ever get one error message: data did not match any variant of untagged enum ClipInfo, with the pointer for line and column at the beginning of the item.

I have tried many things, but I always hit a wall.

  • I do not know how to "fold together" the different formats of items into a single struct when the compact format has arbitrary keys.
  • I used to manually parse out all the fields in the "full" format, with an extra field to detect if it was really compact format instead. That was a lot of work and error prone. The untagged union seems literally designed to avoid that.
  • I could add an Error variant to the enum, but that would get messy. In particular, I would need more than one, as different "successfully parsed incorrect value" types would be needed for different errors (e.g. "not a map" requires String, while "invalid timerange value" requires HashMap<String, String>).
  • I thought #[serde(other)] could solve this problem, but it doesn't work on untagged enums.

Finally, I even tried resorting to implementing my own untagged union, where I try to parse the full version and then fall back to the Compact version -- passing any error messages trying to parse that back up. But I can't do that, because serde doesn't expose the fallible construct inside the untagged enum logic that I need. Something like:

let unp: UnparsedValue = map.try_next_value()?; // error on EOF, syntax error, etc.
if let Ok(value) = unp.try_parse_as::<FullInfo>() {
    ClipInfo::Full(value)
} else  {
    ClipInfo::Compact(unp.try_parse_as::<HashMap<String, TimeRange>>()?)
}

Any guidance on how I approach this? Clearly a redesign is needed, but what?

You probably want something like Collect errors when deserializing untagged enums by killercup · Pull Request #1544 · serde-rs/serde · GitHub.

I wouldn't fully give up on manually implementing the untagged union yet. serde's Content type is unfortunately not stable (see Public alternative to Content and ContentRefDeserializer? · Issue #1947 · serde-rs/serde · GitHub), however you can try if some other value type is sufficient for your needs. There is the serde_value crate similar to the Content type. If you only care about a single format you can also use the value type they provide, such as serde_json::Value or toml::Value.

1 Like

Can't you avoid either extreme (full #[derive] vs. full manual impl) by only doing what the untagged deserializer does at a high level, except keeping track of the errors? What I mean is:

  1. Write impl Deserialize for ClipInfo manually; except:

  2. Delegate to the Deserialize impls of FullInfo and HashMap<String, TimeInfo>.

  3. Try both variants in order, but hold onto any eventual errors. You can circumvent consuming the deserializer by e.g. deserializing into a cloneable, dynamically-typed representation that is also a Deserializer, such as serde_json::Value. Something like this:

    use serde_json::Value;
    
    impl<'de> Deserialize<'de> for ClipInfo {
        fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
            // perhaps catch this error as well, if needed
            let value = Value::deserialize(deserializer)?;
            match FullInfo::deserialize(value.clone()) {
                Ok(full) => Ok(ClipInfo::Full(full)),
                Err(outer_error) => match HashMap::deserialize(value) {
                    Ok(compact) => Ok(ClipInfo::Compact(compact)),
                    Err(inner_error) => {
                        // Here you have access to both `outer_error` and `inner_error`
                    }
                }
            }
        }
    }
    

That is a clever trick, @H2CO3 and @jonasbb! I wasn't thinking about the fact that the format I'm parsing (YAML, in my case) had a Value that was itself Deserializable. I was trying to do it all within the deserializer itself!

After having spent about 15 minutes sketching out the actual code, and seeing what the error messages look like, it's not perfect.

The error messages point to the beginning of the ClipInfo, no matter what the problem is, and the optional Location in the serde_yaml::Error itself is always None. Never the less, it preserves the error message, and I can manually patch in extra text so that it mostly makes sense.

Until something like the PR @jonasbb mentioned gets merged, I will consider this the solution. Thanks!