More permissive yaml deserialization

Here's a config file I would like to be able to write:

potential:
  - kc-z:
      # cutoff: 11
  - rebo:
      params: lammps
phonons:
  eigensolver: dense

Unfortunately, I am instead forced to write:

potential:
  - kc-z: {}
      # cutoff: 11
  - rebo:
      params:
        lammps: {}
phonons:
  eigensolver:
    dense: {}

Why? Well:

  • Thanks to the commented-out line, kc-z: actually has no key-value pairs. Empty mappings and empty sequences are inexpressible in YAML block syntax, so it ends up being equivalent to kc-z: null.
  • dense and lammps are supposed to be unitlike enum variants. However, even though they currently have no fields, there is a remote possibility that I might want to add some optional fields in the future. A struct variant is not capable of deserializing from a string like a unit variant is. Therefore, for future compatibility, I am forced to define the enum variants as empty struct variants with braces.
#[derive(Serialize, Deserialize)]
#[derive(Debug, Clone, PartialEq)]
#[serde(rename_all = "kebab-case")]
pub enum PhononEigensolver {
    #[serde(rename_all = "kebab-case")]
    Dense {}, // <-- braces in case fields are ever added

    // other variants...
}

It would be great if I could somehow get these types to be more permissive in their deserialization:

  • Allow null to be deserialized as an empty struct Foo {}, or as the body for an enum variant Enum::Foo {}.
  • Allow null to be deserialized as an empty Vec<_>.
  • Allow a string to be deserialized as an empty struct enum variant Enum::Foo {}.

Unfortunately, I think that all of this behavior is determined by the Deserialize impl, which means that my possible options are either to find a way to do this with #[serde] annotations, or to give up #[derive(Deserialize)]. I'm not sure about the former, and the latter is absolutely infeasible.

Is there a way to do this that I'm not seeing?

One (extreme) solution could be to reimplement your own YAML Deserializer.

Here's a pattern I like to allow null to turn into a specific case:

use serde::{Deserializer, Deserialize};

fn vec_null<'de, D: Deserializer<'de>>(d: D) -> Result<Vec<u32>, D::Error> {
    let res: Option<Vec<u32>> = Deserialize::deserialize(d)?;
    Ok(res.unwrap_or_default())
}

#[derive(Deserialize, Debug)]
struct A {
    #[serde(deserialize_with = "vec_null")]
    vec: Vec<u32>,
}

fn main() {
    let json = r#"{ "vec": null }"#;
    let a: A = serde_json::from_str(json).unwrap();
    println!("{:?}", a);
}

Serde reference.

As for turning some arbitrary string into something else, you can try:

use serde::{Deserializer, Deserialize};

/// Convert the string "nothing" into the empty vec.
fn vec_nothing_str<'de, D: Deserializer<'de>>(d: D) -> Result<Vec<u32>, D::Error> {
    #[derive(Deserialize)]
    #[serde(untagged)]
    enum Data<'a> {
        String(&'a str),
        Other(Vec<u32>),
    }
    match Data::deserialize(d)? {
        Data::String(s) => if s == "nothing" {
            Ok(Vec::new())
        } else {
            panic!("Return a proper error here")
        },
        Data::Other(vec) => Ok(vec),
    }
}

#[derive(Deserialize, Debug)]
struct A {
    #[serde(deserialize_with = "vec_nothing_str")]
    vec: Vec<u32>,
}


fn main() {
    let json = r#"{ "vec": "nothing" }"#;
    let a: A = serde_json::from_str(json).unwrap();
    println!("{:?}", a);

    let json = r#"{ "vec": [1, 2, 3] }"#;
    let a: A = serde_json::from_str(json).unwrap();
    println!("{:?}", a);
}

It will try each case in the enum in order until one succeeds.

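Note that if you replace the panic with a proper error, any other string is rejected with that error rather than being forwarded to the next case: by that point the untagged enum has already matched the String variant.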

Thanks. These suggestions help clear up how to solve the first two bullets. Alas, I still don't see a solution to the third bullet (which unfortunately is the biggest usability problem of them all!)

Ah, hmmmm. Yes, it looks like that could conceivably work.

Unfortunately it does not look like this could be done by wrapping serde_yaml's Deserializer, even if it were publicly exposed; I would only be able to call a single deserialize_* method on the inner Deserializer. Hence, I would have to reimplement it from scratch.

This makes me think, though, that perhaps this can be implemented inside serde_yaml itself, by adding configuration settings to the Deserializer (possibly through the Builder pattern).


Ah, yes, that definitely is doable for Vec. It can also be done for general structs:

use serde::{de::DeserializeOwned, Deserialize, Deserializer};

#[derive(Debug, Deserialize)]
struct Holder {
    #[serde(deserialize_with = "null_struct")]
    thing: Thing,
}

#[derive(Debug, Deserialize)]
struct Thing {
    #[serde(default = "thing_foo")] foo: i32,
    #[serde(default = "thing_bar")] bar: String,
}
fn thing_foo() -> i32 { 4 }
fn thing_bar() -> String { "hi".to_string() }

fn null_struct<'de, D: Deserializer<'de>, T: DeserializeOwned>(
    deserializer: D,
) -> Result<T, D::Error> {
    let res: Option<T> = Deserialize::deserialize(deserializer)?;
    Ok(res.unwrap_or_else(|| serde_json::from_str("{}").unwrap()))
}

It is unfortunate though that we need to annotate the places where the struct is used as a field, rather than its definition. This is bad because the condition for null_struct to be valid is that all fields must be optional (so we want the decision to use null_struct to be near the list of fields). The closest I can come to a solution is:

#[derive(Debug, Deserialize, PartialEq)]
struct MaybeNull<T: DeserializeOwned>(
    #[serde(deserialize_with = "null_struct")] T,
);

#[derive(Debug, Deserialize)]
struct Holder {
    thing: Thing,
}

type Thing = MaybeNull<ThingInner>;

#[derive(Debug, Deserialize)]
struct ThingInner {
    #[serde(default = "thing_foo")] foo: i32,
    #[serde(default = "thing_bar")] bar: String,
}
fn thing_foo() -> i32 { 4 }
fn thing_bar() -> String { "hi".to_string() }

For enum variants it gets a bit dumber yet; enum variants will need to be factored out into their own type so that null_struct can work. The following:

// initial definition

#[derive(Debug, Deserialize, PartialEq)]
enum Enum {
    A,
    B {
        #[serde(default = "enum_b_foo")] foo: i32,
        #[serde(default = "enum_b_bar")] bar: String,
    },
}
fn enum_b_foo() -> i32 { 4 }
fn enum_b_bar() -> String { "hi".to_string() }

fn main() {
    assert_eq!(
        serde_json::from_str::<Enum>(r#"{"B": {}}"#).unwrap(),
        Enum::B { foo: 4, bar: "hi".to_string() },
    )
}

must become:

#[derive(Debug, Deserialize, PartialEq)]
enum Enum {
    A,
    B(EnumB),
}

type EnumB = MaybeNull<EnumBInner>;

#[derive(Debug, Deserialize, PartialEq)]
struct EnumBInner {
    #[serde(default = "enum_b_foo")] foo: i32,
    #[serde(default = "enum_b_bar")] bar: String,
}
fn enum_b_foo() -> i32 { 4 }
fn enum_b_bar() -> String { "hi".to_string() }

fn main() {
    assert_eq!(
        serde_json::from_str::<Enum>(r#"{"B": null}"#).unwrap(),
        Enum::B(MaybeNull(EnumBInner { foo: 4, bar: "hi".to_string() })),
    )
}

This seems untenable for enum variants (which constitute, well... 100% of the cases where I want strings to act like something else), such as Enum::B above. I want "B" to be equivalent to {"B": {}} for such a variant. To do this I would have to manually derive Deserialize for the entire Enum.

Edit: Also, I'm not sure if there is any way to do this without losing error messages from data nested inside the variant. You had to use #[serde(untagged)] to accomplish this, which throws away all error messages and replaces them with "data did not match any variant of untagged enum Data".

P.S. Deserializing to &str is usually a bad idea methinks, due to the possibility of the input containing escape sequences. I would use Cow<'a, str>.

I'm sure you've probably seen it, but just in case:

This sounds similar to the situation covered in https://serde.rs/string-or-struct.html

Okay, so here's a typical enum: (I added a variant with a required field)

#[derive(Debug, Deserialize, PartialEq)]
#[serde(rename_all = "kebab-case")]
pub enum Enum {
    A,
    #[serde(rename_all = "kebab-case")]
    B {
        #[serde(default = "enum_b_foo")] foo: i32,
        #[serde(default = "enum_b_bar")] bar: String,
    },
    #[serde(rename_all = "kebab-case")]
    C {
        required_field: i32,
    },
}
fn enum_b_foo() -> i32 { 4 }
fn enum_b_bar() -> String { "hi".to_string() }

With the following set of utilities:

use serde::{de::{self, DeserializeOwned}, Deserialize, Deserializer};

// If the input is `null`, attempts to deserialize `{}` instead.
// (correctly forwarding the error message if that fails)
fn null_struct<'de, D: Deserializer<'de>, T: DeserializeOwned>(
    deserializer: D,
) -> Result<T, D::Error> {
    match Deserialize::deserialize(deserializer)? {
        None => from_empty_object(),
        Some(res) => Ok(res),
    }
}

fn from_empty_object<T: DeserializeOwned, E: de::Error>() -> Result<T, E> {
    // `from_str` would add misleading line/col information, so use `from_value`
    serde_json::from_value(serde_json::json!({})).map_err(E::custom)
}

I was able to get the desired parsing semantics—with no compromise on error message quality—by writing the following: (I got rid of MaybeNull because I decided that the error messages are better if I use null_struct on all variants, regardless of whether they have required fields)

use std::fmt;

#[derive(Debug, PartialEq)]
pub struct Enum(pub EnumInner);

#[derive(Debug, Deserialize, PartialEq)]
#[serde(rename_all = "kebab-case")]
pub enum EnumInner {
    A,
    #[serde(deserialize_with = "null_struct")]
    B(EnumB),
    #[serde(deserialize_with = "null_struct")]
    C(EnumC),
}

impl<'de> Deserialize<'de> for Enum {
    fn deserialize<D: Deserializer<'de>>(d: D) -> Result<Self, D::Error> {
        struct Visitor;

        impl<'de> de::Visitor<'de> for Visitor {
            type Value = EnumInner;

            fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
                formatter.write_str("string or map")
            }

            fn visit_str<E: de::Error>(self, value: &str) -> Result<Self::Value, E> {
                match value {
                    "a" => Ok(EnumInner::A),
                    "b" => Ok(EnumInner::B(from_empty_object()?)),
                    "c" => Ok(EnumInner::C(from_empty_object()?)),
                    _ => Err(E::unknown_variant(value, &["a", "b", "c"])),
                }
            }

            fn visit_map<M: de::MapAccess<'de>>(self, map: M) -> Result<Self::Value, M::Error> {
                Deserialize::deserialize(de::value::MapAccessDeserializer::new(map))
            }
        }

        d.deserialize_any(Visitor).map(Enum)
    }
}

#[derive(Debug, Deserialize, PartialEq)]
#[serde(rename_all = "kebab-case")]
pub struct EnumB {
    #[serde(default = "enum_b_foo")] pub foo: i32,
    #[serde(default = "enum_b_bar")] pub bar: String,
}
fn enum_b_foo() -> i32 { 4 }
fn enum_b_bar() -> String { "hi".to_string() }

#[derive(Debug, Deserialize, PartialEq)]
#[serde(rename_all = "kebab-case")]
pub struct EnumC {
    pub required_field: i32,
}

There are, of course, still numerous problems:

  • What was once 16 lines is now 57.
    • Technically, none of the Deserialize impl depends on the type except for visit_str. I could factor most of it out by adding another trait, but I doubt the savings in vertical real-estate will be huge.
  • When I add a variant to the enum, the compiler cannot remind me to add a branch to the match.
  • Inside the match, it is possible to write a typo such that e.g. "b" produces an EnumInner::C.
  • I need to repeat the variant keys in the match patterns, as well as in the unknown_variant function call.
  • This is difficult to automate with a macro, because the keys may be renamed and/or aliased using serde tags.

Here is a full playground with a meager test suite.

Welp. I tried writing a proc_macro to handle it.

I got it up to the point where it could turn this:

#[config_type]
#[derive(Debug, Serialize, Deserialize)]
struct Thing {
    #[serde(default = "thing_foo")] foo: i32,
    #[serde(default = "thing_bar")] bar: String,
}

into this

#[derive(Debug)]
struct Thing {
    foo: i32,
    bar: String,
}

const _: () = {
    // A type that exists solely for #[derive(Serialize)]
    #[derive(Serialize)]
    struct Ser<'a> {
        #[serde(default = "thing_foo")] foo: &'a i32,
        #[serde(default = "thing_bar")] bar: &'a String,
        #[serde(skip)] _marker: &'a (),
    }

    // A type that exists solely for #[derive(Deserialize)]
    #[derive(Deserialize)]
    struct De {
        #[serde(default = "thing_foo")] foo: i32,
        #[serde(default = "thing_bar")] bar: String,
    }

    impl<'de> Deserialize<'de> for Thing {
        fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
            // delegate to <De as Deserialize>::deserialize
            // but also add some custom logic
        }
    }

    impl Serialize for Thing {
        fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
            // delegate to <Ser as Serialize>::serialize
            // but also add some custom logic
        }
    }
};

forming a very basic template for how #[derive(Serialize, Deserialize)] can be patched with custom logic without having to split a type into some Thing/ThingInner pair.

Then my prospects started looking ugly:

  • So far I can only support structs. Enums are more complicated. (and they're the main impetus for this!)
  • My code will need to know the keys for the enum variants in order to add support for deserializing empty structs from strings. Serde doesn't expose its renaming logic. To be "correct" I should have my own rename annotations.
  • I can easily parse custom annotations using darling, and put them into some sort of config "tree" that's shaped like the AST, but...
  • ...once you reach the point where you have configuration attributes on containers, variants, and fields, it starts to become very unclear how the code should be structured. I end up with an awful lot of unreachable! because I have enums in the AST and in my configuration which are known to contain matching variants. syn's visitor and folding traits become useless because I can't easily access the configuration associated with an AST node.

It is readily apparent that this is no one-day project!