Leniently parsing data into strictly typed form while staying interoperable with the horrible 3rd party implementations out there

Background: I'm building a UPnP library from the ground up (for reasons ...) and am looking for some advice on parsing invalid messages while maintaining interoperability with all the non-conforming implementations out there ...

The problem: There is a spec which clearly defines mandatory & optional fields, value formats and invariants. I've been running a listener to grab real-life messages and have yet to find a single one which actually conforms to the spec.

Example: UPnP defines a specific Uri format which begins: uuid:<device-uuid>:: (V2 repeatedly states that <device-uuid> must be a UUID; V1 didn't, it just included the term and a reference to the spec in the glossary ...). So I have a message from a speaker which decided to use uuid:MANUF-SERIALNUMBER:: :face_with_spiral_eyes: :face_vomiting:

The Question: So how do I meaningfully parse & work with flaky, non-conforming inputs and stay interoperable? This feels like it's a specific case of a general problem ...

Ideas:

  • Stringly type everything (so instead of parsing the Uri into a struct including uuid: Uuid use uuid: String, or just do away with the struct altogether and store the whole uri as a string!) - negates my "reasons"
  • wrap all kinds of stuff in an enum Lenient<T> with variants Valid(T) and Invalid(String) - not sure how this would affect overall performance. I'd rather have some general understanding before going off on a long experiment with benchmarks and significant code changes, just to see ...
  • a trait Lenient<T> offering fn get_lenient(&header, key) which then constructs viable default values, or more likely often works in combo with the enum above.
  • just store valid values, wrap everything in an Option and let downstream worry about validating they have enough info to work with. (Not my style and I'd probably be better just using something that's already out there instead of doing this)
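
For reference, the Lenient<T> enum idea could look roughly like the sketch below. It assumes the target type implements FromStr; the parse and valid helpers are illustrative names, not part of any existing crate. Invalid input is kept verbatim so nothing is lost on the receiving side.

```rust
use std::str::FromStr;

/// Either a successfully parsed value or the raw text that failed to parse.
#[derive(Debug, Clone, PartialEq)]
pub enum Lenient<T> {
    Valid(T),
    Invalid(String),
}

impl<T: FromStr> Lenient<T> {
    /// Never fails: input that does not parse is stored verbatim in `Invalid`.
    /// (Hypothetical helper; assumes `T: FromStr` is the strict parser.)
    pub fn parse(raw: &str) -> Self {
        match raw.parse::<T>() {
            Ok(value) => Lenient::Valid(value),
            Err(_) => Lenient::Invalid(raw.to_owned()),
        }
    }

    /// The typed value, if the input conformed.
    pub fn valid(&self) -> Option<&T> {
        match self {
            Lenient::Valid(value) => Some(value),
            Lenient::Invalid(_) => None,
        }
    }
}
```

Something like Lenient::<u16>::parse("1900") would then give Valid(1900), while a non-numeric string ends up in Invalid with the original text intact. The enum is an ordinary tagged union, so access costs a match on the discriminant; whether that matters in practice would still need measuring.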

Or is there an obvious, and well-known solution that I'm just not connecting with this case right now? (Although usually in those cases I work it out just as I finish asking for help ...)

A UUID is practically a random string, so you don't get much benefit from parsing it (other than being able to store it in a fixed-size allocation).

In your position, I would define struct DeviceId(String);[1] so that I have a type for internal type checking, without attempting to parse the string or make any assumption that it is a UUID.

In cases where there is use for parsing the possibly erroneous contents, I don't have a better suggestion than your enum Lenient, but I would specifically recommend that you not parse data that you don't have to.
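
A minimal sketch of that opaque newtype, assuming only the uuid:<id>:: shape from the question. The from_uri and as_str names are made up for illustration, and the identifier is stored as-is without any check that it is a real UUID.

```rust
/// Opaque device identifier: stored verbatim, never parsed as a UUID.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DeviceId(String);

impl DeviceId {
    /// Hypothetical helper: takes a string starting with `uuid:` and keeps
    /// everything up to the first `::` separator (or to the end if absent).
    pub fn from_uri(uri: &str) -> Option<Self> {
        let rest = uri.strip_prefix("uuid:")?;
        // `split` always yields at least one item, so `next()` is `Some`.
        let id = rest.split("::").next()?;
        if id.is_empty() {
            None
        } else {
            Some(DeviceId(id.to_owned()))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```

With this, uuid:MANUF-SERIALNUMBER:: and a spec-conforming UUID both produce a usable DeviceId that can serve as a map key, and the rest of the code never needs to know which kind it got.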


  1. or perhaps with Arc<str> as the string type ↩︎

yeah ... if it were just the uuid that was a problem ... Sadly each different device sems to have a different foible where they do something that's not-quite-up-to-spec.

Hadn't thought about Arc<str> rather than String, I'll keep that in mind as a 3rd option (alongside Cow) when I start benchmarking for memory & CPU load.

One of the goals of the lib is to be very strict when generating messages and as lenient as possible when receiving them. So, I really need (and want) to store data in strongly typed form if I'm going to be using the same types for generating and interpreting messages.

You could do something like this:

pub struct SomeType(pub(crate) String);

impl SomeType {
    pub fn new(content: &str) -> Option<Self> {
        // be very picky about what to accept
        todo!()
    }
}

Now users have to go through SomeType::new(), which enforces all invariants. But the parser, or other parts of the crate, can just construct SomeType with arbitrary content.
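
Filled in, this might look like the sketch below. The invariant chosen here (hyphenated 8-4-4-4-12 hex UUID text) is only an assumed example, and from_wire and is_strictly_valid are hypothetical names for the crate-internal lenient path.

```rust
pub struct SomeType(pub(crate) String);

impl SomeType {
    /// Strict public constructor: rejects anything that breaks the invariant.
    pub fn new(content: &str) -> Option<Self> {
        if is_hyphenated_uuid(content) {
            Some(SomeType(content.to_owned()))
        } else {
            None
        }
    }

    /// Lenient crate-internal constructor for the parser: accepts anything.
    pub(crate) fn from_wire(content: &str) -> Self {
        SomeType(content.to_owned())
    }

    /// Lets the generating side refuse to re-emit values that came in broken.
    pub fn is_strictly_valid(&self) -> bool {
        is_hyphenated_uuid(&self.0)
    }
}

/// Assumed invariant for this sketch: 36 bytes, hyphens at offsets
/// 8, 13, 18 and 23, ASCII hex digits everywhere else.
fn is_hyphenated_uuid(s: &str) -> bool {
    s.len() == 36
        && s.bytes().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}
```

The trade-off is that the type alone no longer guarantees validity, so code that generates messages has to check is_strictly_valid (or an equivalent) before writing a received value back out.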

I don't know much about the spec, but if there are sets of fields that are used together when present, or whose absence degrades things in different (interesting) ways, then I'd be tempted to model each logical group of related fields as an enum.
Kind of mirroring the logic for how groups of fields are used in your implementation.

I'm thinking that will help your implementation/uses be able to distinguish between WellFormedX, DegradedY, DegradedZ, and Unknown variants without having to re-parse (in a manner of speaking) via Option::zip or similar. You could have functions operate on specific variants without having to repeat parsing "which broken variant is this" at use sites.
This assumes that there's a closed set of variations you plan to handle, and even enumerating those now might help bring some clarity / definition of scope.
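
As a rough sketch of that idea, here is one made-up group of two related fields, location and max_age. The field names, variants and fallback behaviour are all assumptions for illustration and are not taken from the spec.

```rust
/// One enum per logical group of fields; each variant is a known shape.
#[derive(Debug, PartialEq)]
pub enum Announcement {
    /// Both fields present and parseable.
    WellFormed { location: String, max_age_secs: u32 },
    /// Location is usable, but max-age is missing or not a number.
    NoUsableMaxAge { location: String },
    /// Nothing to work with.
    Unknown,
}

impl Announcement {
    /// Decide once, at parse time, which shape the input has.
    pub fn classify(location: Option<&str>, max_age: Option<&str>) -> Self {
        let secs = max_age.and_then(|s| s.trim().parse::<u32>().ok());
        match (location, secs) {
            (Some(loc), Some(secs)) => Announcement::WellFormed {
                location: loc.to_owned(),
                max_age_secs: secs,
            },
            (Some(loc), None) => Announcement::NoUsableMaxAge {
                location: loc.to_owned(),
            },
            (None, _) => Announcement::Unknown,
        }
    }

    /// Use sites match on the variant instead of re-checking the fields.
    pub fn expiry_secs(&self, fallback: u32) -> Option<u32> {
        match self {
            Announcement::WellFormed { max_age_secs, .. } => Some(*max_age_secs),
            Announcement::NoUsableMaxAge { .. } => Some(fallback),
            Announcement::Unknown => None,
        }
    }
}
```

Each new device quirk you decide to support becomes a new variant, and the compiler then points at every match that needs to handle it.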