Serde ignore leading garbage

I would like to deserialize data from some XML of the form:

<garbage />
<more_garbage>
    <data key="val">more_data</data>
</more_garbage>
<yet_more_garbage />

Serde is an awesome crate with some support for XML through serde-xml-rs and quick-xml. It would be awfully convenient if I could deserialize data using serde. The challenge is how to ignore the surrounding garbage. It should be possible to have serde ignore trailing garbage, but I am not sure if there is an easy way to ignore leading garbage.

EDIT: See solution below. I am still open to alternatives if anyone has a better idea.

One possible solution would be to read the XML into a string, search the string for the index of some sentinel value like <data, and then ask serde to deserialize a slice starting at the data of interest. This solution is brittle because XML allows considerable variation in what a sentinel value might look like. For example, <data key="val"> and <data key = 'val' > are different string literals that are both valid XML.
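
As a rough illustration of the sentinel idea and why it is brittle, here is a minimal sketch using only the standard library. The helper name slice_from_sentinel is hypothetical; the returned slice is what you would then hand to a deserializer.

```rust
/// Returns the part of `xml` that starts at the first occurrence of `sentinel`,
/// or `None` if the sentinel never appears. (Hypothetical helper for illustration.)
fn slice_from_sentinel<'a>(xml: &'a str, sentinel: &str) -> Option<&'a str> {
    xml.find(sentinel).map(|start| &xml[start..])
}

fn main() {
    // The happy path: the sentinel is found and the slice starts at <data ...>.
    let xml = r#"<garbage /><more_garbage><data key="val">more_data</data></more_garbage>"#;
    let tail = slice_from_sentinel(xml, "<data").expect("sentinel present");
    assert!(tail.starts_with(r#"<data key="val">"#));

    // Same element with different (still valid) spacing: a full-tag sentinel misses it.
    let spaced = r#"<garbage /><data key = "val" >more_data</data>"#;
    assert!(slice_from_sentinel(spaced, r#"<data key="val">"#).is_none());

    // A short sentinel matches an unrelated tag that merely shares the prefix.
    let misleading = r#"<database /><data key="val">more_data</data>"#;
    let wrong = slice_from_sentinel(misleading, "<data").unwrap();
    assert!(wrong.starts_with("<database"));
}
```

A longer sentinel is too strict and a shorter one is too loose, which is the brittleness described above.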

Do any veteran serde users know of a more elegant way to deserialize data with leading garbage?

Honestly I have no idea how serde works and if what you want to do is even possible to do efficiently (without writing some custom regex that matches all valid sections), but here is what I would do, and maybe improve on it later on if benchmarks indicate this is a good place to optimize.

An easy solution would be testing every offset until serde tells you there is something valid at that offset. Certainly not a very fast solution, and it is even questionable whether it might find something inside your garbage, but it is easy to implement for sure. If performance is of no (major) concern I would at least try it.
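
A minimal sketch of that try-every-offset idea, using only the standard library. The names first_parsable and parse_leading_number are hypothetical, and the number parser is only a stand-in for whatever serde-based deserializer call you would actually try at each offset.

```rust
use std::num::ParseIntError;

/// Stand-in "deserializer": succeeds only if `s` begins with decimal digits.
fn parse_leading_number(s: &str) -> Result<u32, ParseIntError> {
    let end = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    s[..end].parse::<u32>()
}

/// Tries `parse` at every character boundary of `input` and returns the first
/// offset at which it succeeds, together with the parsed value.
fn first_parsable<'a, T, E>(
    input: &'a str,
    mut parse: impl FnMut(&'a str) -> Result<T, E>,
) -> Option<(usize, T)> {
    input
        .char_indices()
        .find_map(|(offset, _)| parse(&input[offset..]).ok().map(|value| (offset, value)))
}

fn main() {
    // Two bytes of leading garbage are skipped before the parser succeeds.
    assert_eq!(first_parsable("xx42yy", parse_leading_number), Some((2, 42)));

    // The first success is not necessarily the one you wanted:
    // the "1" in "v1" is accepted before the scan ever reaches "42".
    assert_eq!(first_parsable("v1 then 42", parse_leading_number), Some((1, 1)));

    // Nothing parsable anywhere.
    assert_eq!(first_parsable("abc", parse_leading_number), None);
}
```

The scan calls the parser once per offset, so the cost grows with the amount of leading garbage, and the second assertion shows how a match inside the garbage can win.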

The only way to skip leading garbage correctly is to understand the structure of the garbage and parse it. I mean, consider if you were trying to find something in an HTML file. The "leading garbage" in that HTML file could easily contain a comment with something that looks identical to what you want to parse.
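
To make the comment hazard concrete, here is a small standard-library sketch. The function find_outside_comments is a hypothetical helper that only understands <!-- ... --> comments; it knows nothing about CDATA sections, processing instructions, or attribute values, so it is an illustration of the problem rather than a fix.

```rust
/// Finds `needle` in `xml`, skipping over `<!-- ... -->` comments.
/// Deliberately simplistic: comments are the only structure it understands.
fn find_outside_comments(xml: &str, needle: &str) -> Option<usize> {
    let mut pos = 0;
    loop {
        let rest = &xml[pos..];
        let hit = rest.find(needle)?;
        match rest.find("<!--") {
            // A comment opens before the hit: jump past its end and search again.
            Some(open) if open < hit => {
                let close = rest[open..].find("-->")?;
                pos += open + close + "-->".len();
            }
            // No comment in the way: the hit is outside any comment.
            _ => return Some(pos + hit),
        }
    }
}

fn main() {
    let html = r#"<!-- <data key="old">stale</data> --><data key="val">real</data>"#;

    // A naive search lands inside the comment.
    let naive = html.find("<data").unwrap();
    assert!(html[naive..].starts_with(r#"<data key="old">"#));

    // Skipping comments finds the element that is actually part of the document.
    let aware = find_outside_comments(html, "<data").unwrap();
    assert!(html[aware..].starts_with(r#"<data key="val">"#));
}
```

Every additional construct you need to skip correctly brings this closer to being a real parser, which is the point made above.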

@ExpHP, this is a valid point. The unstated assumption here is that the leading garbage is XML. Serde understands the structure of XML via serde-xml-rs and quick-xml. It may be that I need to use these projects to parse through the leading garbage until I encounter the sentinel value. I am curious if there is a more straightforward/general way to achieve this using serde.

In answer to my own question, it is not too difficult to do this using quick-xml. Whether it is a bug or a feature, quick-xml already ignores tags not explicitly specified in the serde-derived structure to be deserialized. In the simplest case the data of interest is just under the root element:

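// Cargo.toml:
// [dependencies]
// serde = { version = "1", features = ["derive"] }
// quick-xml = { version = "^0", features = ["serialize"] }
//
// Note: "^0" accepts any 0.x release. The code below uses the quick-xml API
// as it was when this was written (`read_event(&mut buf)`, `"$value"`);
// later 0.x releases changed these, so pin a release from that era if it
// fails to compile.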

// Type-erased errors
type BoxError = std::boxed::Box<dyn
	std::error::Error
	+ std::marker::Send
	+ std::marker::Sync
>;

// For deserializing <data> tag
#[derive(Debug, serde::Deserialize)]
struct Data {
    key: String,
    #[serde(rename = "$value")]
    text: String,
}

// For deserializing document containing data
#[derive(Debug, serde::Deserialize)]
struct DataEnvelope {
    data: Data,
}


fn main() -> Result<(), BoxError> {
    // Deserialize directly when data is not deeply nested in garbage.
    let xml_data = r#"
        <document>
            <useless>stuff</useless>
            <data key="val">text</data>
            <garbage />
        </document>
    "#;
    let envelope: DataEnvelope = quick_xml::de::from_str(xml_data)?;
    println!("data key: {}, text: {}", envelope.data.key, envelope.data.text);
    Ok(())
}

In this example the DataEnvelope structure matches the <document> tag and its data member matches the <data> tag, which is deserialized as a Data struct. Note how the <useless> and <garbage> tags were ignored.

It's not much more difficult to handle the more general case where the data of interest is nested to some arbitrary level within the root document. We can use a quick_xml::Reader to iterate over each "event" in the XML. The approach is similar to the solution proposed by @raidwas, which was to have serde try to deserialize the data at every offset until it finds valid data. Instead of every offset we can try just the offsets corresponding to the starts of XML tags. In addition to being more efficient, this has the advantage of not accidentally deserializing the data from inside an XML comment.

// Deserialize indirectly when data is arbitrarily nested.
let xml_data = r#"
    <garbage>
        <more_garbage />
        <inner_tag>
            <useless>stuff</useless>
            <data key="val">text</data>
            <garbage />
        </inner_tag>
        <more and="more"><garbage></garbage></more>
    </garbage>
"#;
let mut buf = Vec::new(); // reusable buffer for quick-xml
let mut xml_reader = quick_xml::Reader::from_str(xml_data);
let mut prev_pos: usize = 0;
loop {
    match xml_reader.read_event(&mut buf)? {
        quick_xml::events::Event::Start(_ /*tag*/) => { // iterate over tag starts
            
            // Optional: if we are sure data is inside an <inner_tag>
            // as opposed to a tag with some other name then we could:
            // if &tag.local_name() == b"inner_tag" {
            
            match quick_xml::de::from_str::<DataEnvelope>(&xml_data[prev_pos..]) {
                Ok(envelope) => {
                    // Found <inner_tag> and was able to deserialize <data>
                    println!("data key: {}, text: {}", envelope.data.key, envelope.data.text);
                    break;
                },
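                // This tag didn't match. Fall through (rather than `continue`)
                // so that prev_pos is still updated at the bottom of the loop;
                // otherwise directly adjacent tags would retry a stale offset.
                Err(_) => (),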
            }
            
            // }
            
        }
        quick_xml::events::Event::Eof => {
            // Reached end of xml without finding <data>
            eprintln!("Couldn't find data in garbage.");
            break;
        },
        _ => (), // ignore other events
    }
    prev_pos = xml_reader.buffer_position();
}

In this example serde matched <inner_tag> as the DataEnvelope, but it would have matched any tag with a valid <data> tag nested inside of it. If we are certain that the data will always be nested inside of an <inner_tag> then we can make the code yet more efficient by uncommenting the if.
