In answer to my own question, it is not too difficult to do this using quick-xml. Whether a bug or a feature, quick-xml already ignores tags not explicitly specified in the serde-derived structure to be deserialized. In the simplest case the data of interest is just under the root element:
// Cargo.toml:
// [dependencies]
// serde = { version = "1", features = ["derive"] }
// quick-xml = { version = "^0", features = ["serialize"] }

// Type-erased errors
type BoxError = std::boxed::Box<dyn
    std::error::Error
    + std::marker::Send
    + std::marker::Sync
>;

// For deserializing <data> tag
#[derive(Debug, serde::Deserialize)]
struct Data {
    key: String,
    #[serde(rename = "$value")]
    text: String,
}

// For deserializing document containing data
#[derive(Debug, serde::Deserialize)]
struct DataEnvelope {
    data: Data,
}

fn main() -> Result<(), BoxError> {
    // Deserialize directly when data is not deeply nested in garbage.
    let xml_data = r#"
        <document>
            <useless>stuff</useless>
            <data key="val">text</data>
            <garbage />
        </document>
    "#;
    let envelope: DataEnvelope = quick_xml::de::from_str(xml_data)?;
    println!("data key: {}, text: {}", envelope.data.key, envelope.data.text);
    Ok(())
}
In this example the DataEnvelope structure matches the <document> tag and its data member matches the <data> tag, which is deserialized as a Data struct. Note how the <useless> and <garbage> tags were ignored.
It's not much more difficult to handle the more general case where the data of interest is nested to some arbitrary level within the root document. We can use a quick_xml::Reader to iterate over each "event" in the XML. The approach is similar to the solution proposed by @raidwas, which was to have serde attempt to deserialize the data at every offset until it finds valid data. Instead of every offset, we can try just the offsets corresponding to the starts of XML tags. In addition to being more efficient, this has the advantage of not accidentally deserializing data from inside an XML comment.
// Deserialize indirectly when data is arbitrarily nested.
let xml_data = r#"
    <garbage>
        <more_garbage />
        <inner_tag>
            <useless>stuff</useless>
            <data key="val">text</data>
            <garbage />
        </inner_tag>
        <more and="more"><garbage></garbage></more>
    </garbage>
"#;
let mut buf = Vec::new(); // reusable buffer for quick-xml
let mut xml_reader = quick_xml::Reader::from_str(xml_data);
let mut prev_pos: usize = 0; // offset where the current event began
loop {
    // Note: this is the older Reader API, where read_event takes the buffer.
    match xml_reader.read_event(&mut buf)? {
        quick_xml::events::Event::Start(_ /*tag*/) => { // iterate over tag starts
            // Optional: if we are sure data is inside an <inner_tag>
            // as opposed to a tag with some other name then we could:
            // if &tag.local_name() == b"inner_tag" {
            match quick_xml::de::from_str::<DataEnvelope>(&xml_data[prev_pos..]) {
                Ok(envelope) => {
                    // Found <inner_tag> and was able to deserialize <data>
                    println!("data key: {}, text: {}", envelope.data.key, envelope.data.text);
                    break;
                },
                // This tag didn't match. Fall through (rather than `continue`)
                // so prev_pos is still advanced past it; otherwise a directly
                // adjacent opening tag would be tried from a stale offset.
                Err(_) => (),
            }
            // }
        }
        quick_xml::events::Event::Eof => {
            // Reached end of xml without finding <data>
            eprintln!("Couldn't find data in garbage.");
            break;
        },
        _ => (), // ignore other events
    }
    prev_pos = xml_reader.buffer_position();
    buf.clear(); // keep the reusable buffer from growing with every event
}
In this example serde matched <inner_tag> as the DataEnvelope, but it would have matched any tag with a valid <data> tag nested inside of it. If we are certain that the data will always be nested inside of an <inner_tag> then we can make the code yet more efficient by uncommenting the if.
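
To make the "only try offsets at tag starts" idea concrete without depending on a particular quick-xml version, here is a minimal std-only sketch. The helper name tag_start_offsets is hypothetical (it is not part of quick-xml), and it is deliberately not a real XML parser: it assumes no CDATA sections and no "<" inside attribute values. It only shows which offsets would be worth handing to the deserializer, and that tags inside comments are never candidates.

```rust
/// Hypothetical helper: byte offsets of every opening element tag in `xml`.
/// Comments, closing tags, processing instructions and declarations are skipped.
/// Assumption: no CDATA sections and no '<' inside attribute values.
fn tag_start_offsets(xml: &str) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut pos = 0;
    while let Some(rel) = xml[pos..].find('<') {
        let start = pos + rel;
        let rest = &xml[start..];
        if rest.starts_with("<!--") {
            // Jump past the whole comment so tags inside it are never candidates.
            match rest[4..].find("-->") {
                Some(end) => pos = start + 4 + end + 3,
                None => break, // unterminated comment: stop scanning
            }
            continue;
        }
        // An opening tag is '<' followed directly by a name-start character.
        let is_open = rest[1..]
            .chars()
            .next()
            .map_or(false, |c| c.is_alphabetic() || c == '_');
        if is_open {
            offsets.push(start);
        }
        pos = start + 1;
    }
    offsets
}

fn main() {
    let xml = r#"<garbage><!-- <data key="no">skip</data> --><inner_tag><data key="val">text</data></inner_tag></garbage>"#;
    for offset in tag_start_offsets(xml) {
        // Each xml[offset..] slice is a candidate for quick_xml::de::from_str.
        println!("candidate offset: {}", offset);
    }
}
```

In practice quick_xml::Reader is the better choice, since it handles the cases this sketch ignores; the point is only that the candidate set is small compared with trying every byte offset.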