Xml-rs looking for ways to make parsing code more compact

Im parsing XML with xml-rs. I'm looking for ways to make my code more compact, since my current code looks overly verbose and the really relevant parts (the local_name I'm looking for and partially the namespace) sit somewhere within lots of boiler plate. This is my current code (note that as the parsing grows there will be more branches in the match that will basically look the same on the pattern side and that the called function find_sections(parser) will basically have the same shape, just looking for different XML elements etc.):

const NS_2010_MAIN: &str = "http://schemas.microsoft.com/office/powerpoint/2010/main";

fn find_section_list (parser: &mut impl Iterator<Item=xml::reader::Result<XmlEvent>>) {    
    while let Some(e) = parser.next() {
        match e {        
            Ok(StartElement {
                name:
                    xml::name::OwnedName {
                        ref local_name,
                        namespace: Some(ref ns),
                        ..
                    },
                ..
            }) if local_name == "sectionLst" && ns == NS_2010_MAIN => {                     
                find_sections(parser);
            }
            Ok(_) => {}
            Err(_) => {
                 panic!("Encountered error parsing XML");
            }
        }
    }
}

I've had these initial ideas, but both fail: (1) it would be nice to be able to write

/* does not compile */
match e {        
    Ok(StartElement {
        name:
             xml::name::OwnedName {
                 local_name: "sectionLst",
                 namespace: Some(NS_2010_MAIN),
                 ..
             }
    }) => { ... }

But this doesn't work since local_name and namespace have types String and Option not &str and Option<&str>

(2) I had the idea to use a macro, but if I understand correctly, a macro can only produce a pattern, but not a match arm (pattern plus the if-guard) that I would need here:

/* does not compile */
macro_rules! start_elem {
    ($ln:expr, $ns:expr) => (
        Ok(StartElement {
            name:
                xml::name::OwnedName {
                    ref local_name,
                    namespace: Some(ref ns),
                    ..
                },
            ..
        }) if local_name == $ln && ns == $ns
    )
}

match e {        
    start_elem!("sectionLst", NS_2010_MAIN) => { ... }

I'm looking for ways or good ideas to make this parsing more manageable.

Let's golf this code a bit, then!

Given that you handle errors by panicking, you can use expect() for now. If you move to returning Result later on, then you'll probably want to move to ? instead.

fn find_section_list (parser: &mut impl Iterator<Item=xml::reader::Result<XmlEvent>>) {    
    while let Some(e) = parser.next() {
        match e.expect("Encountered error parsing XML") {        
            StartElement {
                name:
                    xml::name::OwnedName {
                        ref local_name,
                        namespace: Some(ref ns),
                        ..
                    },
                ..
            }) if local_name == "sectionLst" && ns == NS_2010_MAIN => {                     
                find_sections(parser);
            }
            _ => {}
        }
    }
}

After doing that, it would be desirable to reduce the pattern noise a little bit. We can do this by splitting the complex pattern matching that you are doing in two parts:

  • Check that we're actually looking at a StartElement
  • Check that the local name and namespace match our expectation

You mentioned earlier that you add match arms to handle XML elements other than StartElement later on, so let's keep the first pattern matching as a match. Otherwise, I would turn it into an if let in order to reduce the level of nesting.

        /* ... */
        match e.expect("Encountered error parsing XML") {        
            StartElement { ref name, .. } => {
                if let Some(ref ns) = name.namespace {
                    if name.local_name == "sectionLst" && ns == NS_2010_MAIN => {                     
                        find_sections(parser);
                    }
                }
            },
            _ => {}
        }
        /* ... */

At this stage, we've gotten rid of the intimidating large pattern that you had initially. But we paid for it by increasing the level of nesting.

In an ideal world, we could fix this by replacing the nested if let { if {} } chain with this:

                /* ... */
                if name.local_name == "sectionLst" && ns == Some(NS_2010_MAIN) => {                     
                    find_sections(parser);
                }
                /* ... */

Sadly, however, we can't, because Option<T> can only be compared to Option<T>, and not to Option<U> where U: PartialEq<T>. It's not clear to me why...

...but anyway, we can work around this limitations of the top-level Option interface by peeking into the option's inner data ourselves:

                /* ... */
                let ns_ok = name.namespace.map_or(false, |ns| ns == NS_2010_MAIN);
                if name.local_name == "sectionLst" && ns_ok => {                     
                    find_sections(parser);
                }
                /* ... */

Putting it all together, we get the following:

fn find_section_list (parser: &mut impl Iterator<Item=xml::reader::Result<XmlEvent>>) {    
    while let Some(e) = parser.next() {
        match e.expect("Encountered error parsing XML") {        
            StartElement { ref name, .. } => {
                let ns_ok = name.namespace.map_or(false, |ns| ns == NS_2010_MAIN);
                if name.local_name == "sectionLst" && ns_ok => {                     
                    find_sections(parser);
                }
            },
            _ => {}
        }
    }
}
3 Likes

Hi @mieckert,

I've recently started working on feed-rs and wanted to switch it from the DOM based parser xml5ever to xml-rs

I then ran into the same problems you did - the parsing code disappeared in the midst of lots of boilerplate.

I ended writing a wrapper for the EventReader which provides a tree style iteration - you call children() on an Element and receive an iterator that terminates at the end of that scope. An example of the parsing code:

// Handles an Atom <author> or <contributor>
fn handle_person<R: Read>(element: Element<R>) -> Person {
    let mut person = Person::new(String::from("unknown"));

    for child in element.children() {
        let tag_name = child.name.local_name.as_str();
        let child_text = child.child_as_text();
        match (tag_name, child_text) {
            // Extract the fields from the spec
            ("name", Some(name)) => person.name = name,
            ("uri", uri) => person.uri = uri,
            ("email", email) => person.email = email,

            // Nothing required for unknown elements
            _ => {}
        }
    }

    person
}

For your case, you'd be able to write something like the below I presume:

for child in element.children() {
        let tag_name = child.name.local_name.as_str();
        match (tag_name, child.namespace) {
...

If it is of interest, I uploaded it as a gist - element_source.rs · GitHub

The gist contains ElementSource which takes a Read source and internally creates the EventReader, along with some tests + a sample XML file.

(disclaimer: I'm very new to Rust, so the code is possibly horrid :slight_smile: )

2 Likes

Thanks! That has definitively put me on the right track. I found that xml-rs has a function name.namespace_ref() to obtain an Option<&str> instead of Option<String>. So this has now led me to the following (reverting to having the if in a guard expression):

fn find_section_list(parser: &mut impl Iterator<Item = xml::reader::Result<XmlEvent>>) -> xml::reader::Result<()> {
    while let Some(e) = parser.next() {
        match e? {
            StartElement { ref name, .. }
                 if name.local_name == "sectionLst"
                    && name.namespace_ref() == Some(NS_2010_MAIN) =>
            {
                 find_sections(parser)?;
            }
            EndDocument { .. } => { return Ok(()); }
            _ => {}
        }
    }

    // Note: any errors are returned prior through e? and the ending document case is handled in 
    // the match.  Therefore, we really shouldn't be able to reach this point
    panic!("Error parsing XML");
}

It still doesn't make me shout "yay", but it looks a whole lot better than what I started with. (I guess Mark's approach of transforming the structure into something "more matchable" is worth looking into for me as well.)

1 Like

Glad it worked out!

By the way, I figured out meanwhile why the standard library does not allow comparing Option<String> and Option<&str> directly by implementing Option<T>: PartialEq<Option<U>> where T: PartialEq<U>.

The problem, as often with Option, is that adding this feature without extra precautions makes type inference undecidable. If there is such a generic implementation of PartialEq around, then when someone writes some_option == None, the compiler cannot figure out what is the type of None anymore. It could be an Option of anything. Which forces use of unsatisfying workarounds such as None::<T>.

Long ago, there was a proposal to fix this by providing type inference with a default type to fallback to in such cases. Such a fix would also enable other interesting features, like indexing array-ish types with any integer type instead of just usize. However, the corresponding rustc feature development has stalled due to unresolved design questions and lack of interest...

2 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.