Review my xml parser

After reading the rust book, here's my first attempt to write rust. I decided to port one of my existing projects that I regularly use to rust, starting with one of the things it does: parse some data out of a structured xml file.

use roxmltree::Document;
use std::collections::HashSet;

#[derive(Debug)]
struct Meta {
    title: Option<String>,
    passwords: Option<HashSet<String>>, // Can appear multiple times
    tags: Option<HashSet<String>>, // Can appear multiple times
    category: Option<String>,
}

fn main() {
    let xml_content = r#"
    <abc>
        <head>
            <meta type="title">Title</meta>
            <meta type="password">secret</meta>
            <meta type="password">secret2</meta>
            <meta type="tag">Tag</meta>
            <meta type="category">Cat</meta>
        </head>
    </abc>
    "#;
    let doc = Document::parse(xml_content).expect("Failed to parse XML");

    let mut title: Option<String> = None;
    let mut passwords: HashSet<String> = HashSet::new();
    let mut tags: HashSet<String> = HashSet::new();
    let mut category: Option<String> = None;

    for meta in doc.descendants().filter(|n| n.has_tag_name("meta")) {
        if let Some("title") = meta.attribute("type") {
            title = match meta.text() {
                Some(text) => Some(text.to_string()),
                None => None,
            };
        }
        if let Some("category") = meta.attribute("type") {
            category = match meta.text() {
                Some(text) => Some(text.to_string()),
                None => None,
            };
        }
        if let Some("password") = meta.attribute("type") {
            if let Some(text) = meta.text() {
                passwords.insert(text.to_string());
            }
        }
        if let Some("tag") = meta.attribute("type") {
            if let Some(text) = meta.text() {
                tags.insert(text.to_string());
            }
        }
    }
    let metadata = Meta {
        title,
        passwords: if passwords.is_empty() {
            None
        } else {
            Some(passwords)
        },
        tags: if tags.is_empty() { None } else { Some(tags) },
        category,
    };
    println!("{:?}", metadata);
    // Meta { title: Some("Title"), passwords: Some({"secret2", "secret"}), tags: Some({"Tag"}), category: Some("Cat") }
}

The implementation isn't complete and it's only parsing part of the document. There are also things it should do that it doesn't, e.g., this check should be case insensitive:

if let Some("title") = meta.attribute("type") { }

The python equivalent:

@dataclass
class Meta:
    title: str | None = None
    passwords: tuple[str, ...] | None = None
    tags: tuple[str, ...] | None = None
    category: str | None = None

I realised rust doesn't offer a null value, so I went with wrapping each member in Option<T>.

Some notes:

  • I would translate tuple with Vec instead of HashSet
  • You could match on meta.attribute("type").map(str::to_lowercase)
  • Personally, I like empty containers better than None because Some(Vec::new()) is uninhabited but the same applies to python
1 Like

That's a pretty good start, and you've clearly understood the fundamentals extremely well. Nice work.

I would offer a couple of suggestions.

First, the use of Option<HashSet<T>> opens a question you probably don't need to have an answer for: is there a meaningful difference between None (no collection) and Some(HashSet::new()) (a collection, which is empty)? If not, then I might suggest replacing both passwords and tags' types with HashSet<String>, and using the empty set to represent the absence of a value. (As @Bruecki says, you might even want to use Vec here, instead, especially if the order the values appear in or the possibility of duplicate values are important.)

Second, if (in the domain model represented by this XML) a title or a category is effectively mandatory, I might make those non-optional fields, as well. To handle the possibility that they are absent at parse time, use a Result, via a parse function with a signature like

fn from_document(document: &Document) -> Result<Meta, ParseError> { /* … */ }

I can offer some pointers on how to fill in the blanks, if you need them.

1 Like

It's completely valid for a document to be missing any or all of the meta fields. All the meta fields are entirely optional in the actual documents.

Thank you! These changes do make a lot of sense so I'll drop the Option on passwords and tags and replace the Sets with Vectors.

Appreciate it! :grin:
I'll slowly keep chipping away at this project so there's definitely going to be a few more questions coming!

So I incorporated the feedback I got:

use roxmltree::Document;
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq, Eq)]
struct Meta {
    // Every field in Meta is optional
    title: Option<String>,
    passwords: Vec<String>,
    tags: Vec<String>,
    category: Option<String>,
}

fn main() {
    let xml_content = r#"
    <abc>
        <head>
            <meta type="title">Title</meta>
            <meta type="password">secret</meta>
            <meta type="password">secret2</meta>
            <meta type="tag">Tag</meta>
            <meta type="category">Cat</meta>
        </head>
    </abc>
    "#;
    let doc = Document::parse(xml_content).expect("Failed to parse XML");

    let mut title: Option<String> = None;
    let mut passwords: Vec<String> = Vec::new();
    let mut tags: Vec<String> = Vec::new();
    let mut category: Option<String> = None;

    for meta in doc.descendants().filter(|n| n.has_tag_name("meta")) {
        if let Some("title") = meta.attribute("type") {
            title = match meta.text() {
                Some(text) => Some(text.to_string()),
                None => None,
            };
        }
        if let Some("category") = meta.attribute("type") {
            category = match meta.text() {
                Some(text) => Some(text.to_string()),
                None => None,
            };
        }
        if let Some("password") = meta.attribute("type") {
            if let Some(text) = meta.text() {
                passwords.push(text.to_string());
            }
        }
        if let Some("tag") = meta.attribute("type") {
            if let Some(text) = meta.text() {
                tags.push(text.to_string());
            }
        }
    }

    let mut unique_passwords: HashSet<String> = HashSet::new();
    passwords.retain(|item| unique_passwords.insert(*item));

    let mut unique_tags: HashSet<String> = HashSet::new();
    tags.retain(|item| unique_tags.insert(*item));

    let metadata = Meta {
        title,
        passwords,
        tags,
        category,
    };
    println!("{:?}", metadata);
    // Meta { title: Some("Title"), passwords: Some({"secret"}), tags: Some({"Tag"}), category: Some("Cat") }
}

Notably, I replaced Option<HashSet<String>> with Vec<String> but I still want to remove duplicates while preserving their order.
I used this article as a reference: Remove duplicates while preserving order in Rust — David Chen
But after making the change, the compiler started complaining

consider cloning the value if the performance cost is acceptable: `.clone()`
main.rs(60, 43): original diagnostic
cannot move out of `*item` which is behind a shared reference
move occurs because `*item` has type `String`, which does not implement the `Copy` main.rs(60, 43): consider cloning the value if the performance cost is acceptable: `.clone()`

If I add .clone():

    let mut unique_passwords: HashSet<String> = HashSet::new();
    passwords.retain(|item| unique_passwords.insert(*item.clone()));

I get a new error:

mismatched types
expected `String`, found `str

This error occurs because the item (a reference) is valid within the scope of the closure. When you try to insert the item into seen, you are attempting to store the reference that will not be valid after the closure finishes executing.
So Rust compiler suggests you to create a clone of the String, and insert method will be able to own it.

This is because you are attempting to dereference a String and get str, but HashSet expects String. Simply remove the * dereference and code will compile.

passwords.retain(|item| unique_passwords.insert(item.clone()));

but IMO it seems code is getting more and more complicated :thinking:

It is! :sweat_smile:
Let me explain the case here more thoroughly so it's obvious what I'm trying to do:

I'm iterating through meta and pushing each in the tags vector.

if let Some("tag") = meta.attribute("type") {
            if let Some(text) = meta.text() {
                tags.push(text.to_string());
            }

What I want to achieve here is to keep the insertion order while also avoiding duplicates.

My python equivalent code would be:

>>> tags = []
>>> meta = ["abc", "def", "abc", "gef", "gef"]
>>> for tag in meta:
...     if not tag in tags:
...         tags.append(tag)
...         
>>> tags
['abc', 'def', 'gef']

You can use a contains:

if !passwords.contains(&text) { ... } // then push

Also for title and category you probably can use a map, it saves some lines:

title = meta.text().map(String::from);

map will create a new String only in Some case and will ignore None.

1 Like

Here's my current iteration of working code:

use roxmltree::Document;

#[derive(Debug, Clone, Hash, Default, PartialEq, Eq)]
pub struct Meta {
    pub title: Option<String>,
    pub passwords: Vec<String>,
    pub tags: Vec<String>,
    pub category: Option<String>,
}

pub(crate) fn parse_metadata(nzb: &Document) -> Meta {
    let mut title: Option<String> = None;
    let mut passwords: Vec<String> = Vec::new();
    let mut tags: Vec<String> = Vec::new();
    let mut category: Option<String> = None;

    for meta in nzb.descendants().filter(|n| n.has_tag_name("meta")) {
        if let Some("title") = meta.attribute("type") {
            // Save the first title and ignore any subsequent ones
            if title.is_none() {
                title = meta.text().map(String::from);
            }
        }
        if let Some("category") = meta.attribute("type") {
            // Save the first category and ignore any subsequent ones
            if category.is_none() {
                category = meta.text().map(String::from);
            }
        }
        if let Some("password") = meta.attribute("type") {
            // Ignore duplicates while preserving the order
            if let Some(text) = meta.text() {
                if !passwords.contains(&text.to_string()) {
                    passwords.push(text.to_string());
                }
            }
        }
        if let Some("tag") = meta.attribute("type") {
            // Ignore duplicates while preserving the order
            if let Some(text) = meta.text() {
                if !tags.contains(&text.to_string()) {
                    tags.push(text.to_string());
                }
            }
        }
    }

    return Meta {
        title,
        passwords,
        tags,
        category,
    };
}


fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml_content = r#"
    <abc>
        <head>
            <meta type="title">Title</meta>
            <meta type="password">secret</meta>
            <meta type="password">secret2</meta>
            <meta type="tag">Tag</meta>
            <meta type="category">Cat</meta>
        </head>
    </abc>
    "#;
    let doc = Document::parse(xml_content)?;
    let nzb = parse_metadata(&doc);
    println!("{:?}", nzb);
    Ok(())
}

I still haven't figured out how to make these checks case insensitive:

if let Some("title") = meta.attribute("type")
if let Some("password") = meta.attribute("type")

Earlier in this thread, I saw meta.attribute("type").map(str::to_lowercase) but this did not work for me:

error[E0308]: mismatched types
  --> src/main.rs:18:21
   |
18 |         if let Some("title") = meta.attribute("type").map(str::to_lowercase) {
   |                     ^^^^^^^    --------------------------------------------- this expression has type `Option<String>`
   |                     |
   |                     expected `String`, found `&str`

For more information about this error, try `rustc --explain E0308`.

Because str::to_lowercase cannot modify the original string, it returns a new String containing the lowercased version of its input. This is why meta.attribute(…).map(str::to_lowercase) has type Option<String> even though meta.attribute(…) has type Option<&'a str> for some lifetime 'a.

Some("password"), on the other hand, has type Option<&'static str>, because the contained string literal has the type &'static str.

While String and str both expose operations that allow you to compare a String to a &str for equality without having to convert either of them, those operations do not project into the Option type. Fortunately, because String implements Deref<Target = str>, you can get an Option<&str> from an Option<String> using as_deref:

if let Some("title") = meta.attribute("type").map(str::to_lowercase).as_deref() {
    // …
}
2 Likes

Thank you for the explanation! That worked like a charm :smile:

You're welcome!

Re-reading the code, I'd also suggest removing the let there. Using a constant pattern as a way of comparing a value for equality with a constant is technically not incorrect, but it's unusual and often hard to reason about. Something like

if Some("title") == meta.attribute("type").map(str::to_lowercase).as_deref() {
  // …
}

would be considerably more conventional.

1 Like

That makes sense, looks like I confused rust book's usage of if let to match patterns and ended up using it with constants

1 Like

I now have my first working as intended rust function! Thanks to everyone who helped out :smile:

// parser.rs
use roxmltree::Document;

#[derive(Debug, Clone, Hash, Default, PartialEq, Eq)]
pub struct Meta {
    pub title: Option<String>,
    pub passwords: Vec<String>,
    pub tags: Vec<String>,
    pub category: Option<String>,
}

pub(crate) fn parse_metadata(doc: &Document) -> Meta {
    let mut title: Option<String> = None;
    let mut passwords: Vec<String> = Vec::new();
    let mut tags: Vec<String> = Vec::new();
    let mut category: Option<String> = None;

    for meta in doc.descendants().filter(|n| n.has_tag_name("meta")) {
        if Some("title") == meta.attribute("type").map(str::to_lowercase).as_deref() {
            // In case there's more than one, we save the first one
            // and ignore any subsequent ones
            if title.is_none() {
                title = meta.text().map(String::from);
            }
        }
        if Some("category") == meta.attribute("type").map(str::to_lowercase).as_deref() {
            // In case there's more than one, we save the first one
            // and ignore any subsequent ones
            if category.is_none() {
                category = meta.text().map(String::from);
            }
        }
        if Some("password") == meta.attribute("type").map(str::to_lowercase).as_deref() {
            // Remove duplicates while presering the order in which they appear
            if let Some(text) = meta.text() {
                if !passwords.contains(&text.to_string()) {
                    passwords.push(text.to_string());
                }
            }
        }
        if Some("tag") == meta.attribute("type").map(str::to_lowercase).as_deref() {
            // Remove duplicates while presering the order in which they appear
            if let Some(text) = meta.text() {
                if !tags.contains(&text.to_string()) {
                    tags.push(text.to_string());
                }
            }
        }
    }

    return Meta {
        title,
        passwords,
        tags,
        category,
    };
}
// utils.rs
use std::path::Path;
use regex::Regex;
use std::sync::LazyLock;

static XML_HEADING_RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"(?i)<\?xml\s+version.*\?>").unwrap());
static XML_DOCTYPE_RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"(?i)<!DOCTYPE.*>").unwrap());

pub(crate) fn read_xml_file<P: AsRef<Path>>(file: P) -> Result<String, std::io::Error> {
    let file = file.as_ref();
    match std::fs::read_to_string(&file) {
        Ok(content) => {
            // roxmltree doesn't like heading and doctype so we remove it
            let content = XML_HEADING_RE.replace(&content, "");
            let content = XML_DOCTYPE_RE.replace(&content, "");
            return Ok(content.trim().to_string());
        },
        Err(e) => return Err(e),
    };
}
// main.rs
mod utils;
mod parser;

use roxmltree::Document;
use utils::read_xml_file;
use parser::parse_metadata;



fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml_content = read_xml_file("tests/spec_example.xml")?;
    let doc = Document::parse(&xml_content)?;
    let parsed = parse_metadata(&doc);
    println!("{:?}", parsed );
    Ok(())
}

Here's my current code if anyone wants to chime in with feedback on things that may not be idiomatic rust or could be done better. I'm excited and will move on to parsing the rest of the xml document after this :smile:

There's a repeating structure here that you can exploit to make the code - at least to my eye - a bit easier to read, as well. The structure is this one:

Some(CONSTANT) == meta.attribute("type").map(str::to_lowercase).as_deref()

First, the expression on the right can be factored out into a variable, so that the to_lowercase calculation and the associated allocation aren't done repeatedly and unnecessarily.

Second, those branches are all mutually exclusive, because if meta.attribute("type").map(str::to_lowercase).as_deref() is equal to Some("tag"), then it cannot be equal to any of the other constants checked (and the same holds for each of the other constants).

Those two observations mean that, in spite of my comments against matching against constant patterns above, using match against some constants may produce more readable code.

What I'm thinking is this:

if let Some(type) = meta.attribute("type").map(str::to_lowercase) {
  match &type {
    "title" => {
      title = title.or(meta.text().copied());
    }
    "category" => {
      category = category.or(meta.text().copied());
    }
    "password" => {
      if let Some(text) = meta.text().copied() {
        if !passwords.contains(&text) {
          passwords.push(text);
        }
      }
    }
    "tag" => {
      if let Some(text) = meta.text().copied() {
        if !tags.contains(&text) {
          tags.push(text);
        }
      }
    }
    _ => { /* ignore any other `type`; you could put other logic here */ },
  }
}

(copied() is an Option method that is exposed when the inner type implements, among other things, ToOwned, and &str implements ToOwned to convert to String.)

I like this structure because it pulls the mechanics of figuring out which type attribute is present up and out of the way, making the code that then does something for each type clearer. It also opens some further opportunities for factoring; with a bit of creativity, the two parallel constructs in the "password" and "tag" branches of this match could be extracted, for example - if it makes sense in context to do so.

There are probably other refactorings possible, both on top of this and instead of this, as you like. A lot of this is purely aesthetic, though the fact that your existing code re-calculates the lowercase form of the type attribute's value repeatedly would bug me operationally, as well.

2 Likes

I get this error when using .copied():

error[E0599]: no method named `copied` found for enum `Option<&str>` in the current scope
  --> src\parser.rs:41:50
   |
41 |                     title = title.or(meta.text().copied());
   |                                                  ^^^^^^ `Option<&str>` is not an iterator
   |
help: call `.into_iter()` first
   |
41 |                     title = title.or(meta.text().into_iter().copied());
   |                                                  ++++++++++++

Replacing it back with .map(String::From) did fix it and the newer structure is much more concise and readable so thank you!

    for meta in nzb.descendants().filter(|n| n.has_tag_name("meta")) {
        if let Some(typ) = meta.attribute("type").map(str::to_lowercase).as_deref() {
            match typ {
                "title" => {
                    title = title.or(meta.text().map(String::from));
                }
                "password" => {
                    if let Some(text) = meta.text().map(String::from) {
                      if !passwords.contains(&text) {
                        passwords.push(text);
                      }
                    }
                  }
                "tag" => {
                    if let Some(text) = meta.text().map(String::from) {
                      if !tags.contains(&text) {
                        tags.push(text);
                      }
                    }
                  }
                "category" => {
                    category = category.or(meta.text().map(String::from));
                }
                _ => {}
            }
        }
    }

I'm thrilled to announce that I've completed my first Rust project! It's now available on crates.io. I would also appreciate any further code reviews now that it's complete.

The experience has been incredibly positive. The Rust book's quality, Rustup's ease of installation, Cargo's project management, the compiler's informative errors, and this community's patience and helpfulness have all been instrumental. I'm definitely looking forward to writing more Rust!