    use lazy_regex::regex;

    pub(crate) fn sanitize_xml(xml: impl Into<String>) -> String {
        let xml = xml.into();
        let xml_heading_re = regex!(r"(?i)<\?xml\s+version.*\?>");
        let xml_doctype_re = regex!(r"(?i)<!DOCTYPE.*>");
        let content = xml_heading_re.replace(&xml, "");
        let content = xml_doctype_re.replace(&content, "");
        content.trim().to_string()
    }
Is this appropriate? What can I improve here? The regex crate docs specifically call out compiling a regex inside a function on every call as an anti-pattern, so after a bit of googling I ended up with lazy-regex.
Your regexes don’t seem to be dynamically constructed so they could be factored out of the function so they don’t have to be compiled on every call. (That’s what the antipattern-ness is about.) But whether it’s worth it really depends on how often the function is called.
Yes, it does: lazy-regex's regex! macro compiles each regex only once, on first use. lazy-regex also provides some other niceties, but note that you can use std::sync::LazyLock as a dependency-free way of compiling the regex only once, while still defining the regexes local to the function that uses them.
What kinds of improvements are you looking for? Your code looks pretty reasonable to me. It looks like something I would write in a context where I didn't care about perf. I'm also not an XML expert, and I'm not sure whether the replacements here are sufficient to "sanitize" XML.
Regex::replace accepts a &str so you may want to change the xml parameter to &str. This may avoid an allocation in the Into.
Regex::replace returns a Cow<str> so you may want to return the same type. Depending how you use the return value, this may save an allocation.
But if you expect that these patterns should occur as the first items in an XML document (as they typically do), you can rewrite this to look for each pattern only at the start of the document and, if found, split it from the rest of the input &str. This saves scanning the entire XML document at the cost of changing the behavior if these patterns occur later in the document.
    use lazy_regex::regex;

    pub(crate) fn sanitize_xml(xml: &str) -> &str {
        // Anchored with ^ so each pattern is only tried at the start of the input.
        let xml_heading_re = regex!(r"^(?i)<\?xml\s+version.*\?>");
        let xml_doctype_re = regex!(r"^(?i)<!DOCTYPE.*>");
        let mut content = xml.trim();
        // `match` is a reserved keyword, so the binding is named `m`.
        if let Some(m) = xml_heading_re.find(content) {
            content = &content[m.end()..];
            content = content.trim_start();
        }
        if let Some(m) = xml_doctype_re.find(content) {
            content = &content[m.end()..];
            content = content.trim_start();
        }
        content
    }
I'm open to any and all feedback. The function does work; I just wasn't sure whether it's idiomatic. I'm slowly working through my first Rust project, and the community here has been very welcoming and helpful. The above function is just a small part of it. The string being XML is not too important, and the sanitization is certainly not enough for XML in general, but it is enough for the XML I'm working with. My question is mostly about writing an idiomatic function that accepts a string, does some regex replacing, and returns the changed string.