Request for review: easy extraction of parts of a string using regexes

Hi all. This request for review is less about the regex parts (I already know there are problems, e.g. I'm not escaping special regex characters) and more about how I've gone about doing it in Rust. I would like to understand how I can turn this into more idiomatic Rust code.

extern crate regex;
use regex::Regex;
use std::collections::HashMap;

fn extract_tokens(input: &str, pattern: &str) -> HashMap<String, String> {
    let mut pattern_str = String::from(pattern);

    let token_re = Regex::new(r"\{(.*?)\}").unwrap();

    for cap in token_re.captures_iter(pattern) {
        let cap = String::from(&cap[0]);

        let repl = &format!("(?P<{}>.+)", &cap[1..cap.len()-1]);

        pattern_str = pattern_str.replace(&cap, repl);
    }

    let mut out = HashMap::new();

    let group_re = Regex::new(&pattern_str).unwrap();
    for group in group_re.captures_iter(input) {
        for name in group_re.capture_names() {
            match name {
                Some(n) => {
                    out.insert(
                        String::from(n),
                        String::from(group.name(n).unwrap().as_str()),
                    );
                }
                None => {}
            }
        }
    }

    out
}

fn main() {
    println!(
        "{:?}",
        extract_tokens(
            "My name is Inigo Montoya",
            "My name is {first_name} {last_name}"
        )
    );
}

// outputs: {"last_name": "Montoya", "first_name": "Inigo"}

You should stuff your regex into a once_cell::sync::Lazy or a lazy_static:

use once_cell::sync::Lazy;
static TOKEN_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"\{(.*?)\}").unwrap());

You should also split out compiling the pattern and extracting the tokens; that way you can reuse the compiled pattern for multiple extractions (something like the sketch below).
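
Very roughly (TokenPattern is just a made-up name for illustration, not an existing type):

use std::collections::HashMap;
use regex::Regex;

struct TokenPattern {
    group_re: Regex,
}

impl TokenPattern {
    // Compile the pattern once...
    fn new(pattern: &str) -> Result<Self, regex::Error> {
        let token_re = Regex::new(r"\{(.*?)\}")?;
        // Turn each "{name}" placeholder into a named capture group.
        let pattern = token_re.replace_all(pattern, "(?P<$1>.+)");
        Ok(TokenPattern {
            group_re: Regex::new(&pattern)?,
        })
    }

    // ...and reuse it for as many inputs as you like.
    fn extract<'a>(&self, input: &'a str) -> HashMap<String, &'a str> {
        let caps = match self.group_re.captures(input) {
            Some(caps) => caps,
            None => return HashMap::new(),
        };
        self.group_re
            .capture_names()
            .flatten()
            .filter_map(|name| caps.name(name).map(|m| (name.to_owned(), m.as_str())))
            .collect()
    }
}

// let pat = TokenPattern::new("My name is {first_name} {last_name}")?;
// println!("{:?}", pat.extract("My name is Inigo Montoya"));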

(on a separate note: it seems strange that regex::Matches doesn't have a way to query the name of a match, or that regex::Captures doesn't have a way to iterate over all the capture groups together with their names)

Neat. Thanks!

Does this make it so that TOKEN_RE is only initialized once and then the same object remains in memory for the duration of the runtime?

Exactly, it will be initialized the first time it's used, and never again.
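
If you want to see that for yourself, here's a toy example:

use once_cell::sync::Lazy;

static GREETING: Lazy<String> = Lazy::new(|| {
    println!("initializing"); // runs exactly once, on first access
    String::from("hello")
});

fn main() {
    println!("{}", *GREETING); // prints "initializing", then "hello"
    println!("{}", *GREETING); // prints only "hello"
}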

A couple more comments, in no particular order:

  • extern crate is not needed anymore.
  • You have looots of indentation; the look of the code should generally be cleaned up.
  • let cap = String::from(&cap[0]); allocates superfluously.
  • It is wasteful to convert the pattern to a String if the pattern does not have any capturing groups. Use a Cow.
  • It also seems wasteful to copy the entire string for each replacement. Use find() and push_str() onto the same string buffer.
  • Similarly, you can return a HashMap<String, &str>, since the matches only need to be views into the original input. With a bit of additional trickery, you could even remember the subslices of the original pattern that hold the token names and use those as the keys, which would also enable you to return a HashMap<&'b str, &'a str> (note the distinct lifetimes).
  • cap[1..cap.len() - 1] assumes that 1 and len() - 1 are valid UTF-8 boundaries, which is not true unless the first and the last characters of the string are ASCII. (Are you trying to strip the {} delimiters? It's better to use the trim_*() methods, e.g. trim_matches(), for that.)
  • But anyway, Regex has a replace_all() method. You can just use it instead of implementing the replacement and copying manually.
  • .+ matches one or more of any character, including spaces, which might not work the way you expect. See the second example in the playground, and the small illustration after this list.
  • You can rewrite the let mut + for…in + match combination as a single iterator chain using filter_map() and collect().
  • You should handle the case where the input pattern is invalid. I'm still thinking of a concrete way to abuse the fact that you are building a regular expression from arbitrary dynamic input, but it certainly seems like a dangerous idea.
  • The code doesn't currently handle the case of unbalanced braces, either.
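
To make the .+ point concrete, here's a tiny illustration (using a slightly longer input than yours):

use regex::Regex;

fn main() {
    // This is the regex your code generates for "My name is {first_name} {last_name}".
    let re = Regex::new(r"My name is (?P<first_name>.+) (?P<last_name>.+)").unwrap();
    let caps = re.captures("My name is Inigo Montoya the Third").unwrap();

    // .+ is greedy and also matches spaces, so the first group grabs as much as it can:
    assert_eq!(&caps["first_name"], "Inigo Montoya the");
    assert_eq!(&caps["last_name"], "Third");
}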

Here's a playground with the improved code, and I reproduce extract_tokens() below:

fn extract_tokens<'a>(input: &'a str, pattern: &str) -> Result<HashMap<String, &'a str>, Error> {
    static TOKEN_RX: Lazy<Regex> = Lazy::new(|| Regex::new(r"\{(.*?)\}").unwrap());

    let pattern = TOKEN_RX.replace_all(pattern, "(?P<$1>.+)");
    let group_rx = Regex::new(&pattern)?;

    let caps = match group_rx.captures(input) {
        Some(captures) => captures,
        None => return Ok(HashMap::new()),
    };
    let res = group_rx
        .capture_names()
        .filter_map(
            |name| name.and_then(
                |name| caps.name(name).map(|value| (name.into(), value.as_str()))
            )
        )
        .collect();

    Ok(res)
}
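
Calling it looks much like before, except you now have a Result to deal with. Assuming the usual imports (regex::{Error, Regex}, once_cell::sync::Lazy, std::collections::HashMap), a minimal main() could look like this:

fn main() -> Result<(), Error> {
    let tokens = extract_tokens(
        "My name is Inigo Montoya",
        "My name is {first_name} {last_name}",
    )?;
    // Prints {"first_name": "Inigo", "last_name": "Montoya"} in some order
    // (HashMap iteration order is unspecified).
    println!("{:?}", tokens);
    Ok(())
}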
