Iterate over captures, by either name or number?

Hi all,

Is there a way to iterate through all captures, whether they are named capture groups or regular (numbered) ones? The captures_iter method returns an iterator, but how do I extract the values when dealing with both capture group numbers and names?

Ex:

use regex::Regex;

fn main() {
    let re = Regex::new(r"'(?P<title>[^']+)'\s+\((\d{4})\)").unwrap();
    let text = "'Citizen Kane' (1941), 'The Wizard of Oz' (1939), 'M' (1931).";
    for caps in re.captures_iter(text) {
        //println!("Movie: {:?}, Released: {:?}", &caps["title"], &caps["year"]);
        println!("caps={:#?}", caps);
    }
}

Should I use the enumerate() iterator method to keep a counter?

Thanks for any hint.

I don't really understand your question or what problem you're trying to solve. It might be good to try and state what you're trying to do at a higher level.

Otherwise, have you looked at the docs for the Captures type? If not, please do. If so, why don't any of those methods help you?

Doc link: https://docs.rs/regex/1.4.2/regex/struct.Captures.html


@BurntSushi Thanks for replying. I did look at the docs several times, and I'm actually using your crate in one of my projects.

Maybe a more specific example would be better:

use regex::Regex;

fn main() {
    let re = Regex::new(r"^(\w+) (\w+) (\w+) (?P<LASTNAME>\w+)").unwrap();
    let text = "President John Fitzgerald Kennedy";
    
    let caps = re.captures(text).unwrap();
    println!("caps={:#?}", caps);
    
    // I want to iterate and get:
    // (1, "President")
    // (2, "John")
    // (3, "Fitzgerald")
    // ("LASTNAME", "Kennedy")
    
    // here (1, "President") is not really a Rust tuple, but you get the idea

}

I ended up with the following, but I feel it is sub-optimal:

for (i, cg_name) in re.capture_names().enumerate() {
    match cg_name {
        None => {
            if let Some(cg) = caps.get(i) {
                println!("({},{})", i, cg.as_str());
            }
        }
        Some(cap_name) => println!("({},{})", cap_name, caps.name(cap_name).unwrap().as_str()),
    };
}

Well, that doesn't really tell me the problem you're trying to solve. But I guess if you need that specific output, then your code looks about right to me. Why do you feel it is sub-optimal?

Now, for your particular regex, every single capture group is guaranteed to be present in a match, so there are a few case analyses that you can safely omit:

    for (i, cg_name) in re.capture_names().enumerate() {
        match cg_name {
            None => {
                println!("({},{})", i, &caps[i]);
            }
            Some(cap_name) => {
                println!("({},{})", cap_name, &caps[cap_name]);
            }
        }
    }

With that said, your code was already using caps.name(cap_name).unwrap().as_str(), which is equivalent to &caps[cap_name] (both panic if the group did not match), while at the same time checking the return value of caps.get(i). Doing one and not the other doesn't really make sense. (Unless, I guess, you know that every named group must participate in a match while some unnamed groups might not.) So, to handle it fully correctly for any regex, you would need to do:

    for (i, cg_name) in re.capture_names().enumerate() {
        match cg_name {
            None => {
                if let Some(m) = caps.get(i) {
                    println!("({},{})", i, m.as_str());
                }
            }
            Some(cap_name) => {
                if let Some(m) = caps.name(cap_name) {
                    println!("({},{})", cap_name, m.as_str());
                }
            }
        }
    }

Another approach:

    // Requires `use std::collections::HashMap;` at the top of the file.
    let mut index_to_name: HashMap<usize, String> = HashMap::new();
    for (i, name) in re.capture_names().enumerate() {
        if let Some(name) = name {
            index_to_name.insert(i, name.to_string());
        }
    }
    for (i, group) in caps.iter().enumerate() {
        let m = match group {
            None => continue,
            Some(m) => m,
        };
        if let Some(name) = index_to_name.get(&i) {
            println!("({},{})", name, m.as_str());
        } else {
            println!("({},{})", i, m.as_str());
        }
    }

But that requires allocating a hash map, and it's not really that much simpler. Really depends on what you're doing...
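If the map allocation is the concern, one possible variant is to zip the two iterators, since both capture_names() and caps.iter() walk the groups in index order starting at group 0. The following is only a sketch: `label_groups` is a hypothetical helper name, and it is written against plain iterators (with hand-built stand-in data in main) so that it compiles without the regex crate. With the real crate you would pass `re.capture_names()` and `caps.iter().map(|m| m.map(|m| m.as_str()))`.

```rust
/// Pairs every participating group with a label: its name if it has one,
/// otherwise its index. `label_groups` is a hypothetical helper, not part
/// of the regex crate.
fn label_groups<'n, 't>(
    names: impl Iterator<Item = Option<&'n str>>,
    groups: impl Iterator<Item = Option<&'t str>>,
) -> Vec<(String, &'t str)> {
    names
        .zip(groups)
        .enumerate()
        .filter_map(|(i, (name, text))| {
            // Skip groups that did not participate in the match.
            let text = text?;
            let label = match name {
                Some(n) => n.to_string(),
                None => i.to_string(),
            };
            Some((label, text))
        })
        .collect()
}

fn main() {
    // Stand-ins for `re.capture_names()` and the texts from `caps.iter()`
    // for the regex and input used earlier in this thread.
    let names = vec![None, None, None, None, Some("LASTNAME")];
    let groups = vec![
        Some("President John Fitzgerald Kennedy"), // group 0: whole match
        Some("President"),
        Some("John"),
        Some("Fitzgerald"),
        Some("Kennedy"),
    ];
    for (label, text) in label_groups(names.into_iter(), groups.into_iter()) {
        println!("({},{})", label, text);
    }
}
```

Like the loops above, this also reports group 0 (the whole match); add `.skip(1)` after `.enumerate()` if you only want the explicit groups.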

Again, I don't see what is sub-optimal about any of this.

Thanks a lot for your detailed answer.

I need to store all captures in a hashmap for later use. In my example it'll be something like this:

("Var1", "President")
("Var2", "John")
("Var3", "Fitzgerald")
("LASTNAME", "Kennedy")

where each line above is a (key, value) pair to insert into the hash map.

I said sub-optimal because I thought there might be a simpler, more elegant solution that I couldn't find.
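For the hash map described above, a minimal sketch could look like the following. The helper name `captures_to_map` and the "Var{index}" key scheme for unnamed groups are assumptions taken from the example keys in this post, and group 0 (the whole match) is skipped because it does not appear in the desired output. The function takes plain iterators so it compiles on its own; with the regex crate you would feed it `re.capture_names()` and `caps.iter().map(|m| m.map(|m| m.as_str()))`.

```rust
use std::collections::HashMap;

/// Collects participating groups into a map keyed by group name, or by
/// "Var<index>" for unnamed groups. Hypothetical helper, not a regex API.
fn captures_to_map<'n, 't>(
    names: impl Iterator<Item = Option<&'n str>>,
    groups: impl Iterator<Item = Option<&'t str>>,
) -> HashMap<String, String> {
    names
        .zip(groups)
        .enumerate()
        .skip(1) // assumption: group 0 (the whole match) is not wanted
        .filter_map(|(i, (name, text))| {
            let key = match name {
                Some(n) => n.to_string(),
                None => format!("Var{}", i),
            };
            // Groups that did not participate are left out of the map.
            Some((key, text?.to_string()))
        })
        .collect()
}

fn main() {
    // Stand-in data mirroring the "President John Fitzgerald Kennedy" example.
    let names = vec![None, None, None, None, Some("LASTNAME")];
    let groups = vec![
        Some("President John Fitzgerald Kennedy"),
        Some("President"),
        Some("John"),
        Some("Fitzgerald"),
        Some("Kennedy"),
    ];
    let vars = captures_to_map(names.into_iter(), groups.into_iter());
    println!("Var2 = {:?}", vars.get("Var2"));
    println!("LASTNAME = {:?}", vars.get("LASTNAME"));
}
```

One thing to watch with this key scheme: a named group that happens to be called, say, Var1 would collide with the generated key for unnamed group 1, so a different prefix may be safer in real code.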