How to do Regexp matching

I'm doing multiple Regexp matches to parse a line from a logfile, so each match looks to extract different information from different lines. The following is what I have so far and I'm wondering if there's a neater way (using match on the 'captures' for example) than my if...else expresions which are rapidly getting out of hand? Thanks.

///! Capture state updates and return true if metrics were updated
pub fn parse_states(&mut self, entry: &LogEntry) -> bool {
	let mut updated = false;
	let re = Regex::new(
		r"^.*vault\.rs.*No. of Elders: (?P<elders>\d+)|(?x)
  (?-x)^.*vault\.rs.*No. of Adults: (?P<adults>\d+)|(?x)
  (?-x)^.*vault\.rs.*Initializing new Vault as (?P<initas>[[:alpha:]]+)|(?x)
  (?-x)^.*vault\.rs.*Vault promoted to (?P<promoteto>[[:alpha:]]+)(?x)",
	)
	.expect("Woops"); // TODO: make the expression a static (see LOG_LINE_PATTERN)

	if let Some(captures) = re.captures(entry.logstring.as_str()) {
		let elders = captures.name("elders").map_or("", |m| m.as_str());
		if !elders.is_empty() {
			self.parser_output = format!("ELDERS: {}", elders);
			updated = true
		} else {
			let adults = captures.name("adults").map_or("", |m| m.as_str());
			if !adults.is_empty() {
				self.parser_output = format!("ADULTS: {}", adults);
				updated = true
			} else {
				let agebracket = captures
					.name("initas")
					.or_else(|| captures.name("promoteto"))
					.map_or("", |m| m.as_str());
				self.parser_output = format!("Vault agebracket: {}", agebracket);
				if !agebracket.is_empty() {
					self.parser_output = format!("Vault agebracket: {}", agebracket);
					self.agebracket = match agebracket {
						"Child" => VaultAgebracket::Child,
						"Adult" => VaultAgebracket::Adult,
						"Elder" => VaultAgebracket::Elder,
						_ => VaultAgebracket::Unknown,
					};
					updated = true
				}
			}
		}
	}
	updated
}

I'm not sure if I've translated the code correctly (you've provided code that doesn't compile and has no tests), so you'll want to check that, but this is probably how I'd write it:

///! Capture state updates and return true if metrics were updated
pub fn parse_states(&mut self, entry: &LogEntry) -> bool {
    let mut updated = false;
    lazy_static! {
        static ref RE: Regex = Regex::new(r"(?x)
            ^.*vault\.rs.*No.\ of\ Elders:\ (?P<elders>\d+)
            |^.*vault\.rs.*No.\ of\ Adults:\ (?P<adults>\d+)
            |^.*vault\.rs.*Initializing\ new\ Vault\ as\ (?P<initas>[[:alpha:]]+)
            |^.*vault\.rs.*Vault\ promoted\ to\ (?P<promoteto>[[:alpha:]]+)(?x)"
        )
        .unwrap();
    }
    let caps = match re.captures(entry.logstring.as_str()) {
        None => return false,
        Some(caps) => caps,
    };
    
    if !caps["elders"].is_empty() {
    if let Some(elders) = caps.name("elders").map(|m| m.as_str()) {
        self.parser_output = format!("ELDERS: {}", elders);
        return true;
    }
    if let Some(adults) = caps.name("adults").map(|m| m.as_str()) {
        self.parser_output = format!("ADULTS: {}", adults);
        return true;
    }
    // This unwrap is safe since either initas or promoteto must have participated
    // in a match. Since if we're here, we know that neither elders or adults
    // participated in the match.
    let agebracket = caps.name("initas").or_else(|| caps.name("promoteto")).unwrap().as_str();
    self.parser_output = format!("Vault agebracket: {}", agebracket);
    self.agebracket = match agebracket {
        "Child" => VaultAgebracket::Child,
        "Adult" => VaultAgebracket::Adult,
        "Elder" => VaultAgebracket::Elder,
        _ => VaultAgebracket::Unknown,
    };
    true
}

The key technique here is to use early returns instead of nesting if statements ad infinitum. This is a generally useful technique across all programming languages, but is particularly useful in Rust.

Also, I fixed up your regex to avoid toggling back and forth between x mode. Instead, you can just escape spaces directly. Also, you have some unescaped . characters in your regex, and I suspect you meant to match a literal . instead of "any character." In that case, you want \. and not ..

3 Likes

Thanks, and yes you got it! Obvious solution to avoid the creeping indentation: return true :blush:

I agree the unwrap is safe now, but if someone [innocent face] comes back later and adds another 'clause' below this it is no longer safe, leaving a trap for the future no?

Your if !caps["elders"].is_empty() { lines don't work (they panic on lines without the named capture), but are redundant I think. I've now added parsing of the values to usize with error handling and the code is much improved (see below). It is still growing a bit long, but much easier to understand and modify IMO, so thanks very much. I have yet to write any tests (more learning still to do, so thanks for the nudge on that too).

///! Capture state updates and return true if metrics were updated
pub fn parse_states(&mut self, entry: &LogEntry) -> bool {
	let mut updated = false;
	let re = Regex::new(
		r"^.*vault\.rs.*No. of Elders: (?P<elders>\d+)|(?x)
		(?-x)^.*vault\.rs.*No. of Adults: (?P<adults>\d+)|(?x)
		(?-x)^.*vault\.rs.*Initializing new Vault as (?P<initas>[[:alpha:]]+)|(?x)
		(?-x)^.*vault\.rs.*Vault promoted to (?P<promoteto>[[:alpha:]]+)(?x)",
	)
	.expect("Woops"); // TODO: make the expression a static (see LOG_LINE_PATTERN)

	let captures = match re.captures(entry.logstring.as_str()) {
		Some(captures) => captures,
		None => return false,
	};

	if let Some(elders_str) = captures.name("elders").map(|m| m.as_str()) {
		self.parser_output = format!("ELDERS: {}", elders_str);
		match elders_str.parse::<usize>() {
			Ok(elders) => {
				self.elders = elders;
				return true;
			}
			Err(e) => {
				self.parser_output = format!("Error, invalid elders value '{}'", elders_str);
				return false;
			}
		}
	}

	if let Some(adults_str) = captures.name("adults").map(|m| m.as_str()) {
		self.parser_output = format!("ADULTS: {}", adults_str);
		match adults_str.parse::<usize>() {
			Ok(adults) => {
				self.adults = adults;
				return true;
			}
			Err(e) => {
				self.parser_output = format!("Error, invalid adults value '{}'", adults_str);
				return false;
			}
		}
	}

	if let Some(agebracket) = captures
		.name("initas")
		.or_else(|| captures.name("promoteto"))
		.map(|m| m.as_str())
	{
		self.parser_output = format!("Vault agebracket: {}", agebracket);
		self.agebracket = match agebracket {
			"Child" => VaultAgebracket::Child,
			"Adult" => VaultAgebracket::Adult,
			"Elder" => VaultAgebracket::Elder,
			_ => {
				self.parser_output = format!("Error, unkown vault agedbracket '{}'", agebracket);
				VaultAgebracket::Unknown
			}
		};
		return true;
	}

	false
}

Ah sorry about that! Forgot to remove them. The program given to me didn't compile, so I couldn't check if the program I gave back to you compiled without spending more effort than I wanted to. :wink: In any case, I had written caps["elders"] before I had seen the alternation in your regex. And when I noticed that, that was when I rewrote the regex to be (IMO) more readable.

In the future, when asking for help, it is always greatly appreciated to provide source code that compiles. Even better would be a playground link!

Yup, that's absolutely a valid concern and a potentially good reason to restructure the code in a way that doesn't admit future footguns. Whether it's appropriate in this case or not is up to your judgment. If you think it won't be uncommon to change the regex and the parsing code, then yeah, making it a bit more robust to more additions sounds like a great idea to me. Otherwise, writing tests for these kinds of routines should be very easy, so I might rely on those instead to catch bugs.

Also, popping up a level, you might consider making this code a bit more testable by breaking it out into a helper function with a signature like the following:

fn parse_states(log: &str) -> Result<Option<String>, ParseError>

That way you don't need to have what object self is and test changes made to its state. Instead, it's just a simple function with more constrained inputs and outputs. (I might not have gotten the signature exactly right. You might need something more than a simple String, e.g., to account for the age bracket stuff, but you get my gist.)

1 Like

I hadn't spotted the Regex expression change, I'll grab that :smile:

I've not got as far as tests or playground stuff yet, thanks for your help and suggestions.

PS Your Regex still works BTW :clap:

1 Like