Regex: How to remove matching lines completely?

I would like to use Regex::replace_all to match certain expressions and, under some circumstances, remove a matching line completely.

However, if I do this naively, it leaves empty lines behind, as in the example below:

use regex::{Captures, Regex};

fn main() {
    let s1 = "\
        FOO\n\
        This is some multi-line text.\n\
        FOO\n\
        And this is the last line.\n\
        FOO\n\
    ";
    // The following will leave empty lines:
    let s2 = Regex::new(r#"(?m)^FOO$"#).unwrap().replace_all(s1, "");
    // This does an unnecessary allocation:
    /*
    let s2 = Regex::new(r#"(^|\n)FOO(\n|$)"#).unwrap().replace_all(
        s1,
        |caps: &Captures| caps[1].to_owned(),
    );
    */
    // And this is just plain ugly:
    /*
    let s2 = Regex::new(r#"(^|\n)FOO(\n|$)"#).unwrap().replace_all(
        s1,
        |caps: &Captures| {
            match &caps[1] {
                "" => "",
                "\n" => "\n",
                _ => unreachable!(),
            }
        }
    );
    */
    println!("{s2}");
}


Output:


This is some multi-line text.

And this is the last line.



What's the best way to do this without ending up with ugly expressions or complex code?

Regex::new(r#"FOO\n"#)

Doesn't work:

use regex::Regex;

fn main() {
    let s1 = "\
        TATAFOO\n\
        This is some multi-line text.\n\
        TATAFOO\n\
        And this is the last line.\n\
        TATAFOO\n\
    ";
    // This strips "FOO\n" from the end of every line, even when the line
    // contains other text, so neighboring lines get glued together:
    let s2 = Regex::new(r#"FOO\n"#).unwrap().replace_all(s1, "");
    println!("{s2}");
}


Output:

TATAThis is some multi-line text.
TATAAnd this is the last line.
TATA

Your requirements are very unclear. Another option may be Regex::new(r#".*FOO\n"#).

Regarding the toy example: I want to remove lines that are "FOO" (i.e. remove the line if there is only "FOO" but no other content on that line).

My real use-case is a bit more complex (I will have a condition under which the line gets removed), but I would like to try to solve the easy case first.


If a line consists of "TATAFOO", I don't want to touch it.

The proposal using r#"FOO\n"# will mess up the output if any line ends with "FOO", and the proposal using r#".*FOO\n"# will remove any line that ends in "FOO", which I also don't want.
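
For the conditional case described above, one option that nobody in this thread proposed is to skip regex for the removal step and filter lines directly with the standard library. The sketch below is only an illustration under that assumption; the helper name remove_lines and its should_remove parameter are hypothetical.

```rust
/// Drops every line for which `should_remove` returns true, together with
/// its line terminator. The predicate sees the line without "\n" or "\r\n".
/// `remove_lines` is a hypothetical helper, not part of any library.
fn remove_lines(text: &str, should_remove: impl Fn(&str) -> bool) -> String {
    text.split_inclusive('\n')
        .filter(|&line| {
            // Strip the terminator before handing the line to the predicate.
            let content = line.strip_suffix('\n').unwrap_or(line);
            let content = content.strip_suffix('\r').unwrap_or(content);
            !should_remove(content)
        })
        .collect()
}
```

Calling remove_lines(s1, |line| line == "FOO") removes only the lines that are exactly "FOO" and leaves a line such as "TATAFOO" untouched. The predicate could itself call Regex::is_match if the condition is easier to express as a pattern.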

Regex::new(r#"(?m:^FOO)\r?(\n|$)"#)


Thanks, very nice!

It occurred to me that (?m)^FOO$\n? is a potentially nicer-looking, mostly-equivalent version. [1]

How to think of all this

^ and $ are zero-width[2] matchers. You can think of them as matching the positions between two characters, represented by _ below. Normally they match the beginning-of-text and end-of-text. In multi-line mode, ^ matches the beginning-of-text or after a newline, and $ matches the end-of-text or before a newline. [3]

# (?m)^FOO$
_ F _ O _ O _ \n _ A _ \n _ F _ O _ O _ \n _ B _ \n _ F _ O _ O _
^ F   O   O $             ^ F   O   O $             ^ F   O   O $
~~~~~~~~~~~~~             ~~~~~~~~~~~~~             ~~~~~~~~~~~~~

# (?m:^FOO)(\n|$)
_ F _ O _ O _ \n _ A _ \n _ F _ O _ O _ \n _ B _ \n _ F _ O _ O _
^ F   O   O   \n          ^ F   O   O   \n          ^ F   O   O $
~~~~~~~~~~~~~~~~          ~~~~~~~~~~~~~~~~          ~~~~~~~~~~~~~

# (?m)^FOO$\n?
_ F _ O _ O _ \n _ A _ \n _ F _ O _ O _ \n _ B _ \n _ F _ O _ O _
^ F   O   O $ \n          ^ F   O   O $ \n          ^ F   O   O $
~~~~~~~~~~~~~~~~          ~~~~~~~~~~~~~~~~          ~~~~~~~~~~~~~

So with (?m)^FOO$ you were leaving blank lines because the end-of-line \ns weren't part of the matches. My first reply replaced (?m:$) with something equivalent[4] that included the newline if present,[5] (?-m:\n|$). This reply instead just looks for a newline after the required zero-width position (?m:$).


  1. (?mR)^FOO$\r?\n? or (?m)^FOO\r?$\n? if you care to support \r\n style newlines.

  2. aka empty

  3. Without the R flag, \n is the only recognized newline; with it, $ matches before \r\n as well.

  4. -ish (modulo newline stuff)

  5. so not necessarily zero width


While I think Lua's patterns are less powerful than ordinary regular expressions, they provide something called a frontier pattern that's described in the Patterns section of the Lua manual:

%f[set], a frontier pattern; such item matches an empty string at any position such that the next character belongs to set and the previous character does not belong to set. The set set is interpreted as previously described. The beginning and the end of the subject are handled as if they were the character '\0'.

(Side note: this sometimes makes dealing with null-bytes difficult in Lua, if you use these.)

It looks like ^ and $ in multi-line mode are such frontier patterns, except that in Rust they only work for fixed line terminators: \n, or \r\n with the R flag. I assume the regex crate doesn't have other zero-width matchers?


:thinking: … maybe this is what the non-capturing group can do in Rust's regex.

I think it's just those special cases. The general regex terms for arbitrary zero-width matchers are look-ahead / look-behind / look-around. From the docs:

The regex syntax supported by this crate is similar to other regex engines, but it lacks several features that are not known how to implement efficiently. This includes, but is not limited to, look-around and backreferences. In exchange, all regex searches in this crate have worst case O(m * n) time complexity, where m is proportional to the size of the regex and n is proportional to the size of the string being searched.

Not sure how feasible a fixed width lookaround feature is, or if the lower level crates could support it... fun area to research potentially. (On mobile so I didn't try to find anything preexisting.)


There are also \A and \z which match the beginning and end of the input even in multiline mode, \b which matches the start and end of a word (as defined by \w and \W), and \B which matches every position not a word boundary.

Ah yeah -- I meant the special cases I linked to before [1] and not just ^ and $.


  1. albeit in one of these inline footnote dealies

Related: the line terminator used by (?m:^) and (?m:$) is now customizable to any byte value via RegexBuilder::line_terminator.

Also, (?Rm:^) and (?Rm:$) both treat \r and \n as line terminators but will not match between a \r and \n.

(Both of these features were added in regex 1.9.)

