Regular expression

How do you do a regular expression which will capture a multiple line comment? I have tried various things but nothing is working.

let re = Regex::new(m"/*.**/").unwrap();
let re = Regex::new(m"\/\*.*\*\/").unwrap();

I can't figure it out. :frowning:

Once upon a time a programmer had a problem, so he decided to use a regular expression. Now he has two problems.

playground

You need to enable the mode in which . also matches newlines (the s flag; the m multi-line flag only changes how ^ and $ behave). Flags aren't set by putting an m before the regex string literal. Regex has no magic knowledge of Rust syntax, nor does Rust have such magical knowledge of regexes. If you want to set a flag for the regex engine, you need to put it into the string describing the regex. This is also covered in the documentation of the crate.


Something like this should work

let re = Regex::new(r"(?s)/\*.*\*/").unwrap();

The (?s) flag should make . match newlines as well, so the pattern can span multiple lines. The r"..." raw-string syntax should make it so that characters don't need escaping for Rust's sake, only the ones that have relevance to regex (such as *, which does need to be escaped).

Edit: fixed to use the correct flag character for multi-line matching, thanks to @H2CO3's answer.


If this is for comments in Rust code, note that multiline comments can be nested (this isn't the case for multiline comments in most other languages), which can't be parsed with a strictly regular grammar:

/*
comment
/*
nested comment
this doesn't end the whole comment like it would in C: */
still a comment
*/
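For what it's worth, nesting is easy to handle by hand with a depth counter. The following is only a minimal sketch using the standard library (the function name find_nested_comment is made up for illustration, and it ignores things like string literals that contain comment markers):

```rust
/// Returns the byte range of the first complete block comment in `src`,
/// treating `/*` ... `*/` pairs as nestable (as Rust does).
/// Returns None if there is no comment or if it is never closed.
/// Hypothetical helper for illustration only.
fn find_nested_comment(src: &str) -> Option<(usize, usize)> {
    let bytes = src.as_bytes();
    let start = src.find("/*")?;
    let mut depth = 0usize;
    let mut i = start;
    while i + 1 < bytes.len() {
        match (bytes[i], bytes[i + 1]) {
            // An opener increases the nesting depth.
            (b'/', b'*') => {
                depth += 1;
                i += 2;
            }
            // A closer decreases it; depth 0 means the outermost comment ended.
            (b'*', b'/') => {
                depth -= 1;
                i += 2;
                if depth == 0 {
                    return Some((start, i));
                }
            }
            _ => i += 1,
        }
    }
    None
}
```

Slicing the input with the returned range, e.g. &src[start..end], gives the whole outer comment including any nested ones.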

Also, using the greedy repetition .* for the inner text is incorrect. It should be the lazy .*?.
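To see why, here is a rough std-only sketch (it does not use the regex crate) that mimics which span each pattern would select when . is allowed to match newlines. The function names greedy_span and lazy_span are made up for illustration: the greedy .* runs on to the last */ in the input, while the lazy .*? stops at the first one.

```rust
/// Mimics `(?s)/\*.*\*/`: from the first `/*` through the LAST `*/`.
/// Hypothetical helper, std string search only.
fn greedy_span(src: &str) -> Option<&str> {
    let start = src.find("/*")?;
    // Search after the opener so that "/*/" does not count as a comment.
    let rel_end = src[start + 2..].rfind("*/")?;
    Some(&src[start..start + 2 + rel_end + 2])
}

/// Mimics `(?s)/\*.*?\*/`: from the first `/*` through the FIRST `*/` after it.
/// Hypothetical helper, std string search only.
fn lazy_span(src: &str) -> Option<&str> {
    let start = src.find("/*")?;
    let rel_end = src[start + 2..].find("*/")?;
    Some(&src[start..start + 2 + rel_end + 2])
}
```

With an input such as "/* a */ mov r1, r2 /* b */", the greedy version swallows the code between the two comments, whereas the lazy version returns only "/* a */".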


Agreed. Trying to lex any nontrivial language with regexes is probably a terrible idea nowadays. If the OP intends to use this for anything serious, a hand-crafted lexer or maybe even a parser library would be a better fit, such as:

For just lexing you wouldn't need syn, just proc-macro2; syn is only for parsing already-lexed tokens. Alternatively, one could use rustc_lexer, which is the lexer used by rustc itself and so is slightly less macro-oriented than proc-macro2 (however, it has no stability guarantees).

I'm actually using a lexer called Logos. It doesn't have to be too robust since I'm using it to create an assembler for my own 8-bit CPU, so it is trivial! :grin:

I appreciate all the advice, and the links.

Initially the only thing I could find in the documentation was the "m" for multiline.

m multi-line mode: ^ and $ match begin/end of line

The penny only dropped when I saw the example drewkett and you made. Reading the documentation again, I realized the section above was telling me how to use it. I probably still wouldn't have understood that I should be using ?s without the example, though.

(exp)          numbered capture group (indexed by opening parenthesis)
(?P<name>exp)  named (also numbered) capture group (allowed chars: [_0-9a-zA-Z])
(?:exp)        non-capturing group
(?flags)       set flags within current group
(?flags:exp)   set flags for exp (non-capturing)

Thanks
