Rust string literal definition as Regexp

I am trying to parse Rust source code to identify String literal tokens. I am relying on the Rust Reference for String literals but I can't find a complete RegExp that would satisfy the full definition of a Rust string literal:

A string literal is a sequence of any Unicode characters enclosed within two U+0022 (double-quote) characters, with the exception of U+0022 itself, which must be escaped by a preceding U+005C character (\).

Line-breaks are allowed in string literals. A line-break is either a newline (U+000A) or a pair of carriage return and newline (U+000D, U+000A). Both byte sequences are normally translated to U+000A, but as a special exception, when an unescaped U+005C character (\) occurs immediately before the line-break, then the U+005C character, the line-break, and all whitespace at the beginning of the next line are ignored.

Is there a regexp master here who could build such a pattern?
Thanks

How about this regex?

"(\\.|[^\\"])*"

That said, I probably wouldn't use a regex for this myself. A simple loop over the characters is enough.

I'd recommend using https://crates.io/crates/rustc-ap-rustc_lexer for this task.

If you're parsing actual, arbitrary (i.e. not generated) Rust code, note that there are several contexts where " can occur without being part of a regular string literal, which will confuse the state machine:

  • in a comment
  • in a character literal
  • as the delimiter of a raw string literal, which has different rules than regular string literals
  • inside a raw string literal

These problems compound each other -- it's not enough to strip out the comments first, because, for example, "/*" is a string literal and not the beginning of a comment. Comments can also be nested, which makes the grammar context-free (not regular), so it can't be described by regular regular expressions (although it might technically be done with irregular "regular" expressions, using backreferences). Raw string literals are even worse, being context-sensitive. And that's without even considering macros...
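To make the nesting point concrete: skipping a nested block comment needs a depth counter, and that unbounded counter is exactly what a regular expression cannot keep. Here is a minimal sketch in Java (an arbitrary choice for illustration). The class and method names are made up, and it only handles block comments, not the other cases listed above.

```java
final class NestedCommentSkipper {
    // Hypothetical helper: assumes src has a block-comment opener at index start.
    // Returns the index just past the matching closer, or -1 if the comment
    // is never closed.
    static int skipBlockComment(String src, int start) {
        int depth = 0;
        int i = start;
        while (i < src.length()) {
            if (src.startsWith("/*", i)) {
                depth++;            // one more level of nesting
                i += 2;
            } else if (src.startsWith("*/", i)) {
                depth--;            // one level closed
                i += 2;
                if (depth == 0) {
                    return i;       // the outermost comment just ended
                }
            } else {
                i++;
            }
        }
        return -1;                  // ran out of input while still inside
    }
}
```

A real lexer would run this alongside the string, character and raw-string handling, because whichever construct opens first decides how the following characters are read.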

tl;dr Use a lexer.

Thanks, this worked like a charm until I had to parse a long (> 818 characters) string literal.
Since my Rust parser is actually written in Java (unfortunately I have no say in the choice of language), I am hitting a known JDK limitation that causes a StackOverflowError.
I will try the workarounds suggested in the bug ticket.
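One commonly cited workaround for this kind of overflow (whether the ticket lists it is an assumption on my part) is to make the repetition possessive and the group non-capturing. As far as I understand, java.util.regex then matches the repetition iteratively instead of recursing once per character. The sketch below also adds the (?s) flag, because in Java `.` does not match line terminators by default, so the original pattern would reject a backslash followed by a line break. The class and method names are made up for illustration, and the caveats about comments, character literals and raw strings still apply.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StringLiteralRegex {
    // Regex text: (?s)"(?:\\.|[^\\"])*+"
    //   (?s)  lets '.' match line breaks (needed for an escaped line break)
    //   (?:)  non-capturing group, nothing to record per iteration
    //   *+    possessive quantifier, never backtracks into the repetition
    static final Pattern STRING_LITERAL =
            Pattern.compile("(?s)\"(?:\\\\.|[^\\\\\"])*+\"");

    // Returns the first substring that looks like a regular string literal,
    // or null if there is none.
    static String firstLiteral(String src) {
        Matcher m = STRING_LITERAL.matcher(src);
        return m.find() ? m.group() : null;
    }
}
```

Raising the thread stack size is another option sometimes mentioned, but that only moves the limit rather than removing it.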

In this case I would recommend just using a simple loop that looks at each character one at a time.
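A minimal sketch of such a loop in Java, assuming you already know where the opening quote is and that it starts a regular (non-raw) string literal. The class and method names are made up for illustration. It uses constant stack space, so the length of the literal does not matter.

```java
final class StringLiteralScanner {
    // Hypothetical helper: assumes src.charAt(start) is the opening quote.
    // Returns the index just past the closing quote, or -1 if the literal
    // is not terminated.
    static int endOfStringLiteral(String src, int start) {
        int i = start + 1;              // first character after the opening quote
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\') {
                i += 2;                 // skip the backslash and the character it escapes
            } else if (c == '"') {
                return i + 1;           // unescaped quote closes the literal
            } else {
                i++;                    // ordinary character, line breaks included
            }
        }
        return -1;                      // reached end of input inside the literal
    }
}
```

This only finds the end of the token. It does not validate the escape sequences or apply the line-continuation rule from the Reference, which a later pass can do on the extracted text.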