I wrote this in response to @newpavlov's comment in the thread on Modern C++ memory safe, but realized it got a bit more sprawling than I intended and figured it might make as a nice topic for a new thread.
The main context here was discussion about, I think, improving the reliability of programs through voluntary restrictions on how the code for those programs is written. This particular restriction is about recursion. The main pain point of recursion is that if the call depth is somehow connected to the size of untrusted or user input, then it's likely possible to craft inputs that result in a stack overflow and thus possibly provoke a denial of service vulnerability.
Since the regex crate tries to make itself suitable for untrusted haystacks and untrusted regexes, its parser goes out of its way to avoid the aforementioned recursion pitfall. It does this via two methods:
- When parsing the concrete syntax of a regex into an
Ast
, it never uses recursion and instead pushes what would be on the call stack to an explicit stack on the heap. - During parsing, it checks the "nesting depth" of the resulting
Ast
. If it's too big, then an error is returned. An important invariant of this check is that the size of theAst
is proportional to the size of the concrete syntax. (For example, you can't write\pL{100000}
and get a monstrousAst
. TheAst
doesn't expand repetitions and doesn't even try to resolve\pL
(which is huge on its own) into a sequence of non-overlapping codepoint ranges.)
The regex-syntax
crate also tries to avoid recursion in other places too, but is pretty hit or miss. For example, Ast->Hir translation doesn't use recursion, but Hir->NFA does. The Hir printer doesn't use recursion, but literal extraction does. The Display
impl for Hir doesn't use recursion, but its Debug
, Clone
and PartialEq
impls do. So now I'm considering removing the recursion mitigations on Hir
at least, and just leaving them in place for the Ast
. The idea here is that since we avoid recursion and check a nesting depth, maybe that's good enough.
Anyway, what follows is my original reply.
Since I was pinged with respect to restrictions in regex-syntax
regarding recursion, I'll add a few quick notes.
I have personally found that avoiding recursion to be quite annoying. The most natural representation for things like an AST is via a recursive data type. And many of the algorithms on those types are most naturally conceived via structural induction on the type. The difference in code complexity between structural recursion and using something like a visitor is enormous. The AST->HIR translator in particular is pretty gruesome.
Avoiding recursion is annoying enough that once the AST is parsed and translated into the HIR, I've generally tended to just use recursion. In particular, for literal extraction and for converting the HIR into an NFA. (There is a heap based visitor for the HIR which I've used in some places, e.g., for a printer, but it has largely been a failure because I hate using it.) The reason why this works is because a nest limit is imposed. Namely, since recursion is not used in the initial parsing of the AST and because parsing will fail if the AST is too deeply nested, then using recursion downstream from that tends to be okay. There's certainly no guarantee of course, because who's to say that my arbitrarily chosen nest depth has anything to do with the stack size configured by the caller? Nada! But... the heuristic seems to work in practice, and my evidence of that is "people haven't complained." (Which is maybe not great evidence.) It really only gets tested if people are compiling untrusted regexes, and that is quite a precarious thing to do, even if the parser guards against overflow.
On top of all of that, despite ensuring that the Drop
and Display
impls of Hir
are implemented without recursion, its Clone
and Debug
and PartialEq
impls are not. Seems kind of like a bad situation if you ask me. :-/
So what's my point? I don't know. I don't know if I have a strong point to make here other than that mitigating against recursion is supremely annoying, difficult to get right and I'm not totally clear on how much it helps when compared against simpler heuristics such as a nest depth.
(There are also different patterns to recursion other than a Visitor
-like trait, and maybe investigating some of those approaches would help make things easier. And yes, I've read this post on recursion schemes in Rust, but it has its own downsides and appears to be strongly motivated by perf, which isn't as much of a concern in regex-syntax
.)