EDIT: Amended notes about memory allocations.
I had a brief look at your code, and the issue you mentioned is one of the issues I found.
Warning: I don’t write language parsers or lexers regularly, so take the following with a grain of salt.
- Strings are heap allocated (see below on why this may be bad)
  - Consequently, creating a Token is a (very) expensive operation
- Lexer stores both chars : Peekable<Chars<'c>> and buffer : String
  - From your usage, they duplicate each other in purpose
Things I think you may find useful, but just skip if you already know.
Why heap allocations may be bad
There are two ways to look at heap allocations: the bookkeeping performance of the allocator itself, and the performance overhead on your application. Namely:
- The more small, randomly sized allocations you make, the less able the allocator is to avoid external memory fragmentation, which degrades the allocator's performance later on. This is only a practical issue when your process has to keep running for a really long time (or forever).
- The allocator itself can be expensive. Normally this is not an issue, but performance will suffer in a tight loop, and in your case Lexer::scan_next is basically a tight loop, as it goes through the source char by char.
TL;DR: Heap allocation has a high upfront cost. Each allocation comes with a base cost, but that cost does not increase significantly with the amount of memory requested. So prefer occasional, large allocations to frequent, small ones.
(Stack allocations are very cheap)
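To make the TL;DR a bit more concrete, here is a rough sketch contrasting "one fresh String per word" with "one pre-sized scratch buffer that gets reused". The function names and the whitespace-splitting stand-in for a lexer are made up for this illustration and are not taken from your code.

```rust
/// Allocates a new String for every word: one heap allocation per token.
fn lengths_with_fresh_strings(source: &str) -> Vec<usize> {
    source
        .split_whitespace()
        .map(|word| {
            let owned: String = word.to_string(); // heap allocation each time
            owned.len()
        })
        .collect()
}

/// Reuses one scratch buffer: the buffer is allocated once, then only copied into.
fn lengths_with_reused_buffer(source: &str) -> Vec<usize> {
    // A word can never be longer than the whole source,
    // so this capacity is never exceeded.
    let mut scratch = String::with_capacity(source.len());
    let mut lengths = Vec::new();
    for word in source.split_whitespace() {
        scratch.clear(); // drops the contents, keeps the capacity
        scratch.push_str(word); // fits in the existing capacity, no reallocation
        lengths.push(scratch.len());
    }
    lengths
}

fn main() {
    let source = "let answer = 42 ;";
    // Both versions compute the same thing; they differ only in how often
    // they go to the allocator.
    assert_eq!(
        lengths_with_fresh_strings(source),
        lengths_with_reused_buffer(source)
    );
}
```

How much this matters in practice depends on the allocator and the input, so measure before and after rather than taking my word for it.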
Addressing the issues
Remember that the lexer never really needs to modify the source (it only needs immutable access to buffer!), so your Token never really needs to store the entire string segment by itself.
(If it only ever needs to store a character, then whatever, but I notice some of your token types require knowing a string segment rather than a char, so we’ll need to figure out how to provide a “view” into the buffer).
If you’re okay with the lifetime of Token being tied to that of the source buffer (or of Buffer, if you separate the source out), then just store a &str inside your token.
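A minimal sketch of the borrowed variant, assuming a made-up TokenKind enum and a whitespace-splitting lex_borrowed function in place of your real scanning logic (your actual token types will differ):

```rust
// Hypothetical token kinds, only for this illustration.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Ident,
    Number,
}

// The token borrows its text from the source; no String is allocated per token.
#[derive(Debug, PartialEq)]
struct Token<'src> {
    kind: TokenKind,
    text: &'src str,
}

// Stand-in for the real lexer: splits on whitespace and classifies each word.
fn lex_borrowed(source: &str) -> Vec<Token<'_>> {
    source
        .split_whitespace()
        .map(|text| {
            let kind = if text.chars().all(|c| c.is_ascii_digit()) {
                TokenKind::Number
            } else {
                TokenKind::Ident
            };
            Token { kind, text }
        })
        .collect()
}

fn main() {
    let source = String::from("foo 42 bar");
    let tokens = lex_borrowed(&source);
    assert_eq!(tokens[0].text, "foo");
    assert_eq!(tokens[1].kind, TokenKind::Number);
    // `tokens` cannot outlive `source`; the borrow checker enforces this.
}
```

The only heap allocation left here is the Vec holding the tokens; each Token is just a tag plus a pointer and a length.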
If you need Token to have an independent lifetime by itself, then just store the starting position and length of the string segment it needs to refer to.
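A sketch of the position-and-length variant, again with made-up names (SpanToken, lex_spans) and whitespace splitting standing in for your real scanning rules:

```rust
// The token owns no text and borrows nothing: just byte offsets into the source.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SpanToken {
    start: usize, // byte offset of the first character
    len: usize,   // length in bytes
}

impl SpanToken {
    // Resolve the token back to its text; the caller supplies the source.
    // The slice indexing here is where the bounds check happens.
    fn text<'src>(&self, source: &'src str) -> &'src str {
        &source[self.start..self.start + self.len]
    }
}

// Stand-in for the real lexer: records the span of each whitespace-separated word.
fn lex_spans(source: &str) -> Vec<SpanToken> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in source.char_indices() {
        if c.is_whitespace() {
            // End of a word, if we were inside one.
            if let Some(s) = start.take() {
                tokens.push(SpanToken { start: s, len: i - s });
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Flush a word that runs up to the end of the source.
    if let Some(s) = start {
        tokens.push(SpanToken { start: s, len: source.len() - s });
    }
    tokens
}

fn main() {
    let source = "foo 42 bar";
    let tokens = lex_spans(source);
    assert_eq!(tokens[1].text(source), "42");
}
```

Since SpanToken carries no lifetime, you can keep the tokens around, send them elsewhere, or store them next to the source without fighting the borrow checker. The price is that you have to hand the right source back in whenever you want the text.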
Storing a &str is arguably less expensive, as the position-and-length approach has a bounds-checking cost whenever you access the buffer. You still get constant-time access to your source buffer either way, so it’s not really an issue.