EDIT: Amended notes about memory allocations.
I had a brief look at your code, and the issue you mentioned is one of the issues I found.
Warning: I don't write language parsers or lexers regularly, so take the following with a grain of salt.
Issues
- `Token` storing `String`
  - `String`s are heap allocated (see below on why this may be bad)
  - Consequently, creating a `Token` is a (very) expensive operation
- `Lexer` storing both `chars : Peekable<Chars<'c>>` and `buffer : String`
  - From your usage, they serve the same purpose
Background
Things I think you may find useful; just skip this part if you already know it.
Why heap allocations may be bad
There are two ways to look at the cost of heap allocations: the bookkeeping performance of the allocator itself, and the performance overhead on your application. Namely:
- The more small, randomly sized allocations you make, the less able the allocator is to avoid external memory fragmentation, which hinders the allocator's performance later on. This is only a practical issue when your process needs to run for a really long time (or forever).
- Calling the allocator can itself be expensive. This is normally not an issue, but performance will suffer in a tight loop, and in our case your `Lexer::scan_next` is basically a tight loop, as it goes through the source char by char.
TL;DR: Memory allocation on the heap has a high upfront cost. That is, each allocation comes with a base cost, but the cost does not increase significantly with the amount of memory requested. So prefer occasional, large allocations to frequent, small ones.
(Stack allocations are very cheap)
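As a rough illustration of the "one large allocation" advice, here is a minimal sketch. `fill_reserved` is a made-up helper, not something from your code: it reserves the whole buffer once with `String::with_capacity` and then checks that pushing into it never moves the backing storage.

```rust
/// Hypothetical helper: pushes `n` copies of `c` into a `String` whose
/// capacity was reserved up front, and reports whether the backing
/// buffer's address changed along the way.
fn fill_reserved(c: char, n: usize) -> (String, bool) {
    // One allocation, sized for everything we are about to push.
    let mut out = String::with_capacity(n * c.len_utf8());
    let start = out.as_ptr();
    let mut moved = false;
    for _ in 0..n {
        // Stays within the reserved capacity, so no reallocation is needed.
        out.push(c);
        moved |= out.as_ptr() != start;
    }
    (out, moved)
}
```

Without the `with_capacity` call the `String` would start empty and grow in steps, going back to the allocator several times for the same final contents.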
Addressing the issues
Remember that a lexer never really needs to modify the source (it only needs immutable access to the buffer!), so your `Token` never really needs to own a copy of the string segment.
(If it only ever needed to store a character, then whatever, but I notice some of your token types need a string segment rather than a single char, so we'll need to figure out how to provide a "view" into the buffer.)
If you're okay with the lifetime of `Token` being tied to `Lexer` (or `Source`, or `Buffer`, if you separate the source out), then just store `&str` inside your token.
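A minimal sketch of the `&str` approach, under some assumptions: the `TokenKind` variants, the field names and the scanning rules below are invented for illustration and will not match your actual grammar. The point is only that the lexer borrows the source once and every token is a slice of it.

```rust
/// Hypothetical token kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Ident,
    Number,
    Symbol,
}

/// A token that borrows its text from the source instead of owning it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Token<'c> {
    kind: TokenKind,
    text: &'c str,
}

/// Replaces both `chars` and `buffer` with one borrowed slice and an offset.
struct Lexer<'c> {
    src: &'c str,
    pos: usize, // byte offset of the next unread char
}

/// Length in bytes of the longest prefix of `s` whose chars satisfy `pred`.
fn prefix_len(s: &str, pred: impl Fn(char) -> bool) -> usize {
    s.find(|ch: char| !pred(ch)).unwrap_or(s.len())
}

impl<'c> Lexer<'c> {
    fn new(src: &'c str) -> Self {
        Lexer { src, pos: 0 }
    }

    /// Returns the next token, or `None` at end of input. No heap allocation.
    fn scan_next(&mut self) -> Option<Token<'c>> {
        let src: &'c str = self.src;
        let rest = &src[self.pos..];
        // Skip leading whitespace and account for the skipped bytes.
        let trimmed = rest.trim_start();
        self.pos += rest.len() - trimmed.len();

        let first = trimmed.chars().next()?;
        let (kind, len) = if first.is_alphabetic() || first == '_' {
            let len = prefix_len(trimmed, |ch| ch.is_alphanumeric() || ch == '_');
            (TokenKind::Ident, len)
        } else if first.is_ascii_digit() {
            (TokenKind::Number, prefix_len(trimmed, |ch| ch.is_ascii_digit()))
        } else {
            // Anything else is treated as a one-char symbol in this sketch.
            (TokenKind::Symbol, first.len_utf8())
        };
        self.pos += len;
        Some(Token { kind, text: &trimmed[..len] })
    }
}
```

Calling `scan_next` in a loop over a source like `"let x1 = 42;"` hands back slices of that same string, and the borrow checker makes sure no `Token` outlives the source it points into.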
If you need `Token` to have an independent lifetime, then just store the starting position and length of the string segment it needs to refer to.
The former is arguably less expensive, as the latter incurs a bounds-checking cost whenever you access the buffer. Either way you still have constant-time access to your source buffer, so it's not really an issue.
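The position-and-length variant can be sketched as follows. Again the names are assumptions: `Source`, `TokenKind` and the `text` method are hypothetical, and I use byte offsets so that lookup is a plain slice operation.

```rust
/// Hypothetical token kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Ident,
    Number,
}

/// A token with no lifetime parameter: it only remembers where its text is.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Token {
    kind: TokenKind,
    start: usize, // byte offset into the source buffer
    len: usize,   // length in bytes
}

/// Owns the source text; tokens are resolved against it on demand.
struct Source {
    buffer: String,
}

impl Source {
    /// Looks up the text a token refers to. Returns `None` if the range is
    /// out of bounds or does not fall on UTF-8 char boundaries.
    fn text(&self, token: Token) -> Option<&str> {
        let end = token.start.checked_add(token.len)?;
        self.buffer.get(token.start..end)
    }
}
```

The `get` call is where the bounds check happens. A `Token` here is just three small `Copy` fields, so it can be stored or passed around freely, but it is only meaningful together with the `Source` it was produced from.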