How can I improve this lexer?


#1

EDIT: It doesn’t even work properly, lol.
EDIT 2: I fixed it
I’ve recently started reading Crafting Interpreters and implementing the Lox language in Rust.

Github repo with the code

I finished writing the lexer, but I’m not happy with it. The code doesn’t look good, and it’s pretty slow.

I’d say one of the reasons it’s slow is that as the characters come out of the iterator, I push them onto a buffer and then also copy the buffer into a Token. I know that’s inefficient, but I don’t know of a better way to do it. In a perfect scenario, the Token fields would just point to slices of the original source string.

So, how can I make the code more idiomatic and faster?


#2

EDIT: Amended notes about memory allocations.

I had a brief look at your code, and the issue you mentioned is one of the issues I found.

Warning: I don’t write language parsers or lexers regularly, so take the following with a grain of salt.

Issues

  • Token storing String
    • Strings are heap allocated (see below on why this may be bad)
    • Consequently, creating a Token is a (very) expensive operation
  • Lexer storing both chars: Peekable<Chars<'c>> and buffer: String
    • From your usage, they duplicate each other’s purpose

Background

Things I think you may find useful; just skip this if you already know them.

Why heap allocations may be bad

There are two ways heap allocations can hurt: the bookkeeping performance of the allocator itself, and the performance overhead on your application. Namely:

  • The more randomly sized, small allocations you do, the less able the allocator is to avoid external memory fragmentation, which will hinder its performance later on. This is only a practical issue when your process needs to run for a really long time (or forever).
  • The allocator itself can be expensive. Normally this isn’t an issue, but performance suffers in a tight loop, and in our case your Lexer::scan_next is basically a tight loop, as it goes through the source char by char.

TL;DR - Memory allocation on the heap has a high upfront cost: each allocation comes with a base cost, but the cost does not increase significantly with the amount of memory requested. So prefer occasional, large allocations over frequent, small ones.

(Stack allocations are very cheap)
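To make “occasional, large allocations” concrete, here’s a small sketch: reserving capacity up front means the buffer is allocated once and never moves, no matter how many small pushes follow.

```rust
fn main() {
    // One large allocation up front, instead of many small reallocations.
    let mut buf: Vec<u8> = Vec::with_capacity(1024);
    let ptr_before = buf.as_ptr();

    for b in 0..100u8 {
        buf.push(b);
    }

    // No reallocation happened: the buffer never moved.
    assert_eq!(buf.as_ptr(), ptr_before);
    assert_eq!(buf.len(), 100);
}
```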

Addressing the issues

Remember that the lexer never needs to modify the source (it only needs immutable access to the buffer!), so your Token never really needs to store an entire string segment by itself.
(If it only ever needed to store a single character, that would be fine, but I notice some of your token types require a string segment rather than a char, so we’ll need to figure out how to provide a “view” into the buffer.)

If you’re okay with the lifetime of Token being tied to the Lexer (or the Source or Buffer, if you separate the source out), then just store a &str inside your token.

If you need Token to have an independent lifetime, then just store the starting position and length of the string segment it refers to.

The former is arguably less expensive, as the latter incurs a bounds-checking cost whenever you access the buffer, but either way you get constant-time access to the source buffer, so it’s not really an issue.
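A minimal sketch of the first option, a Token that borrows a slice of the source instead of owning a String (the type and field names here are made up for illustration, not taken from your repo):

```rust
// A token type made up for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenType {
    Identifier,
    Number,
}

// The token borrows a view into the source; creating one allocates nothing.
#[derive(Debug, Clone, Copy)]
struct Token<'s> {
    t_type: TokenType,
    lexeme: &'s str, // points into the original source string
}

fn main() {
    let source = String::from("foo 42");

    // Both tokens' lifetimes are tied to `source`.
    let ident = Token { t_type: TokenType::Identifier, lexeme: &source[0..3] };
    let num = Token { t_type: TokenType::Number, lexeme: &source[4..6] };

    assert_eq!(ident.lexeme, "foo");
    assert_eq!(num.lexeme, "42");
    assert_eq!(ident.t_type, TokenType::Identifier);
}
```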


#3

Thank you very much for looking at my code and answering :slight_smile:

The more randomly sized, small allocations you do, the less able the allocator is to avoid external memory fragmentation, which will hinder its performance later on. This is only a practical issue when your process needs to run for a really long time (or forever).

I never thought of this, although it makes perfect sense.

If you’re okay with the lifetime of Token being tied to the Lexer (or the Source or Buffer, if you separate the source out), then just store a &str inside your token.

Although Tokens are needed in both parsing and evaluation, I think it would be OK if they were tied to the lifetime of the source. Storing &str would be the cleanest solution, I think. Correct me if I’m wrong, but in order to do that I will need to store a reference to the source in the lexer and use the char_indices() iterator instead of chars(), so I know the indexes into the source.
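Something like this is what I have in mind (a standalone sketch: it scans one identifier using char_indices(), which yields (byte_offset, char) pairs, so slicing back into the source is easy):

```rust
fn main() {
    let source = "var x";
    let mut it = source.char_indices().peekable();

    // First char of the lexeme: remember its byte offset.
    let (start, _) = it.next().unwrap();

    // Consume while the next char continues the identifier.
    while let Some(&(_, c)) = it.peek() {
        if c.is_alphabetic() {
            it.next();
        } else {
            break;
        }
    }

    // End offset is the position of the next unconsumed char,
    // or the end of the source if we ran out of input.
    let end = it.peek().map(|&(i, _)| i).unwrap_or(source.len());

    assert_eq!(&source[start..end], "var");
}
```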


#4

Yep, looks good - that’s what I would try first, and see how it goes.

Your Token::new call would then look something like

Token::new(t_type, &source[first_pos..last_pos], whatever)

I just noticed you have a literal field in your token that causes a similar performance hit. Consider changing your definition of Literal from

#[derive(Debug, Clone)]
pub enum Literal {
    Nil,
    False,
    True,
    Number(f64),
    String(String),
}

to

#[derive(Debug, Clone)]
pub enum Literal {
    Nil,
    False,
    True,
    Number(f64),
    String,
}

Doing String(&str) might cause issues, since your literal is first allocated in the scan_next() method - not really sure on this. You can still try, but it’s not really necessary: just use the &str stored in the token whenever the literal is Literal::String; there’s no need to store the &str twice.
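A sketch of what I mean - the unit String variant just tags the token, and the value is recovered from the token’s own lexeme (the Token shape and the quote-stripping helper here are made up for illustration):

```rust
// Literal is now Copy: no heap allocation when a token is created.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Literal {
    Nil,
    False,
    True,
    Number(f64),
    String, // the actual text lives in the token's lexeme
}

#[derive(Debug)]
struct Token<'s> {
    lexeme: &'s str,
    literal: Option<Literal>,
}

// Recover the string value from the lexeme when the literal tag says String.
fn string_value<'s>(t: &Token<'s>) -> Option<&'s str> {
    match t.literal {
        Some(Literal::String) => Some(t.lexeme.trim_matches('"')),
        _ => None,
    }
}

fn main() {
    let src = "\"hello\"";
    let tok = Token { lexeme: src, literal: Some(Literal::String) };
    assert_eq!(string_value(&tok), Some("hello"));

    let num = Token { lexeme: "1.5", literal: Some(Literal::Number(1.5)) };
    assert_eq!(string_value(&num), None);
}
```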

Also, I didn’t state this very clearly: memory allocations on the heap are expensive, while allocations on the stack are very cheap, since a stack allocation only requires incrementing the stack pointer (a few machine instructions).


#5

I did everything you said, and the lexer is now twice as fast, so that’s nice.
But I can’t get the parser to compile :frowning:

error[E0499]: cannot borrow `*self` as mutable more than once at a time
--> src/parser.rs:97:9
   |
89 |         if let Some(t) = self.matches(&[Bang, Minus]) {
   |                          ---- first mutable borrow occurs here
...
97 |         self.primary()
   |         ^^^^ second mutable borrow occurs here
98 |     }
   |     - first borrow ends here

It’s strange, because it used to work and I didn’t change the code at all. The only thing I changed is that I added lifetimes to Expr and Token. And I thought having references inside Token was OK :smile:


#6

So, after reading Lox’s grammar, I’m starting to get what you’re doing in your code.

There are mutable-borrow issues in a lot of places in your code, which is usually not the case for parsers, since they don’t mutate things; but in your case you’re making a streaming parser using Peekable, so some mutation is required.

But since the mutation is only restricted to the Peekable, consider changing your Parser from

pub struct Parser<'c> {
    lexer: Peekable<Lexer<'c>>,
}

to

pub struct Parser<'c> {
    lexer: RefCell<Peekable<Lexer<'c>>>,
}

Or anything with interior mutability.

Then change your functions like addition, multiplication, and unary from requiring &mut self to &self. The performance overhead of RefCell is not too terrible - basically some extra integer arithmetic at each access, or something similar (I’m not an expert on Rust internals).
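A tiny sketch of the idea (using a plain Chars iterator as a stand-in for your Lexer, and a made-up advance method): RefCell lets a &self method mutate the stream.

```rust
use std::cell::RefCell;
use std::iter::Peekable;

// Stand-in parser: the only mutable state is the stream, hidden in a RefCell.
struct Parser<'c> {
    lexer: RefCell<Peekable<std::str::Chars<'c>>>,
}

impl<'c> Parser<'c> {
    // Takes &self, not &mut self, so nested calls don't fight the borrow checker.
    fn advance(&self) -> Option<char> {
        // borrow_mut panics if already mutably borrowed, so keep borrows short.
        self.lexer.borrow_mut().next()
    }
}

fn main() {
    let p = Parser {
        lexer: RefCell::new("ab".chars().peekable()),
    };
    assert_eq!(p.advance(), Some('a'));
    assert_eq!(p.advance(), Some('b'));
    assert_eq!(p.advance(), None);
}
```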

One more note about performance: Box is also a heap allocation, so think about whether you actually need Box or not (I have no clue, as I didn’t analyse your code heavily, but boxing a lot can slow down your program, albeit not as badly as cloning strings).

I’m not sure if you can actually remove the Box, but since you’re using git, just make a new branch and give it a shot.


#7

So yep, your use of Box is correct, since Rust cannot handle recursively defined types without indirection, as it cannot calculate their sizes.

Last time I wrote a tree I used Vec (heap allocated as well), but that was for a cache, so creation time didn’t need to be very fast. In your case you need quite fast creation, and boxing a lot doesn’t help.

Try the approach here: basically, allocate nodes in a Vec, then wherever you need a pointer/reference, just store the node’s index into the Vec.

You will see a similar technique quite often in C/C++, or even Java, called a “pool allocator”: it only allows allocation of one specifically sized object, but it avoids external memory fragmentation entirely and provides very fast allocation, since most of the work is done at the start (by allocating space for, say, 2000 objects up front).

Obviously Vec is not a proper allocator, as there’s no allocation bookkeeping, but you should see that by allocating a lot from the beginning, your overall cost is much smaller.
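A sketch of the index-instead-of-Box idea, using a made-up two-variant Expr (your real Expr will differ): the tree lives in one Vec, and children are plain indices.

```rust
// Nodes reference children by index into the arena, not by Box.
#[derive(Debug)]
enum Expr {
    Number(f64),
    Add(usize, usize), // indices of the left and right operands
}

struct Arena {
    nodes: Vec<Expr>,
}

impl Arena {
    // Pushing into the Vec is the "allocation"; it returns the node's index.
    fn push(&mut self, e: Expr) -> usize {
        self.nodes.push(e);
        self.nodes.len() - 1
    }

    fn eval(&self, i: usize) -> f64 {
        match self.nodes[i] {
            Expr::Number(n) => n,
            Expr::Add(l, r) => self.eval(l) + self.eval(r),
        }
    }
}

fn main() {
    // Reserve space up front, pool-allocator style.
    let mut arena = Arena { nodes: Vec::with_capacity(64) };

    // Build the tree for 1.0 + 2.0.
    let a = arena.push(Expr::Number(1.0));
    let b = arena.push(Expr::Number(2.0));
    let root = arena.push(Expr::Add(a, b));

    assert_eq!(arena.eval(root), 3.0);
}
```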

Again, memory allocation has a fairly high base cost (b), and an additional cost (t) that does not increase significantly with the size of the allocation, so think of (b + t) + (b + t) + ... vs b + (t + t + t + ...). Boxing individually is closer to the first cost model; pool allocation is closer to the second.

Caveat - logically this makes things faster, but current state-of-the-art memory allocators (e.g. jemalloc) make allocations very fast - maybe not as fast as pool allocators, but do verify at the end whether memory allocation was causing the actual slowdown, or whether there were other underlying issues.