Any suggestions/improvements for my lexer?


#1

I’m currently porting a compiler from Java (euk) to Rust to aid my learning. I’d really appreciate suggestions for making my code more efficient, more readable, or generally just more ‘rusty’.

You can find it at https://gist.github.com/chickencoder/72cf548bba84ab8bad7d35aa8a4ed3a9

Many Thanks
:smile:


#2

Nice clean code! Here are some Rust-specific suggestions:

  1. buf.len() is the number of bytes in buf, not the number of chars, so your offsets might be wrong. For example, line 43 may sometimes panic (it could try to slice into the middle of a character). Consider iterating over bytes instead of chars to fix this (no ASCII byte will ever appear in the middle of a multi-byte UTF-8 sequence, so this is safe for your purposes). This is confusing, so please ask questions.
  2. On line 44, you don’t need the Vec. You can just use a fixed-size array.
  3. On line 46, why .eq and not ==? (FYI, unlike Java, Rust has operator overloading, and == is sugar for the eq method.) As a matter of fact, you can just use the .contains(value) method here.
  4. Unless you plan on modifying tokens before you return them, you should consider returning slices into buf instead of allocating a new string per token. This will be significantly faster. Token would become pub struct Token<'a> { pos: usize, val: &'a str }.
  5. Consider deriving traits (Debug, Clone, Copy, Eq, etc.) where appropriate. See https://doc.rust-lang.org/book/traits.html#deriving
  6. Rust has a break keyword so you don’t need the cont variable in the loop on line 63.
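
To make points 1 to 3 concrete, here is a small sketch. It is not taken from the gist: the OPERATORS table and all function names are made up for illustration, and your real token set will differ.

```rust
/// Hypothetical operator table: a fixed-size array instead of a Vec (point 2).
const OPERATORS: [&str; 4] = ["+", "-", "*", "/"];

/// `contains` replaces a manual loop that compares with `.eq` (point 3).
fn is_operator(s: &str) -> bool {
    OPERATORS.contains(&s)
}

/// Returns (length in bytes, length in chars); the two differ for non-ASCII text (point 1).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Returns everything before the first ASCII space.
/// Scanning bytes yields a byte offset, and because a space is ASCII it can
/// never sit inside a multi-byte UTF-8 sequence, so the slice cannot panic.
fn first_word(s: &str) -> &str {
    match s.bytes().position(|b| b == b' ') {
        Some(end) => &s[..end],
        None => s,
    }
}
```

For example, byte_and_char_len("héllo") gives 6 bytes but 5 chars, which is exactly the kind of mismatch that makes a char count unsafe to use as a slice index.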

Once you’re done, you should (unless your requirements change) be able to parse without allocating.
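
Here is a rough sketch of what points 4 to 6 might look like together. I'm assuming tokens are just whitespace-separated runs, which is much simpler than your real rules, and the Lexer type and its fields are hypothetical names, not the ones in your gist.

```rust
/// A token that borrows its text from the input instead of owning a String.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token<'a> {
    pub pos: usize,   // byte offset of the token in the input
    pub val: &'a str, // slice of the input, no allocation
}

/// Hypothetical lexer that hands out tokens lazily as an iterator.
pub struct Lexer<'a> {
    buf: &'a str,
    pos: usize,
}

impl<'a> Lexer<'a> {
    pub fn new(buf: &'a str) -> Self {
        Lexer { buf, pos: 0 }
    }
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Token<'a>;

    fn next(&mut self) -> Option<Token<'a>> {
        let bytes = self.buf.as_bytes();
        // Skip leading ASCII whitespace.
        while self.pos < bytes.len() && bytes[self.pos].is_ascii_whitespace() {
            self.pos += 1;
        }
        if self.pos == bytes.len() {
            return None;
        }
        let start = self.pos;
        // `break` ends the scan directly; no separate `cont` flag is needed.
        loop {
            if self.pos == bytes.len() || bytes[self.pos].is_ascii_whitespace() {
                break;
            }
            self.pos += 1;
        }
        // Both ends sit next to an ASCII byte or the end of the input,
        // so they are valid char boundaries and the slice cannot panic.
        Some(Token { pos: start, val: &self.buf[start..self.pos] })
    }
}
```

You would drive it with something like `for tok in Lexer::new(src) { ... }`. Nothing in next allocates, because each token is only an offset plus a borrowed slice.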


#3

That’s really helpful! Thanks a lot


#4

So I notice you’re not using iterators.

Here’s a lexer that I wrote for DNS txt records, it uses a lookahead of 1 with the Peekable iterator:

https://github.com/bluejekyll/trust-dns/blob/master/src/serialize/txt/master_lex.rs

I’m not sure it’s the best that it could be, but instead of doing some of the string matching you do, I try to use the State struct to hold that. The Peekable is the thing the Lexer holds, rather than the array. I think this should make it more portable across different input methods as well.
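
To show the general shape of that approach, here is a minimal sketch. It is not the trust-dns code: the Token and State enums and the token rules (identifiers, integers, single-char symbols) are invented for illustration.

```rust
use std::iter::Peekable;

/// Hypothetical token type for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Token {
    Number(String),
    Ident(String),
    Symbol(char),
}

/// Hypothetical lexer state: the state, rather than string matching,
/// decides how the next char is treated.
enum State {
    Start,
    Number(String),
    Ident(String),
}

/// The lexer holds a Peekable over any char iterator, not a buffer.
pub struct Lexer<I: Iterator<Item = char>> {
    chars: Peekable<I>,
}

impl<I: Iterator<Item = char>> Lexer<I> {
    pub fn new(input: I) -> Self {
        Lexer { chars: input.peekable() }
    }

    pub fn next_token(&mut self) -> Option<Token> {
        let mut state = State::Start;
        loop {
            // Look one char ahead without consuming it.
            let next = self.chars.peek().copied();
            state = match (state, next) {
                (State::Start, None) => return None,
                (State::Start, Some(c)) if c.is_whitespace() => {
                    self.chars.next();
                    State::Start
                }
                (State::Start, Some(c)) if c.is_ascii_digit() => {
                    self.chars.next();
                    State::Number(c.to_string())
                }
                (State::Start, Some(c)) if c.is_alphabetic() || c == '_' => {
                    self.chars.next();
                    State::Ident(c.to_string())
                }
                (State::Start, Some(c)) => {
                    self.chars.next();
                    return Some(Token::Symbol(c));
                }
                (State::Number(mut s), Some(c)) if c.is_ascii_digit() => {
                    self.chars.next();
                    s.push(c);
                    State::Number(s)
                }
                // Any other char (or end of input) ends the number; it stays unconsumed.
                (State::Number(s), _) => return Some(Token::Number(s)),
                (State::Ident(mut s), Some(c)) if c.is_alphanumeric() || c == '_' => {
                    self.chars.next();
                    s.push(c);
                    State::Ident(s)
                }
                (State::Ident(s), _) => return Some(Token::Ident(s)),
            };
        }
    }
}
```

Because Lexer is generic over the iterator, the same code can be fed from `"x1 = 42;".chars()` or from any other source of chars. The trade-off in this sketch is that tokens own their text, so it does allocate, unlike the slice-based approach suggested earlier in the thread.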

You have a great start there! And I agree with everything in the previous comment.