Looking for feedback on my first tokenizer in Rust

Hey everyone, I've been trying to learn about building programming languages and thought I'd make it doubly hard for myself and also learn Rust at the same time :sweat_smile:

I've been learning about the lexing/tokenizing process, and so I'm first focusing on building a small program which can handle arithmetic expressions.

I'll share the code for the tokenizer here (the file is around 60 LOC) if anyone has any feedback, and also some of my own notes as well.

#[derive(Debug)]
enum TokenType {
    INT,
    ADD,
    SUB,
    MUL,
    DIV,
    LPAREN,
    RPAREN,
}

#[derive(Debug)]
struct Token {
    token_type: TokenType,
    token_value: String,
}

fn main() {
    let tokens = tokenize("42 + 123");
    println!("{:?}", tokens);
}

fn tokenize(expr: &str) -> Vec<Token> {
    let mut tokens: Vec<Token> = vec![];
    let mut current_number = String::new();

    for c in expr.chars() {
        if c.is_ascii_digit() {
            current_number.push(c);
        } else {
            if !current_number.is_empty() {
                tokens.push(build_token(TokenType::INT, current_number.clone()));
                current_number.clear();
            }
        }

        match c {
            ' ' => continue,
            '(' => tokens.push(build_token(TokenType::LPAREN, String::from("("))),
            ')' => tokens.push(build_token(TokenType::RPAREN, String::from(")"))),
            '+' => tokens.push(build_token(TokenType::ADD, String::from("+"))),
            '-' => tokens.push(build_token(TokenType::SUB, String::from("-"))),
            '*' => tokens.push(build_token(TokenType::MUL, String::from("*"))),
            '/' => tokens.push(build_token(TokenType::DIV, String::from("/"))),
            _ => continue,
        }
    }

    if !current_number.is_empty() {
        tokens.push(build_token(TokenType::INT, current_number));
    }

    tokens
}

fn build_token(token_type: TokenType, token_value: String) -> Token {
    Token {
        token_type,
        token_value,
    }
}

Some things I've been thinking about where it could be improved:

  • Not entirely sure I need the enum for the types, or rather maybe I should use something else here?
  • Tokenizer itself feels a little messy with the if and match both being used. But I couldn't think of another way to handle multiple digits/numbers otherwise.
  • I'm parsing everything as a string which I think makes sense.
  • Thinking about if I was to extend this with other tokens like keywords, variables, etc then this feels like it could get real messy real fast. Are lexers/tokenizers really implemented this way with large match statements? Maybe there's some refactoring I could do somehow?

As mentioned, I'm very new to Rust so I'm focusing less on the intricacies of the language and memory model right now, and trying to work out the best way to express the logic for the tokenizer. But I also understand that the more I understand Rust, the likely the more idiomatic I can write it.

If anyone manages to take a look at this post, thank you!

It’s quite normal to have an enum for your tokens. What isn’t as common is what you’ve done where you also include strings. This can be useful in a lexer which has the goal of fully preserving the input text (which might use &str instead, for efficiency) but without that goal, it would be better to store only the token enum. The only time you need a string in the token is when it’s a string-literal token or similar.

#[derive(Debug)]
enum Token {
    Int(u128),
    Add,
    Sub,
    Mul,
    Div,
    LParen,
    RParen,
}

You can write

match c {
    '0'..='9' => { ... add c to digit accumulator ... }

Another common pattern is to call a function to read the entire number. This requires you use expr.chars().peekable() as your iterator, so the number reader can peek ahead to see if there is a non-digit ahead, without consuming it.

Yes.

Some quick observations:

  1. Your Token model could be improved. You have a mandatory token_value field, but of all the token types, only INT needs associated data.

I would model it like this instead:

#[derive(Debug)]
enum Token {
    INT {
      value: usize
    },
    ADD,
    SUB,
    MUL,
    DIV,
    LPAREN,
    RPAREN,
}

Lexers can get quite complicated, very quickly. Yes, you will always need a sum type (enums, in Rust) to represent a union of all the possible tokens, but parsing step isn't always as easy as iterating through a list of characters.

So I don't need to store the values of the token (as it's clear by the type of variant specified in the enum), except for digits right?

I'll also take a look at refactoring it with the function you suggested. Thank you!

I didn't know I could specify values to put in enums like that, so thank you for the code example.

Lexers can get quite complicated, very quickly. Yes, you will always need a sum type (enums, in Rust) to represent a union of all the possible tokens, but parsing step isn't always as easy as iterating through a list of characters.

It's good to know I'm not too far off then, haha.

Appreciate the reply :slight_smile:

I tend to write my tokenizers as "pull", using a next() method to return the next token type, remembering the positions of the start and end of the token:

struct Tokenizer {
    pub source: String, // generally, actually a str but I'm simplifying
    pub start: usize,
    pub pos: usize,
}

impl Tokenizer {
    ...

    pub fn next(&mut self) -> Token {
        self.start = self.pos;
        let Some(c) = self.source.get(self.pos) else {
            return Token::End;
        };
        self.pos += c.utf8_len(); // or use source.as_bytes()
        match c {
            ' ' | '\t' | '\r' | '\n' => Token::Space,
            '0'..='9' => {
                while let Some(c) = self.source.get(self. pos)
                    && c.is_ascii_digit()
                {
                    self.pos += c.utf8_len();
                }
                Token::Int
            }
            ...
            _ => Token::Invalid,
        }
    }

    pub fn slice(&self) -> &str {
        &self.source[self.start..self.pos]
    }
}

This approach means you can easily make highlighters and incrementally tokenize, including in more complex languages where lexing is context dependant (eg language injection, JavaScript's "/" being either division or the start of a regex). This means you can even not bother keeping anything but the token start and just retokenize from there if you need the source/length, eg. if error reporting, which turns out to be faster than keeping them around.

You might notice I don't use Option or Result as is common, I tend to find it makes common uses of a lexer more complicated with little value. Another trick is keeping the last Token as a field as a "peek", though I find that makes returning the token from next confusing when reading the using code. Just not returning the token makes the API read quite well in my experience.

Hey, thanks for sharing this, feel like I can learn a lot from it!

I need to get a better understanding I think of tokenizers in general to really appreciate the benefits from your approach, but once that knowledge is there and I try out something more complex, I'll definitely be coming back to this for another look (and likely refactor haha)

Oh, yeah. I basically never end up with quite the same code, depending on exactly what I'm needing. For example, I sometimes have a next_all() that steps over each individual token including "triva" like whitespace and comments used for e.g. highlighting, and a next() that calls that, skipping over trivia that's not used by the parser.

I should also mention Logos — Parser // Lib.rs which is a very cool library to create very fast and powerful lexers, but I do find I have to wrap it up to get an ergonomic enough API for my parsers anyway. Give it a try too!