Problems writing a working lexer in Rust

I am writing code for my own programming language, and I keep running into a problem where my code runs forever with no error message. Here's my code (and yes, I know my style of importing is weird; don't comment on that):

use crate::tokenz::TOKENS;

pub enum Operator {
    Plus,
    Minus,
    Division,
    Multiplication,
    Lparanthesis,
    Rparanthesis,
    Float(f64),
    Int(i64),
}


pub mod tokenz {
    pub static mut TOKENS: Vec<crate::Operator> = vec![]; 
}

#[derive(Debug, Clone)]
pub struct Lexer {
    pub text: Vec<char>,
    pub pos: usize,
    pub current_char: char,
}
impl Lexer {
    pub fn advance(mut self) {
        self.pos += 1;
        self.current_char = self.text[self.pos];
    }
    pub fn make_number(&mut self) -> Operator {
        const DIGIT: &str = "0123456789";
        let mut num_str = String::new();
        let mut dot_count = 0;
        while DIGIT.find(self.current_char).is_some() || self.current_char == '.' {
            if self.current_char == '.' {
                if dot_count == 1 {
                    break;
                }
                dot_count += 1;
            } else {
                num_str.push(self.current_char);
                self.clone().advance();
            }
        }
        if dot_count == 0 {
            return Operator::Int(num_str.parse::<i64>().unwrap())
        } else {
            return Operator::Float(num_str.parse::<f64>().unwrap())
        }

    }
    pub fn make_tokens(&mut self) {
        const DIGIT: &str = "0123456789";
        while self.current_char != '\0' {
                match self.current_char {
                ' ' => self.clone().advance(),
                '+' => {
                    unsafe { TOKENS.push(Operator::Plus) };
                    self.clone().advance();
                }
                '-' => {
                    unsafe { TOKENS.push(Operator::Minus) };
                    self.clone().advance();
                }
                '*' => {
                    unsafe { TOKENS.push(Operator::Multiplication) };
                    self.clone().advance();
                }
                '/' => {
                    unsafe { TOKENS.push(Operator::Division) };
                    self.clone().advance();
                }
                '(' => {
                    unsafe { TOKENS.push(Operator::Lparanthesis) };
                    self.clone().advance();
                }
                ')' => {
                    unsafe { TOKENS.push(Operator::Rparanthesis) };
                    self.clone().advance();
                }
                _ => {
                    if DIGIT.find(self.current_char).is_some() {
                        self.make_number();
                    } else {
                        self.clone().advance();
                    }
                }
            }
        }
    }
}

use std::{io};
// use std::{fs, io, io::Read};

fn main() {
    /*
    println!("welcome to loxable shell, please input file name.");
    let mut filename = String::new();
    io::stdin().read_line(&mut filename).expect("not a valid input. (please dont use emojis or etc)");
    let filename = filename.trim();
    let mut args = String::new();
    let mut file = fs::File::open(filename).unwrap();
    file.read_to_string(&mut args).unwrap();
    dbg!(file);
    let args = args.trim();
    // println!("would you like to use untested/unstable version? (y/n)   \n");
    println!("{}",args);

    if args.starts_with("loxabla beta/") {
        println!("version type declared succesfully.");
    } else if !args.starts_with("loxable beta/") {
        panic!("current version of loxable is unavailable in this version, please upgrade your loxable version.");
    }
    */
    println!("welcome to loxable shell, alpha version.");

    let mut input = String::new();
    io::stdin().read_line(&mut input).expect("not a valid ascii string");
    let input = input.trim();
    let input: Vec<char> = input.chars().collect();
    let mut to_lex = Lexer { text: input, pos: 0, current_char: ' '  };
    let to_lex2 = to_lex.make_tokens();
    println!("{:#?}",to_lex2);
}

In make_tokens, in each of the match arms, you make a copy of self using clone(). You then advance that copy. However, the original object is completely unchanged. The reason for this is that advance takes an owned self. The first argument to advance should be &mut self, which is a mutable borrow to a Lexer. Then you can avoid doing the clone and just call self.advance().
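To make the difference concrete, here is a minimal sketch. It uses a stripped-down, hypothetical `Cursor` type (not the `Lexer` from the question) with one method taking `self` by value and one taking `&mut self`:

```rust
#[derive(Debug, Clone)]
struct Cursor {
    pos: usize,
}

impl Cursor {
    // Takes `self` by value: the method owns whatever it is called on
    // and drops it when it returns, so the caller never sees the change.
    fn advance_owned(mut self) {
        self.pos += 1;
    }

    // Takes `&mut self`: the caller's value is modified in place.
    fn advance(&mut self) {
        self.pos += 1;
    }
}

// "Advances" a clone; the original cursor is left untouched.
fn pos_after_clone_advance() -> usize {
    let cursor = Cursor { pos: 0 };
    cursor.clone().advance_owned();
    cursor.pos
}

// Advances the cursor itself through a mutable borrow.
fn pos_after_mut_advance() -> usize {
    let mut cursor = Cursor { pos: 0 };
    cursor.advance();
    cursor.pos
}

fn main() {
    println!("clone + owned advance: {}", pos_after_clone_advance());
    println!("&mut self advance:     {}", pos_after_mut_advance());
}
```

The first helper leaves `pos` at 0, which is why a loop that only ever advances clones never terminates.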


You're advancing a clone of self, and not advancing self. You then immediately discard the clone.

Other notes on the code follow.


This

    let mut to_lex = Lexer { text: input, pos: 0, current_char: ' '  };

In combination with

                match self.current_char {
                ' ' => self.clone().advance(),

is a logic error as you always skip the first character (or panic on empty input).


And as for the loop:

        while self.current_char != '\0' {

Rust doesn't use C strings. '\0' is a valid Unicode code point that could appear in the middle of your string, but more often there is no '\0' at all, so the loop never sees a terminator.

Have advance return an Option<char>, or use an iterator, or something like that.
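One possible shape for this, as a rough sketch rather than the only way to do it: the `peek` helper, the `new` constructor and the `count_chars` function are names made up for the illustration, and `current_char` is dropped in favor of looking the character up from `text` and `pos`.

```rust
struct Lexer {
    text: Vec<char>,
    pos: usize,
}

impl Lexer {
    fn new(input: &str) -> Self {
        Self { text: input.chars().collect(), pos: 0 }
    }

    // Character at the current position, or None at end of input.
    fn peek(&self) -> Option<char> {
        self.text.get(self.pos).copied()
    }

    // Returns the current character and moves past it.
    // Returns None (and does not move) once the input is exhausted.
    fn advance(&mut self) -> Option<char> {
        let c = self.peek()?;
        self.pos += 1;
        Some(c)
    }
}

// Walks the whole input; the loop ends on None, not on a sentinel character.
fn count_chars(input: &str) -> usize {
    let mut lexer = Lexer::new(input);
    let mut n = 0;
    while let Some(_c) = lexer.advance() {
        n += 1;
    }
    n
}

fn main() {
    println!("{}", count_chars("1 + 2"));
    println!("{}", count_chars(""));
    // An embedded NUL is just another character here.
    println!("{}", count_chars("a\0b"));
}
```

Because the end of input is signalled by `None`, empty input and strings containing '\0' need no special handling.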


Instead of

if DIGIT.find(self.current_char).is_some() {

you can write

if self.current_char.is_ascii_digit() {


Here:

pub mod tokenz {
    pub static mut TOKENS: Vec<crate::Operator> = vec![]; 
}
// ...
                    unsafe { TOKENS.push(Operator::Plus) };

Just don't. static mut is so hard to use correctly that it may be deprecated in a future version. Create that Vec in make_tokens and pass around a &mut Vec<_> if you need to. Then return it.
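A small sketch of that pattern, assuming a cut-down `Operator` enum and a hypothetical `push_operator` helper that stands in for any function needing to append tokens:

```rust
#[derive(Debug, PartialEq)]
enum Operator {
    Plus,
    Minus,
}

// Appends to a Vec owned by the caller; no global state and no `unsafe`.
fn push_operator(c: char, tokens: &mut Vec<Operator>) {
    match c {
        '+' => tokens.push(Operator::Plus),
        '-' => tokens.push(Operator::Minus),
        _ => {} // everything else is ignored in this sketch
    }
}

// Creates the Vec locally, lends it out as `&mut`, then returns it.
fn make_tokens(input: &str) -> Vec<Operator> {
    let mut tokens = Vec::new();
    for c in input.chars() {
        push_operator(c, &mut tokens);
    }
    tokens
}

fn main() {
    println!("{:?}", make_tokens("+ - +"));
}
```

The caller gets the tokens as a return value, so nothing outside `make_tokens` can observe a half-built list.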


Makes me wonder why the compiler gives no warning about self.current_char's new value being never read, in the implementation of advance.

I'm playing with the playground, and I assume you ended up with the clones due to this error:

  --> src/main.rs:49:24
   |
49 |                 ' ' => self.advance(),
   |                        ^^^^^---------
   |                        |    |
   |                        |    `*self` moved due to this method call
   |                        move occurs because `*self` has type `Lexer`, which does not implement the `Copy` trait
   |
help: you can `clone` the value and consume it, but this might not be your desired behavior
   |
49 |                 ' ' => self.clone().advance(),
   |                             ++++++++

The advice was indeed not your desired behavior. Here's the change you want:

-    pub fn advance(mut self) {
+    pub fn advance(&mut self) {

More later probably.


The playground doesn't seem able to make gists right now, but here's what I came up with without changing the overall structure of your code too much.

#[derive(Debug)]
pub enum Operator {
    Plus,
    Minus,
    Division,
    Multiplication,
    Lparanthesis,
    Rparanthesis,
    Float(f64),
    Int(i64),
}

#[derive(Debug, Clone)]
pub struct Lexer {
    pub text: Vec<char>,
    pub pos: usize,
}

impl Lexer {
    pub fn new(text: Vec<char>) -> Self {
        Self { text, pos: 0 }
    }
    
    pub fn advance(&mut self) {
        self.pos += 1;
    }
    
    pub fn make_number(&mut self) -> Operator {
        let mut num_str = String::new();
        let mut dot_seen = false;
        
        let mut digit = self.text[self.pos];
        while digit.is_ascii_digit() || digit == '.' {
            if digit == '.' {
                if dot_seen {
                    break;
                }
                dot_seen = true;
            }
            
            num_str.push(digit);
            self.advance();
            digit = match self.text.get(self.pos).copied() {
                Some(c) => c,
                None => break,
            };
        }

        if dot_seen {
            Operator::Float(num_str.parse::<f64>().unwrap())
        } else {
            Operator::Int(num_str.parse::<i64>().unwrap())
        }
    }
    
    pub fn make_tokens(&mut self) -> Vec<Operator> {
        let mut tokens = Vec::new();
        while let Some(current_char) = self.text.get(self.pos).copied() {
            match current_char {
                ' ' => self.advance(),
                '+' => {
                    tokens.push(Operator::Plus);
                    self.advance();
                }
                '-' => {
                    tokens.push(Operator::Minus);
                    self.advance();
                }
                '*' => {
                    tokens.push(Operator::Multiplication);
                    self.advance();
                }
                '/' => {
                    tokens.push(Operator::Division);
                    self.advance();
                }
                '(' => {
                    tokens.push(Operator::Lparanthesis);
                    self.advance();
                }
                ')' => {
                    tokens.push(Operator::Rparanthesis);
                    self.advance();
                }
                digit if digit.is_ascii_digit() => {
                    tokens.push(self.make_number());
                }
                _ => {
                    self.advance();
                }
            }
        }
        tokens
    }
}

fn main() {
    println!("welcome to loxable shell, alpha version.");

    let input = "3.4 + 12";
    let input: Vec<char> = input.chars().collect();
    let mut to_lex = Lexer::new(input);
    let to_lex2 = to_lex.make_tokens();
    println!("{:#?}", to_lex2);
}

Briefly,

  • I removed the clone()s and advance takes &mut self
  • I create and return the Vec<Operator> as suggested above
  • I'm using is_ascii_digit as suggested above
  • I removed current_char; it's redundant with text and pos
  • I check for the end of input by using get(pos).copied() and checking for None instead
  • (minor) I replaced the dot count with dot_seen: bool
  • (minor) I made a constructor for Lexer

Untouched upon, but things to consider:

  • You can leave the input as a String and use things like chars or char_indices to iterate over individual characters. This may be more verbose in some ways, but on the other hand, you may be able to avoid reconstructing a String while parsing numbers or other multi-character constructs.

  • More generally, the lex-and-parse advice you'll find by following the links in the other thread will be better than anything I could manage to write up here

  • At some point, you're going to want error handling (for example you currently just skip unexpected characters). It is usually easier to build that in from the start than to revamp later, even if your errors are rubbish to begin with (due to wanting to focus on the happy path).

    // bare-bones types to be expanded "later"
    #[derive(Debug)]
    pub enum ErrorKind {
        Todo,
    }
    #[derive(Debug)]
    pub struct Error {
        kind: ErrorKind,
    }
    impl Error {
        fn todo() -> Self {
            Self { kind: ErrorKind::Todo }
        }
    }
    
          pub fn make_number(&mut self) -> Result<Operator, Error> {
              // ...
              if dot_seen {
                  let f = num_str.parse().map_err(|_| Error::todo())?;
                  Ok(Operator::Float(f))
              } else {
                  let i = num_str.parse().map_err(|_| Error::todo())?;
                  Ok(Operator::Int(i))
              }
          } 
          pub fn make_tokens(&mut self) -> Result<Vec<Operator>, Error> {
              // ...
                      digit if digit.is_ascii_digit() => {
                          tokens.push(self.make_number()?); // <-- ?
                      }
                      _ => {
                          return Err(Error::todo());
                      }
                  }
              }
              Ok(tokens)
          }
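For the first point in the list above (keeping the input as a string and iterating with `char_indices`), here is a rough sketch. It is not a drop-in replacement for the code in this thread: the `Token` enum, the `lex` function and the `String` error type are simplifications chosen for the example, and only `+`, spaces and numbers are handled.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Plus,
    Int(i64),
    Float(f64),
}

// Lexes directly from a &str; numbers are parsed from a slice of the input
// instead of being rebuilt character by character in a new String.
fn lex(input: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = input.char_indices().peekable();

    while let Some(&(start, c)) = chars.peek() {
        if c == ' ' {
            chars.next();
        } else if c == '+' {
            tokens.push(Token::Plus);
            chars.next();
        } else if c.is_ascii_digit() {
            let mut end = start;
            let mut dot_seen = false;
            // Consume digits and at most one dot, tracking the byte range.
            while let Some(&(i, d)) = chars.peek() {
                if d.is_ascii_digit() || (d == '.' && !dot_seen) {
                    dot_seen |= d == '.';
                    end = i + d.len_utf8();
                    chars.next();
                } else {
                    break;
                }
            }
            let slice = &input[start..end];
            let token = if dot_seen {
                let f = slice
                    .parse::<f64>()
                    .map_err(|_| format!("bad float: {}", slice))?;
                Token::Float(f)
            } else {
                let i = slice
                    .parse::<i64>()
                    .map_err(|_| format!("bad integer: {}", slice))?;
                Token::Int(i)
            };
            tokens.push(token);
        } else {
            // Report unexpected characters instead of silently skipping them.
            return Err(format!("unexpected character {:?} at byte {}", c, start));
        }
    }
    Ok(tokens)
}

fn main() {
    println!("{:?}", lex("3.4 + 12"));
    println!("{:?}", lex("3 $ 4"));
}
```

`peek` lets the number loop stop at the first character that is not part of the number without consuming it, so the outer loop sees that character next.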
    

I fixed my code. It runs now, but all it does is return an empty vector.
Here's my new (also buggy and broken) code:


#[derive(Debug, Clone)]
pub enum Operator {
    Plus,
    Minus,
    Division,
    Multiplication,
    Lparanthesis,
    Rparanthesis,
    Float(f64),
    Int(i64),
}



#[derive(Debug, Clone)]
pub struct Lexer {
    pub text: Vec<char>,
    pub pos: usize,
    pub current_char: char,
}
impl Lexer {
    pub fn advance(&mut self) {
        self.pos += 1;
        self.current_char = self.text[self.pos];
    }
    pub fn make_number(&mut self) -> Operator {
        const DIGIT: &str = "0123456789";
        let mut num_str = String::new();
        let mut dot_count = 0;
        while DIGIT.find(self.current_char).is_some() || self.current_char == '.' {
            if self.current_char == '.' {
                if dot_count == 1 {
                    break;
                }
                dot_count += 1;
            } else {
                num_str.push(self.current_char);
                self.advance();
            }
        }
        if dot_count == 0 {
            return Operator::Int(num_str.parse::<i64>().unwrap())
        } else {
            return Operator::Float(num_str.parse::<f64>().unwrap())
        }

    }
    pub fn make_tokens(&mut self) -> Vec<crate::Operator> {
        let mut TOKENS: Vec<crate::Operator>  = vec![];
        const DIGIT: &str = "0123456789";
        while self.pos > self.text.len() {
                match self.current_char {
                ' ' => self.advance(),
                '+' => {
                    TOKENS.push(Operator::Plus);
                    self.advance();
                }
                '-' => {
                    TOKENS.push(Operator::Minus);
                    self.advance();
                }
                '*' => {
                    TOKENS.push(Operator::Multiplication);
                    self.advance();
                }
                '/' => {
                    TOKENS.push(Operator::Division);
                    self.advance();
                }
                '(' => {
                    TOKENS.push(Operator::Lparanthesis);
                    self.advance();
                }
                ')' => {
                    TOKENS.push(Operator::Rparanthesis);
                    self.advance();
                }
                _ => {
                    if DIGIT.find(self.current_char).is_some() {
                        self.make_number();
                    } else {
                        self.advance();
                    }
                }
            }
        }
        return TOKENS;
    }
}

use std::{io};
// use std::{fs, io, io::Read};

fn main() {
    /*
    println!("welcome to loxable shell, please input file name.");
    let mut filename = String::new();
    io::stdin().read_line(&mut filename).expect("not a valid input. (please dont use emojis or etc)");
    let filename = filename.trim();
    let mut args = String::new();
    let mut file = fs::File::open(filename).unwrap();
    file.read_to_string(&mut args).unwrap();
    dbg!(file);
    let args = args.trim();
    // println!("would you like to use untested/unstable version? (y/n)   \n");
    println!("{}",args);

    if args.starts_with("loxabla beta/") {
        println!("version type declared succesfully.");
    } else if !args.starts_with("loxable beta/") {
        panic!("current version of loxable is unavailable in this version, please upgrade your loxable version.");
    }
    */
    println!("welcome to loxable shell, alpha version.");

    let mut input = String::new();
    io::stdin().read_line(&mut input).expect("not a valid ascii string");
    let input = input.trim();
    let input: Vec<char> = input.chars().collect();
    let mut to_lex = Lexer { text: input, pos: 0, current_char: ' '  };
    let to_lex2 = to_lex.make_tokens();
    println!("{:#?}",to_lex2);
}

Also, by the way, I recommend using OnlineGDB to test code instead of the Rust Playground, because it allows user input.

I also tried returning like this instead

return TOKENS.into_iter().collect(); 

while self.pos > self.text.len() is never going to be true, you probably meant to write < instead.

Returning TOKENS.into_iter().collect() is unnecessary; return TOKENS; already gives you the same Vec.

I changed the code but it still doesn't work; it panics because it goes out of bounds.


#[derive(Debug, Clone)]
pub enum Operator {
    Plus,
    Minus,
    Division,
    Multiplication,
    Lparanthesis,
    Rparanthesis,
    Float(f64),
    Int(i64),
}



#[derive(Debug, Clone)]
pub struct Lexer {
    pub text: Vec<char>,
    pub pos: usize,
    pub current_char: char,
}
impl Lexer {
    pub fn advance(&mut self) {
        self.pos += 1;
        self.current_char = self.text[self.pos];
    }
    pub fn make_number(&mut self) -> Operator {
        const DIGIT: &str = "0123456789";
        let mut num_str = String::new();
        let mut dot_count = 0;
        while DIGIT.find(self.current_char).is_some() || self.current_char == '.' {
            if self.current_char == '.' {
                if dot_count == 1 {
                    break;
                }
                dot_count += 1;
            } else {
                num_str.push(self.current_char);
                self.advance();
            }
        }
        if dot_count == 0 {
            return Operator::Int(num_str.parse::<i64>().unwrap())
        } else {
            return Operator::Float(num_str.parse::<f64>().unwrap())
        }

    }
    pub fn make_tokens(&mut self) -> Vec<crate::Operator> {
        let mut TOKENS: Vec<crate::Operator>  = vec![];
        const DIGIT: &str = "0123456789";
        while self.pos < self.text.len() {
                match self.current_char {
                ' ' => self.advance(),
                '+' => {
                    TOKENS.push(Operator::Plus);
                    self.advance();
                }
                '-' => {
                    TOKENS.push(Operator::Minus);
                    self.advance();
                }
                '*' => {
                    TOKENS.push(Operator::Multiplication);
                    self.advance();
                }
                '/' => {
                    TOKENS.push(Operator::Division);
                    self.advance();
                }
                '(' => {
                    TOKENS.push(Operator::Lparanthesis);
                    self.advance();
                }
                ')' => {
                    TOKENS.push(Operator::Rparanthesis);
                    self.advance();
                }
                _ => {
                    if DIGIT.find(self.current_char).is_some() {
                        self.make_number();
                    } else {
                        self.advance();
                    }
                }
            }
        }
        return TOKENS.into_iter().collect();
    }
}

use std::{io};
// use std::{fs, io, io::Read};

fn main() {
    /*
    println!("welcome to loxable shell, please input file name.");
    let mut filename = String::new();
    io::stdin().read_line(&mut filename).expect("not a valid input. (please dont use emojis or etc)");
    let filename = filename.trim();
    let mut args = String::new();
    let mut file = fs::File::open(filename).unwrap();
    file.read_to_string(&mut args).unwrap();
    dbg!(file);
    let args = args.trim();
    // println!("would you like to use untested/unstable version? (y/n)   \n");
    println!("{}",args);

    if args.starts_with("loxabla beta/") {
        println!("version type declared succesfully.");
    } else if !args.starts_with("loxable beta/") {
        panic!("current version of loxable is unavailable in this version, please upgrade your loxable version.");
    }
    */
    println!("welcome to loxable shell, alpha version.");

    let mut input = String::new();
    io::stdin().read_line(&mut input).expect("not a valid ascii string");
    let input = input.trim();
    let input: Vec<char> = input.chars().collect();
    let mut to_lex = Lexer { text: input, pos: 0, current_char: ' '  };
    let to_lex2 = to_lex.make_tokens();
    println!("{:#?}",to_lex2);
}

My Rust lexer keeps returning an empty vector. Here's my code:

#[derive(Debug, Clone)]
pub enum Operator {
    Plus,
    Minus,
    Division,
    Multiplication,
    Lparanthesis,
    Rparanthesis,
    Float(f64),
    Int(i64),
}



#[derive(Debug, Clone)]
pub struct Lexer {
    pub text: Vec<char>,
    pub pos: usize,
    pub current_char: char,
}
impl Lexer {
    pub fn advance(&mut self) {
        self.pos += 1;
        self.current_char = self.text[self.pos];
    }
    pub fn make_number(&mut self) -> Operator {
        const DIGIT: &str = "0123456789";
        let mut num_str = String::new();
        let mut dot_count = 0;
        while DIGIT.find(self.current_char).is_some() || self.current_char == '.' {
            if self.current_char == '.' {
                if dot_count == 1 {
                    break;
                }
                dot_count += 1;
            } else {
                num_str.push(self.current_char);
                self.advance();
            }
        }
        if dot_count == 0 {
            return Operator::Int(num_str.parse::<i64>().unwrap())
        } else {
            return Operator::Float(num_str.parse::<f64>().unwrap())
        }

    }
    pub fn make_tokens(&mut self) -> Vec<crate::Operator> {
        let mut TOKENS: Vec<crate::Operator>  = vec![];
        const DIGIT: &str = "0123456789";
        while self.pos >= self.text.len() {
                match self.current_char {
                ' ' => self.advance(),
                '+' => {
                    TOKENS.push(Operator::Plus);
                    self.advance();
                }
                '-' => {
                    TOKENS.push(Operator::Minus);
                    self.advance();
                }
                '*' => {
                    TOKENS.push(Operator::Multiplication);
                    self.advance();
                }
                '/' => {
                    TOKENS.push(Operator::Division);
                    self.advance();
                }
                '(' => {
                    TOKENS.push(Operator::Lparanthesis);
                    self.advance();
                }
                ')' => {
                    TOKENS.push(Operator::Rparanthesis);
                    self.advance();
                }
                _ => {
                    if DIGIT.find(self.current_char).is_some() {
                        self.make_number();
                    } else {
                        self.advance();
                    }
                }
            }
        }
        return TOKENS.into_iter().collect();
    }
}

use std::{io};
// use std::{fs, io, io::Read};

fn main() {
    /*
    println!("welcome to loxable shell, please input file name.");
    let mut filename = String::new();
    io::stdin().read_line(&mut filename).expect("not a valid input. (please dont use emojis or etc)");
    let filename = filename.trim();
    let mut args = String::new();
    let mut file = fs::File::open(filename).unwrap();
    file.read_to_string(&mut args).unwrap();
    dbg!(file);
    let args = args.trim();
    // println!("would you like to use untested/unstable version? (y/n)   \n");
    println!("{}",args);

    if args.starts_with("loxabla beta/") {
        println!("version type declared succesfully.");
    } else if !args.starts_with("loxable beta/") {
        panic!("current version of loxable is unavailable in this version, please upgrade your loxable version.");
    }
    */
    println!("welcome to loxable shell, alpha version.");

    let mut input = String::new();
    io::stdin().read_line(&mut input).expect("not a valid ascii string");
    let input = input.trim();
    let input: Vec<char> = input.chars().collect();
    let mut to_lex = Lexer { text: input, pos: 0, current_char: ' '  };
    let to_lex2 = to_lex.make_tokens();
    println!("{:#?}",to_lex2);
}

Shouldn't this be <=?


That doesn't work. I tried it; here's my code.

#[derive(Debug, Clone)]
pub enum Operator {
    Plus,
    Minus,
    Division,
    Multiplication,
    Lparanthesis,
    Rparanthesis,
    Float(f64),
    Int(i64),
}



#[derive(Debug, Clone)]
pub struct Lexer {
    pub text: Vec<char>,
    pub pos: usize,
    pub current_char: char,
}
impl Lexer {
    pub fn advance(&mut self) {
        self.pos += 1;
        self.current_char = self.text[self.pos];
    }
    pub fn make_number(&mut self) -> Operator {
        const DIGIT: &str = "0123456789";
        let mut num_str = String::new();
        let mut dot_count = 0;
        while DIGIT.find(self.current_char).is_some() || self.current_char == '.' {
            if self.current_char == '.' {
                if dot_count == 1 {
                    break;
                }
                dot_count += 1;
            } else {
                num_str.push(self.current_char);
                self.advance();
            }
        }
        if dot_count == 0 {
            return Operator::Int(num_str.parse::<i64>().unwrap())
        } else {
            return Operator::Float(num_str.parse::<f64>().unwrap())
        }

    }
    pub fn make_tokens(&mut self) -> Vec<crate::Operator> {
        let mut TOKENS: Vec<crate::Operator>  = vec![];
        const DIGIT: &str = "0123456789";
        while self.pos <= self.text.len() {
                match self.current_char {
                ' ' => self.advance(),
                '+' => {
                    TOKENS.push(Operator::Plus);
                    self.advance();
                }
                '-' => {
                    TOKENS.push(Operator::Minus);
                    self.advance();
                }
                '*' => {
                    TOKENS.push(Operator::Multiplication);
                    self.advance();
                }
                '/' => {
                    TOKENS.push(Operator::Division);
                    self.advance();
                }
                '(' => {
                    TOKENS.push(Operator::Lparanthesis);
                    self.advance();
                }
                ')' => {
                    TOKENS.push(Operator::Rparanthesis);
                    self.advance();
                }
                _ => {
                    if DIGIT.find(self.current_char).is_some() {
                        self.make_number();
                    } else {
                        self.advance();
                    }
                }
            }
        }
        return TOKENS.into_iter().collect();
    }
}

use std::{io};
// use std::{fs, io, io::Read};

fn main() {
    /*
    println!("welcome to loxable shell, please input file name.");
    let mut filename = String::new();
    io::stdin().read_line(&mut filename).expect("not a valid input. (please dont use emojis or etc)");
    let filename = filename.trim();
    let mut args = String::new();
    let mut file = fs::File::open(filename).unwrap();
    file.read_to_string(&mut args).unwrap();
    dbg!(file);
    let args = args.trim();
    // println!("would you like to use untested/unstable version? (y/n)   \n");
    println!("{}",args);

    if args.starts_with("loxabla beta/") {
        println!("version type declared succesfully.");
    } else if !args.starts_with("loxable beta/") {
        panic!("current version of loxable is unavailable in this version, please upgrade your loxable version.");
    }
    */
    println!("welcome to loxable shell, alpha version.");

    let mut input = String::new();
    io::stdin().read_line(&mut input).expect("not a valid ascii string");
    let input = input.trim();
    let input: Vec<char> = input.chars().collect();
    let mut to_lex = Lexer { text: input, pos: 0, current_char: ' '  };
    let to_lex2 = to_lex.make_tokens();
    println!("{:#?}",to_lex2);
}

I still have an error that nobody can fix. I have tried using <= but it doesn't work. Can someone please tell me why it's returning an empty vector with >=, and panicking when I use <= instead?
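A small sketch may help show what each loop condition does. The helper functions below are made up for the illustration and only record which values of `pos` the loop body would see; they are not part of the lexer. Note that in the posted code the panic itself comes from `self.text[self.pos]` inside `advance`, which indexes right after `pos` is incremented.

```rust
// Indices the loop body sees with `while pos < text.len()`.
fn indices_with_lt(text: &[char]) -> Vec<usize> {
    let mut seen = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        seen.push(pos);
        pos += 1;
    }
    seen
}

// Indices the loop body sees with `while pos <= text.len()`.
// The last one equals text.len(), which is one past the final element.
fn indices_with_le(text: &[char]) -> Vec<usize> {
    let mut seen = Vec::new();
    let mut pos = 0;
    while pos <= text.len() {
        seen.push(pos);
        pos += 1;
    }
    seen
}

// Whether a `while pos >= text.len()` loop would run at all, starting at pos = 0.
fn loop_runs_with_ge(text: &[char]) -> bool {
    let pos: usize = 0;
    pos >= text.len()
}

fn main() {
    let text: Vec<char> = "1+2".chars().collect();
    println!("`<`  visits {:?}", indices_with_lt(&text));
    println!("`<=` visits {:?}", indices_with_le(&text));
    println!("`>=` loop runs at all: {}", loop_runs_with_ge(&text));
    // Index text.len() is out of range: `get` returns None, `text[i]` would panic.
    println!("text.get(text.len()) = {:?}", text.get(text.len()));
}
```

With `>=` the condition is false on the very first check for any non-empty input, so the body never runs and the vector stays empty. With `<=` the loop reaches an index that does not exist.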

I also tried using a Vec of type String instead of type Operator, like this (here's my code):

#[derive(Debug, Clone)]
pub enum Operator {
    Plus,
    Minus,
    Division,
    Multiplication,
    Lparanthesis,
    Rparanthesis,
    Float(f64),
    Int(i64),
}



#[derive(Debug, Clone)]
pub struct Lexer {
    pub text: Vec<char>,
    pub pos: usize,
    pub current_char: char,
}
impl Lexer {
    pub fn advance(&mut self) {
        self.pos += 1;
        self.current_char = self.text[self.pos];
    }
    pub fn make_number(&mut self) -> String {
        let mut num_str = String::new();
        let mut dot_count = 0;
        while self.current_char.is_ascii_digit() || self.current_char == '.' {
            if self.current_char == '.' {
                if dot_count == 1 {
                    break;
                }
                dot_count += 1;
            } else {
                num_str.push(self.current_char);
                self.advance();
            }
        }
        if dot_count == 0 {
            return num_str;
        } else {
            return num_str;
        }

    }
    pub fn make_tokens(&mut self) -> Vec<String> {
        let mut TOKENS: Vec<String>  = vec![];
        while self.pos >= self.text.len() {
                match self.current_char {
                ' ' => self.advance(),
                '+' => {
                    TOKENS.push("plus".to_string());
                    self.advance();
                }
                '-' => {
                    TOKENS.push("minus".to_string());
                    self.advance();
                }
                '*' => {
                    TOKENS.push("Multiplication".to_string());
                    self.advance();
                }
                '/' => {
                    TOKENS.push("division".to_string());
                    self.advance();
                }
                '(' => {
                    TOKENS.push("Lparanthesis".to_string());
                    self.advance();
                }
                ')' => {
                    TOKENS.push("Rparanthesis".to_string());
                    self.advance();
                }
                _ => {
                    if self.current_char.is_ascii_digit()  {
                        self.make_number();
                    } else {
                        self.advance();
                    }
                }
            }
        }
        return TOKENS.into_iter().collect();
    }
}

use std::{io};
// use std::{fs, io, io::Read};

fn main() {
    /*
    println!("welcome to loxable shell, please input file name.");
    let mut filename = String::new();
    io::stdin().read_line(&mut filename).expect("not a valid input. (please dont use emojis or etc)");
    let filename = filename.trim();
    let mut args = String::new();
    let mut file = fs::File::open(filename).unwrap();
    file.read_to_string(&mut args).unwrap();
    dbg!(file);
    let args = args.trim();
    // println!("would you like to use untested/unstable version? (y/n)   \n");
    println!("{}",args);

    if args.starts_with("loxabla beta/") {
        println!("version type declared succesfully.");
    } else if !args.starts_with("loxable beta/") {
        panic!("current version of loxable is unavailable in this version, please upgrade your loxable version.");
    }
    */
    println!("welcome to loxable shell, alpha version.");

    let mut input = String::new();
    io::stdin().read_line(&mut input).expect("not a valid ascii string");
    let input = input.trim();
    let input: Vec<char> = input.chars().collect();
    let mut to_lex = Lexer { text: input, pos: 0, current_char: ' '  };
    let to_lex2 = to_lex.make_tokens();
    println!("{:#?}",to_lex2);
}

AND IT'S STILL RETURNING AN EMPTY VECTOR

I managed to get it working, but there are a few errors:


#[derive(Debug)]
pub enum Operator {
    Plus,
    Minus,
    Division,
    Multiplication,
    Lparanthesis,
    Rparanthesis,
    Float(f64),
    Int(i64),
}



#[derive(Debug, Clone)]
pub struct Lexer {
    pub text: Vec<char>,
    pub pos: usize,
    pub current_char: char,
}
impl Lexer {

    pub fn advance(&mut self) {
        self.pos += 1;
        if self.pos < self.text.len() {
            self.current_char = self.text[self.pos];
        } else {
            self.current_char = ' '; // Fall back to a space (not a null character) once the end of the text is reached
        }
    }

    pub fn make_number(&mut self) -> Operator {
        const DIGIT: &str = "0123456789";
        let mut num_str = String::new();
        let mut first_char_dot = false;
        while DIGIT.find(self.current_char).is_some() || (self.current_char == '.' && !first_char_dot) {
            if self.current_char == '.' {
                first_char_dot = true;
            }
            num_str.push(self.current_char);
            self.advance(); // Advance the position here
        }
        if !first_char_dot {
            return Operator::Int(num_str.parse::<i64>().unwrap())
        } else {
            return Operator::Float(num_str.parse::<f64>().unwrap())
        }
    }

    pub fn make_tokens(&mut self) -> Vec<Operator> {
        let mut tokens: Vec<Operator> = vec![];
        const DIGIT: &str = "0123456789";
        while self.pos < self.text.len() {
                match self.current_char {
                ' ' => self.advance(),
                '+' => {
                    tokens.push(Operator::Plus);
                    self.advance();
                }
                '-' => {
                    tokens.push(Operator::Minus);
                    self.advance();
                }
                '*' => {
                    tokens.push(Operator::Multiplication);
                    self.advance();
                }
                '/' => {
                    tokens.push(Operator::Division);
                    self.advance();
                }
                '(' => {
                    tokens.push(Operator::Lparanthesis);
                    self.advance();
                }
                ')' => {
                    tokens.push(Operator::Rparanthesis);
                    self.advance();
                }
                '.' => {
                    self.advance();
                }
                _ => {
                    if DIGIT.find(self.current_char).is_some() {
                        tokens.push(self.make_number());
                    } else {
                        self.advance();
                    }
                }
            }
        }
        return tokens;
    }
}

See, if the user inputs, for example, 1+2, it does not register the first int.

Well, you can, if you pay attention to how your code works.

First of all, here's the self-contained playground - there's the code from your previous post, with input replaced by the constant string; running this code indeed returns an empty vector.

Now, the TOKENS are filled only inside the while loop (directly or indirectly). Let's check if this loop ever runs at all, first, by adding a println! at its beginning - updated playground. As you can see, no In loop messages are printed, i.e. the loop isn't executed.

Therefore, the loop condition is never satisfied. Let's check it explicitly with another println - or, even better, dbg; next playground outputs the following:

[src/main.rs:34] self.pos = 0
[src/main.rs:34] self.text.len() = 15
[src/main.rs:34] self.pos >= self.text.len() = false

So, the condition is indeed not satisfied, since self.pos is initially less than self.text.len().
Now, what condition should be there, logically? Well, as the lexer advances, the position starts from zero and scans through the text; therefore, it should always be less than the text's length, not more.
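
As a minimal sketch of that reasoning (the helper name visited_positions is hypothetical and not part of the original code), a scan that starts at zero and loops while pos < text.len() visits every valid index exactly once:

```rust
/// Hypothetical helper: returns every position a scanner visits
/// when it loops while `pos < text.len()`.
fn visited_positions(text: &[char]) -> Vec<usize> {
    let mut visited = Vec::new();
    let mut pos = 0;
    // `pos` starts at zero and only grows, so "less than the length"
    // is the condition that holds for every valid index.
    while pos < text.len() {
        visited.push(pos);
        pos += 1;
    }
    visited
}
```

With the comparison flipped to pos >= text.len(), the body would never run for a non-empty text, which matches the empty output seen so far.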

Therefore, we should flip this condition over and see what happens. After that, the playground panics, and backtrace has the following line:

   6: playground::Lexer::advance
             at ./src/main.rs:10:29

That is, the panic is inside advance. Let's check what values the corresponding fields hold when this happens. Next playground, with the input shortened for convenience, prints the following:

In `advance`, before indexing:
[src/main.rs:11] self.pos = 1
[src/main.rs:11] self.text.len() = 1
thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', src/main.rs:12:29

That is, you're attempting to get the second character (the character with index 1, with index being zero-based) from the string where there's only one character.
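
To make the difference concrete, here is a small sketch (char_at is a hypothetical helper, not something from your code) contrasting plain indexing with the standard slice method get, which returns None instead of panicking:

```rust
/// Hypothetical helper: reads the character at `pos` without panicking.
fn char_at(text: &[char], pos: usize) -> Option<char> {
    // `get` returns `None` for an out-of-bounds index,
    // where `text[pos]` would panic.
    text.get(pos).copied()
}
```

For a one-character text, char_at(&text, 0) gives Some of that character and char_at(&text, 1) gives None, whereas text[1] is exactly the panic shown above.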

Now, we might think that we simply have to make the loop condition stricter - to have self.pos < self.text.len() - 1. Let's try this in next playground, with a little synthetic input - here I'm cutting corners a little to show you the point faster.
This playground doesn't panic, but outputs the following:

[
    "plus",
]

...while the input is "(+)", which is clearly three tokens, not one.

To see what happens, let's again add debugging information. Now, we'll look at the current_char being matched on each iteration, and this playground prints the following:

[src/main.rs:35] self.current_char = ' '
[src/main.rs:35] self.current_char = '+'

That is, lexer indeed doesn't see either the first or the last character.

The only place where current_char changes is inside advance. Let's see how it is called - this playground outputs a relatively helpful info:

[src/main.rs:39] self.current_char = ' '
Begin `advance`
[src/main.rs:10] self.pos = 0
[src/main.rs:10] self.current_char = ' '
End `advance`
[src/main.rs:14] self.pos = 1
[src/main.rs:14] self.current_char = '+'
[src/main.rs:39] self.current_char = '+'
Begin `advance`
[src/main.rs:10] self.pos = 1
[src/main.rs:10] self.current_char = '+'
End `advance`
[src/main.rs:14] self.pos = 2
[src/main.rs:14] self.current_char = ')'

So:

  • At the first iteration, current_char is space, therefore we call advance immediately. It increments pos and reads self.text[self.pos]. Note that the first character is never being read - current_char is immediately set to the second one.
  • At the second iteration, we process the + sign and call advance again, to read the third character.
  • After that, the loop condition is no longer satisfied, so the third character is never matched.

Let's focus on the first step. We want to somehow have the first character of text be read into current_char, so that it can be matched on. Currently, it doesn't happen, since the position is incremented before reading. But what we can do is to swap these operations - to read the value at current position, and then increment it. Next playground indeed shows, that it's the first token that's picked up after this swap, and not the second one.
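
The effect of the two orderings can be isolated in a simplified model. This is only a sketch, not the playground code: both function names are made up, and the loop bounds are chosen so that neither version panics.

```rust
/// Increment first, then read (the original order):
/// position 0 is never read.
fn increment_then_read(text: &[char]) -> Vec<char> {
    let mut seen = Vec::new();
    let mut pos = 0;
    while pos + 1 < text.len() {
        pos += 1;
        seen.push(text[pos]);
    }
    seen
}

/// Read first, then increment (the swapped order):
/// every position is read, starting with position 0.
fn read_then_increment(text: &[char]) -> Vec<char> {
    let mut seen = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        seen.push(text[pos]);
        pos += 1;
    }
    seen
}
```

For the input "(+)", the first function sees only '+' and ')', while the second one sees all three characters.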

Next, we can check why the first token is the only one to show in the output. Let's check the loop condition at the end of each iteration, to see when the loop ends; next playground gives us the following:

[src/main.rs:69] self.pos = 1
[src/main.rs:69] self.text.len() - 1 = 2
[src/main.rs:69] self.pos < self.text.len() - 1 = true
[src/main.rs:69] self.pos = 2
[src/main.rs:69] self.text.len() - 1 = 2
[src/main.rs:69] self.pos < self.text.len() - 1 = false

That is:

  • On each iteration, self.pos is pointing at the next character. Therefore, on the last iteration, it should be pointing to the "place after the last character".
  • The condition, however, stops it not one, but two steps earlier - that's why we have two missing characters at the end.

We could try and just change the condition to stop when pos is "after the last character" - but in this case we would again get the panic, since in any branch we would call advance, which would unconditionally index into the self.text, with the index being out of bounds. To solve this part of the problem, we should step back and ask ourselves: why do we need the current_char in the first place? It is always being set to self.text[self.pos], and self.pos is always being incremented at the same time, so can't we just... use self.text[self.pos] explicitly?

Let's try this idea. We just replace every usage of current_char and comment out the line setting it inside advance. Next playground has this change applied, together with the reverted change in loop condition, and...

[
    "Lparanthesis",
    "plus",
    "Rparanthesis",
]

Yeah, it seems we've got it!
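
For reference, a cut-down sketch of what the lexer looks like after this change. It is not the full playground code: it handles only the three tokens used in the synthetic input and skips everything else, and the token strings are copied from the output above.

```rust
/// Cut-down lexer without a `current_char` field: only `text` and `pos`.
struct Lexer {
    text: Vec<char>,
    pos: usize,
}

impl Lexer {
    /// Only moves the position; nothing is read here,
    /// so nothing can go out of bounds.
    fn advance(&mut self) {
        self.pos += 1;
    }

    fn make_tokens(&mut self) -> Vec<String> {
        let mut tokens = Vec::new();
        // The loop condition guarantees `self.text[self.pos]` is in bounds.
        while self.pos < self.text.len() {
            match self.text[self.pos] {
                '+' => tokens.push("plus".to_string()),
                '(' => tokens.push("Lparanthesis".to_string()),
                ')' => tokens.push("Rparanthesis".to_string()),
                // Everything else is skipped in this sketch.
                _ => {}
            }
            self.advance();
        }
        tokens
    }
}
```

Since the character is read only inside the loop, where pos is known to be in bounds, advance no longer needs any check of its own.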


(Written before your latest sort-of working post.)

You still have the logic error where you initialize current_char to ' ' and then immediately advance to position 1 when you call make_tokens.

More generally, eventually you're going to advance past your input:

[ i ][ n ][ p ][ u ][ t ][ 💥 ][ 💥 ][ 💥 ][ 💥 ][ 💥 ][ 💥 ]
  0    1    2    3    4    ^ illegal to read this or anything past it

The way you know you've read all the input in your current code is when pos >= text.len(); in the example above, when pos > 4, i.e. when pos becomes 5. When this happens you can't read the character at position 5. So you need to add some sort of check to this function:

    pub fn advance(&mut self) {
        self.pos += 1;
        self.current_char = self.text[self.pos];
    }

Because eventually you're going to get self.pos == 5 -- that's how you know you're done -- and when that happens, you can't read self.text[self.pos].

Or avoid it like I did by just not reading the character preemptively.
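
If you do want to keep current_char, one possible shape for that check is sketched below. It assumes you're willing to change the field to an Option<char>, and the new constructor is a made-up addition that also takes care of the initialization problem mentioned at the top:

```rust
struct Lexer {
    text: Vec<char>,
    pos: usize,
    /// `None` means "past the end of the input".
    current_char: Option<char>,
}

impl Lexer {
    /// Hypothetical constructor: starts at position 0 with the real
    /// first character instead of a placeholder space.
    fn new(input: &str) -> Self {
        let text: Vec<char> = input.chars().collect();
        let current_char = text.first().copied();
        Lexer { text, pos: 0, current_char }
    }

    /// Moves one position forward; `get` yields `None` instead of
    /// panicking once `pos` reaches `text.len()`.
    fn advance(&mut self) {
        self.pos += 1;
        self.current_char = self.text.get(self.pos).copied();
    }
}
```

The main loop can then run while current_char is Some(..) and stop cleanly on None.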


Thanks to @quinedot and @Cerber-Ursi for helping me solve this problem, both of your responses were the most helpful! This post will surely help many others along the way, considering there is a lot of code, and responses to it, and many different styles of writing code. Anyways, if you are someone reading this post and want to achieve a lexer that returns an enum instead of a string, here's my code:

use std::num::ParseIntError;
use std::result::Result;

#[derive(Debug)]
pub enum Operator {
    Plus,
    Minus,
    Division,
    Multiplication,
    Lparanthesis,
    Rparanthesis,
    Float(f64),
    Int(i128),
}



#[derive(Debug, Clone)]
pub struct Lexer {
    pub text: Vec<char>,
    pub pos: usize,
    pub current_char: char,
}
impl Lexer {

    pub fn parse_number(input: &str) -> Result<u8, ParseIntError> {
        input.trim_end().parse()
    }

    pub fn advance(&mut self) {
        if self.pos < self.text.len() {
            self.pos += 1;
        }
    }

    pub fn make_number(&mut self) -> Operator {
        let mut num_str = String::new();
        let mut dot_count = 0;

        while self.pos < self.text.len() && (self.text[self.pos].is_ascii_digit() || self.text[self.pos] == '.') {
            let c = self.text[self.pos];
            if c == '.' {
                if dot_count == 1 {
                    break;
                }
                dot_count += 1;
            }
            num_str.push(c);
            self.advance();
        }
        if dot_count == 0 {
            let num_str: i128 = num_str.trim_start_matches(' ').trim_end_matches(' ').parse::<i128>().expect("Not a i128 (integer)").into();
            return Operator::Int(num_str);
        } else {
            let num_str: f64 = num_str.trim_start_matches(' ').trim_end_matches(' ').parse::<f64>().expect("Not a f64 (float/decimal)").into();
            return Operator::Float(num_str);
        }
    }

    pub fn make_tokens(&mut self) -> Vec<Operator> {
        let mut tokens: Vec<Operator> = vec![];
        while self.pos < self.text.len() {
                match self.text[self.pos] {
                ' ' => self.advance(),
                '+' => {
                    tokens.push(Operator::Plus);
                    self.advance();
                }
                '-' => {
                    tokens.push(Operator::Minus);
                    self.advance();
                }
                '*' => {
                    tokens.push(Operator::Multiplication);
                    self.advance();
                }
                '/' => {
                    tokens.push(Operator::Division);
                    self.advance();
                }
                '(' => {
                    tokens.push(Operator::Lparanthesis);
                    self.advance();
                }
                ')' => {
                    tokens.push(Operator::Rparanthesis);
                    self.advance();
                }
                '.' => {
                    self.advance();
                }
                _ => {
                    if self.text[self.pos].is_ascii_digit() {
                        tokens.push(self.make_number());
                    } else {
                        self.advance();
                    }
                }
            }
        }
        return tokens.into_iter().collect();
    }
}

I will say, writing a lexer manually is not quite difficult, in the sense of the code needed being complex, but it is quite unintuitive. Heavy use of unit tests to validate that you're getting the output you're expecting for a lot of edge cases (including error cases!) is basically essential to prevent going mad, especially as you revisit it to add new tokens.
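
As a rough illustration of the kind of tests I mean, here is a sketch. The Token enum and the tokenize function are made up for this example (a tiny subset of the language above), and errors are reported through Result so that the error cases can be tested too:

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Plus,
    Minus,
    Int(i64),
}

/// Hypothetical lexer for a tiny subset: '+', '-', spaces and integers.
fn tokenize(input: &str) -> Result<Vec<Token>, String> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < chars.len() {
        let c = chars[pos];
        match c {
            ' ' => pos += 1,
            '+' => {
                tokens.push(Token::Plus);
                pos += 1;
            }
            '-' => {
                tokens.push(Token::Minus);
                pos += 1;
            }
            _ if c.is_ascii_digit() => {
                // Collect the whole run of digits before parsing it.
                let start = pos;
                while pos < chars.len() && chars[pos].is_ascii_digit() {
                    pos += 1;
                }
                let digits: String = chars[start..pos].iter().collect();
                let value = digits
                    .parse::<i64>()
                    .map_err(|e| format!("bad integer {digits:?}: {e}"))?;
                tokens.push(Token::Int(value));
            }
            _ => return Err(format!("unexpected character {c:?} at position {pos}")),
        }
    }
    Ok(tokens)
}

#[test]
fn empty_input_gives_no_tokens() {
    assert_eq!(tokenize(""), Ok(vec![]));
}

#[test]
fn first_and_last_characters_are_lexed() {
    assert_eq!(
        tokenize("1+2"),
        Ok(vec![Token::Int(1), Token::Plus, Token::Int(2)])
    );
}

#[test]
fn unknown_character_is_an_error() {
    assert!(tokenize("1 ? 2").is_err());
}

#[test]
fn overflowing_integer_is_an_error() {
    assert!(tokenize("99999999999999999999").is_err());
}
```

The second test is precisely the "1+2 does not register the first int" case from earlier in the thread; having it as a test means it can't silently come back when new tokens are added.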

While it's a decent enough way to learn how to use a language, you should probably just grab a crate when you're trying to get something working when writing Rust: the support is very strong.

Nowadays I just grab pest when I need to parse some relatively conventional syntax. It's not very good at things like lexical whitespace (it's possible, but clumsy), but if you can describe your syntax using the traditional BNF and regex token approach, it's probably way simpler to express as what pest implements: a parsing expression grammar (PEG), something like a mix of the best of both. It has a few quirks (needing to wrap the top rule in SOI/EOI, the odd way it provides operator precedence support), but it makes the code quite straightforward to implement once you've got a handle on the approach.

For lower level parsing my go-to is still nom, a somewhat clumsy to use tool (while Rust doesn't have impl trait statics, at least), but extremely flexible and reliable.

It's no surprise that both are at the top of Parser tooling — list of Rust libraries/crates // Lib.rs - you may want to take a look through those and see what you like.


I would generally agree; I have the feeling that OP was running into so many errors because s/he is making his/her own life unnecessarily difficult. The code is not very idiomatic, and it's trying to force common misconceptions (e.g. treating a string as an "array of characters") onto the language, which results in fighting with the idioms and APIs.

The presented simple language can be lexed much more simply and idiomatically: Rust Explorer
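
The linked code is not reproduced here. As a rough sketch of the general idea (not necessarily what the link contains; Token, lex_number and lex are names picked for this example), iterating over chars() with a Peekable removes the need for any manual index bookkeeping:

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, PartialEq)]
enum Token {
    Plus,
    Minus,
    Star,
    Slash,
    LParen,
    RParen,
    Int(i64),
    Float(f64),
}

/// Consumes a run of digits with at most one '.', without any indexing.
fn lex_number(chars: &mut Peekable<Chars<'_>>) -> Option<Token> {
    let mut literal = String::new();
    let mut seen_dot = false;
    while let Some(&c) = chars.peek() {
        if c.is_ascii_digit() || (c == '.' && !seen_dot) {
            seen_dot |= c == '.';
            literal.push(c);
            chars.next();
        } else {
            break;
        }
    }
    if seen_dot {
        literal.parse::<f64>().ok().map(Token::Float)
    } else {
        literal.parse::<i64>().ok().map(Token::Int)
    }
}

/// Returns `None` if a number fails to parse. Unknown characters are
/// skipped, mirroring the behaviour of the lexer posted above.
fn lex(input: &str) -> Option<Vec<Token>> {
    let mut chars = input.chars().peekable();
    let mut tokens = Vec::new();
    while let Some(&c) = chars.peek() {
        let token = match c {
            '+' => Token::Plus,
            '-' => Token::Minus,
            '*' => Token::Star,
            '/' => Token::Slash,
            '(' => Token::LParen,
            ')' => Token::RParen,
            _ if c.is_ascii_digit() => {
                // `lex_number` consumes the digits itself.
                tokens.push(lex_number(&mut chars)?);
                continue;
            }
            _ => {
                // Whitespace and anything unrecognised is skipped.
                chars.next();
                continue;
            }
        };
        chars.next();
        tokens.push(token);
    }
    Some(tokens)
}
```

The iterator itself knows when the input ends, so the "first character skipped" and "index out of bounds" problems from earlier in the thread have no place to occur.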


To be honest, I'm new to Rust and I ran into tons of errors trying to do it any other way than the one I used. Following someone else's code would not teach me much, and I would not understand it as well. Plus, this code works just fine.