Hi all,
I'm learning Rust and am trying to build a (basic) lexer for the .NET / C# programming language. Its goal is to detect whether multiple line terminators occur directly after each other.
Here's the solution I came up with:
// Defines the lexical tokens of the .NET / C# programming language.
#[derive(Debug, PartialEq)]
enum Token {
    // Lexical tokens with a "special" meaning.
    Identifier,
    LineTerminator,
    EndOfFile,
}

struct Lexer<'a> {
    chars: std::iter::Peekable<std::str::Chars<'a>>,
}

// Defines the "basic" implementation of the `Lexer` struct.
impl<'a> Lexer<'a> {
    // Constants which define the Unicode code points of specific characters.
    const UNICODE_OF_CARRIAGE_RETURN_CHAR: char = '\u{000D}';
    const UNICODE_OF_LINE_FEED_CHAR: char = '\u{000A}';
    const UNICODE_OF_NEXT_LINE_CHAR: char = '\u{0085}';
    const UNICODE_OF_LINE_SEPARATOR_CHAR: char = '\u{2028}';
    const UNICODE_OF_PARAGRAPH_SEPARATOR_CHAR: char = '\u{2029}';

    // Initializes a new instance of the `Lexer` struct which operates on `input`.
    fn new(input: &'a str) -> Self {
        Self {
            chars: input.chars().peekable(),
        }
    }

    // Processes the input and returns the next token.
    fn next_token(&mut self) -> Token {
        match self.chars.next() {
            Some(ch) => match ch {
                // A carriage return directly followed by a line feed counts as one terminator.
                Lexer::UNICODE_OF_CARRIAGE_RETURN_CHAR => match self.chars.peek() {
                    Some(&Lexer::UNICODE_OF_LINE_FEED_CHAR) => {
                        self.chars.next();
                        Token::LineTerminator
                    }
                    _ => Token::LineTerminator,
                },
                Lexer::UNICODE_OF_LINE_FEED_CHAR
                | Lexer::UNICODE_OF_NEXT_LINE_CHAR
                | Lexer::UNICODE_OF_LINE_SEPARATOR_CHAR
                | Lexer::UNICODE_OF_PARAGRAPH_SEPARATOR_CHAR => Token::LineTerminator,
                _ => Token::Identifier,
            },
            None => Token::EndOfFile,
        }
    }
}
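To make the stated goal concrete, here is a minimal sketch of how the lexer could be driven to detect two line terminators in a row. The helper `has_consecutive_line_terminators` is a hypothetical name that is not part of the code above, and the lexer is repeated in condensed form (code points written inline) only so the snippet stands on its own.

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, PartialEq)]
enum Token {
    Identifier,
    LineTerminator,
    EndOfFile,
}

struct Lexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Lexer<'a> {
    fn new(input: &'a str) -> Self {
        Self {
            chars: input.chars().peekable(),
        }
    }

    // Same logic as `next_token` above, with the code points written inline.
    fn next_token(&mut self) -> Token {
        match self.chars.next() {
            Some('\u{000D}') => {
                // Treat CR LF as a single line terminator.
                if self.chars.peek() == Some(&'\u{000A}') {
                    self.chars.next();
                }
                Token::LineTerminator
            }
            Some('\u{000A}' | '\u{0085}' | '\u{2028}' | '\u{2029}') => Token::LineTerminator,
            Some(_) => Token::Identifier,
            None => Token::EndOfFile,
        }
    }
}

// Hypothetical helper (not in the original code): returns true as soon as
// two line terminators occur directly after each other.
fn has_consecutive_line_terminators(input: &str) -> bool {
    let mut lexer = Lexer::new(input);
    let mut previous_was_terminator = false;
    loop {
        match lexer.next_token() {
            Token::EndOfFile => return false,
            Token::LineTerminator if previous_was_terminator => return true,
            Token::LineTerminator => previous_was_terminator = true,
            Token::Identifier => previous_was_terminator = false,
        }
    }
}
```

With this helper, an input such as "a\n\nb" would be flagged, while "a\r\nb" would not, because the CR LF pair is consumed as one terminator.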
This already works. Next, I tried extending the Token enumeration so that each token carries the position where it was found.
I did this by adding a Position struct:
// Defines a position (row, column) in a .NET / C# source code file.
#[derive(Debug, PartialEq)]
struct Position {
    row: u16,
    column: u16,
}
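The extended Token enumeration itself is not shown in this post. Judging from how the tokens are constructed further down (for example `Token::LineTerminator(self.pos)`), it presumably looks roughly like the sketch below, with every variant carrying a Position. The helper `position_of` is hypothetical and only there to show how the payload can be read back.

```rust
// Position as defined above.
#[derive(Debug, PartialEq)]
struct Position {
    row: u16,
    column: u16,
}

// Assumed shape of the extended enum: every variant carries the position
// at which the token was found.
#[derive(Debug, PartialEq)]
enum Token {
    Identifier(Position),
    LineTerminator(Position),
    EndOfFile(Position),
}

// Hypothetical helper for illustration: borrows the position stored in any token.
fn position_of(token: &Token) -> &Position {
    match token {
        Token::Identifier(pos) | Token::LineTerminator(pos) | Token::EndOfFile(pos) => pos,
    }
}
```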
Next, I extended the Lexer struct to include this Position.
struct Lexer<'a> {
    chars: std::iter::Peekable<std::str::Chars<'a>>,
    pos: Position,
}
And finally, I updated the Lexer to include the position when fetching the next token:
// Defines the "basic" implementation of the `Lexer` struct.
impl<'a> Lexer<'a> {
    // Constants which define the Unicode code points of specific characters.
    const UNICODE_OF_CARRIAGE_RETURN_CHAR: char = '\u{000D}';
    const UNICODE_OF_LINE_FEED_CHAR: char = '\u{000A}';
    const UNICODE_OF_NEXT_LINE_CHAR: char = '\u{0085}';
    const UNICODE_OF_LINE_SEPARATOR_CHAR: char = '\u{2028}';
    const UNICODE_OF_PARAGRAPH_SEPARATOR_CHAR: char = '\u{2029}';

    // Initializes a new instance of the `Lexer` struct which operates on `input`.
    fn new(input: &'a str) -> Self {
        Self {
            chars: input.chars().peekable(),
            pos: Position { row: 0, column: 0 },
        }
    }

    // Processes the input and returns the next token.
    fn next_token(&mut self) -> Token {
        match self.chars.next() {
            Some(ch) => match ch {
                Lexer::UNICODE_OF_CARRIAGE_RETURN_CHAR => match self.chars.peek() {
                    Some(&Lexer::UNICODE_OF_LINE_FEED_CHAR) => {
                        self.chars.next();
                        Token::LineTerminator(self.pos)
                    }
                    _ => Token::LineTerminator(self.pos),
                },
                Lexer::UNICODE_OF_LINE_FEED_CHAR
                | Lexer::UNICODE_OF_NEXT_LINE_CHAR
                | Lexer::UNICODE_OF_LINE_SEPARATOR_CHAR
                | Lexer::UNICODE_OF_PARAGRAPH_SEPARATOR_CHAR => Token::LineTerminator(self.pos),
                _ => Token::Identifier(self.pos),
            },
            None => Token::EndOfFile(self.pos),
        }
    }
}
And that's where the adventure with the borrow checker begins.
Here's one of the errors returned when running cargo check:
So, cargo advises implementing the Copy trait on the Position struct, which is easy enough:
// Defines a position (row, column) in a .NET / C# source code file.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Position {
    row: u16,
    column: u16,
}
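For context, here is a condensed sketch of the lexer with `Copy` derived, so that `self.pos` can be read out of `&mut self` as a snapshot. Two parts are assumptions rather than code from this post: the Token variants are assumed to carry a Position, and the row/column bookkeeping (`start_new_row` and the column increment) is hypothetical, since the code above never advances `pos`.

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, PartialEq, Clone, Copy)]
struct Position {
    row: u16,
    column: u16,
}

// Assumed shape of the extended enum.
#[derive(Debug, PartialEq)]
enum Token {
    Identifier(Position),
    LineTerminator(Position),
    EndOfFile(Position),
}

struct Lexer<'a> {
    chars: Peekable<Chars<'a>>,
    pos: Position,
}

impl<'a> Lexer<'a> {
    fn new(input: &'a str) -> Self {
        Self {
            chars: input.chars().peekable(),
            pos: Position { row: 0, column: 0 },
        }
    }

    fn next_token(&mut self) -> Token {
        // `Position` is `Copy`, so this takes a snapshot and leaves `self.pos` usable.
        let start = self.pos;
        match self.chars.next() {
            Some('\u{000D}') => {
                // Treat CR LF as a single line terminator.
                if self.chars.peek() == Some(&'\u{000A}') {
                    self.chars.next();
                }
                self.start_new_row();
                Token::LineTerminator(start)
            }
            Some('\u{000A}' | '\u{0085}' | '\u{2028}' | '\u{2029}') => {
                self.start_new_row();
                Token::LineTerminator(start)
            }
            Some(_) => {
                // Assumed bookkeeping: every other character advances the column.
                self.pos.column += 1;
                Token::Identifier(start)
            }
            None => Token::EndOfFile(start),
        }
    }

    // Assumed bookkeeping (not in the original code): move to the start of the next row.
    fn start_new_row(&mut self) {
        self.pos.row += 1;
        self.pos.column = 0;
    }
}
```

Each returned token holds its own copy of the position, while the lexer keeps updating its own `pos` field. With only `Clone` derived, the same snapshot could be taken with an explicit `self.pos.clone()`.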
But it feels like this isn't the Rust way of doing things.
I'm copying a struct that contains only two u16 fields, so I don't expect an issue here, but it feels like there should be a way to do this without copying the Position struct.
Any advice?
