Hello everyone,
I'm building a lexer/parser for fun and I have a few questions. I'm not a developer and I didn't even know what Unicode was until a few hours ago, so please correct me if I say something wrong!
I'm quite new to the concept of parsing text (and to Rust!) and I'm struggling with which route to go down in terms of Unicode support.
I've implemented a Lexer struct like this:
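use std::str::Chars;

pub struct Lexer<'source> {
    /// Current line number within source.
    line: usize,
    /// Current character (None once the source is exhausted).
    ch: Option<char>,
    /// Iterator over the chars of source.
    iter: Chars<'source>,
}

impl<'source> Lexer<'source> {
    /// Move to the next char, bumping the line counter on a newline.
    fn advance(&mut self) {
        let next = self.iter.next();
        if next == Some('\n') {
            self.line += 1;
        }
        self.ch = next;
    }
}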
...
<snip>
...
I wrote Lexer to track my vertical position by incrementing the "line" field whenever I encounter a newline. Rather than take the easy way out, I'd like to fully understand how to accurately track the column I'm at as well.
So, correct me if I am wrong, but I believe a single "visual unit" is known as a grapheme (or grapheme cluster?), and a grapheme can be composed of several Unicode code points. I also believe that the Chars type iterates over these code points one at a time, meaning that if I increment an index value on my Lexer every time I call .next(), the index will drift away from the visual column whenever a grapheme spans more than one code point. For example:
// One family emoji: man, zero-width joiner, woman, zero-width joiner, boy.
let mut lexer = Lexer::new("👨\u{200D}👩\u{200D}👦");
// lexer.index might be 0 here
lexer.next();
// index reads 1
lexer.next();
// index reads 2, even though we are still "visually" on character 0 within the source.
So if I implement it this way, any column I report from that index in a later error message will be misleading.
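To sanity-check my belief about code points, here is a minimal sketch using only the standard library. The helper name count_chars is made up for illustration and is not part of my Lexer. The string is spelled out with escapes, on the assumption that the family emoji is man + zero-width joiner + woman + zero-width joiner + boy.

```rust
/// Hypothetical helper (not part of the Lexer above): counts how many
/// values `Chars::next()` would yield for `source`.
fn count_chars(source: &str) -> usize {
    source.chars().count()
}

fn main() {
    // MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, BOY.
    // Many fonts draw this whole sequence as one family emoji.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}";

    // One visual unit, but five chars and eighteen UTF-8 bytes.
    println!("chars: {}", count_chars(family)); // prints "chars: 5"
    println!("bytes: {}", family.len()); // prints "bytes: 18"

    // Each char together with the byte offset where it starts.
    for (offset, ch) in family.char_indices() {
        println!("{offset:>2}: {}", ch.escape_unicode());
    }
}
```

If this is right, an index bumped once per .next() call would reach 5 by the end of what looks like a single character on screen.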
I did some research and found the unicode-segmentation crate, which seems to let me create an iterator over graphemes instead of code points. I think that would solve the problem. However, it would require me to refactor a lot of code (for example, the grapheme iterator seems to return &str instead of char), so I thought I would reach out here to see whether this is a common problem that is best solved another way. Am I thinking about this wrong?
Once again, I'm not a developer and I don't even work in IT, so feel free to talk to me like a child. Thanks for reading!