Tracking position in unicode-enabled lexer, best practice?

Hello everyone,

I'm building a lexer/parser for fun and I have a few questions. I'm not a developer and I didn't even know what Unicode was until a few hours ago, so please correct me if I say something wrong!

I'm quite new to the concept of parsing text (and to Rust!) and I'm struggling with which route to go down in terms of Unicode support.

I've implemented a Lexer struct like this:

pub struct Lexer<'source> {
    /// Position within source.
    line: usize,
    /// Current character.
    ch: Option<char>,
    /// Iterator over source.
    iter: Chars<'source>,
}

    fn advance(&mut self) {
        let next = self.iter.next();
        if next == Some('\n') {
            self.line = self.line + 1;
        }
        self.ch = next;
    }

  ...
  <snip>
  ...

I wrote Lexer to track my position vertically by incrementing the "line" property when I encounter a newline, but instead of taking the easy way out I'd like to fully understand how I can accurately track the column I am analyzing as well.

So, correct me if I am wrong, but I believe a single "visual unit" is known as a grapheme (or grapheme cluster?), and a grapheme can be composed of several unicode code points. I also believe that the Chars type will iterate over these code points, meaning that if I increment an index value on my Lexer every time I call .next(), the index will inevitably grow to be inaccurate. For example:

let mut lexer = Lexer::new("👨‍👩‍👦")
// lexer.index might be 0 here
lexer.next()
// index reads 1
lexer.next()
// index reads 2, even though we are still "visually" on character 0 within the source.

And so using the index value in an error message later will not provide a useful error message if I implement it this way.

I did some research and found the unicode-segmentation crate, which seems to allow me to create an iterator which will iterate over graphemes instead of code points. I think that would solve the problem. However, I noticed that it would require me to refactor a lot of code, the grapheme iterator seems to return &str instead of char for example, and so I thought I would reach out here to see if this is a common thing that is best solved another way. Am I thinking about this wrong?

Once again, I'm not a developer and I don't even work in IT, so feel free to talk to me like a child. Thanks for reading!

No, you've got that pretty much right. Graphemes have to be &str because they can consist of an arbitrary number of code points. unicode-segmentation is probably the crate you want for graphemes.

However, if I understand what you're trying to accomplish -- tracking column position in a terminal or terminal-like display -- there's no reliable, practical solution. As far as I'm aware, there's no algorithm for grapheme column with -- and no guarantee that different displays will give them the same width. Moreover, the definition of graphemes has changed between Unicode versions, which is why unicode-segmentation has two modes.

There is an official algorithm for column width of a given code point. But it also has two modes depending on your context, and doesn't work on the grapheme level, e.g. after character combining.

For example, I copy and pasted your family emoji into this playground. On my browser, the behavior I see is

  • In the code, the emoji got blow up to display as three faces with two joiners between them
    • Each face takes two columns and the joiner is displayed, and thus takes one column for a total of 8 columns
  • In the output, the emoji is "properly" combined into one character
    • It takes 2 columns
  • The unicode-width algorithms both say the width is 6 (two for the faces and zero for each joiner, presumably)
  • If I modify it to throw an error, the faces are printed individually, and the caret below which is supposed to point at the extra ; is too far right

And based on previous discussions... you may even see different behavior in your own browser.

2 Likes

If you are specifically writing to an ANSI terminal, you can discover widths by writing text and then querying the cursor position. This is slow (since it requires waiting for round-trip communication) but you can memoize the results.

I imagine that @jokocide just wants to produce column numbers for errors/spans in an application that doesn't know anything about the text rendering, though. The best available answer there is to use unicode-width and live with the possibility of error — unless you want to ask the user what text-editor they use and customize to that!

2 Likes

Another potential option is to report column offsets in terms of UTF8 bytes, if it makes sense for the application (e.g. an editor that also displays byte offsets, and your input is always UTF8).

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.