How to get an entire UTF-8 char?

Hi,

I am comparing two Strings char by char.
They are UTF-8 strings.

In the documentation for the char_indices function on String, there is this example of accessing the char at some index.
But in fact, it is not the entire char... it is only part of the encoded character.

let yes = "y̆es";

let mut char_indices = yes.char_indices();

assert_eq!(Some((0, 'y')), char_indices.next()); // not (0, 'y̆')
assert_eq!(Some((1, '\u{0306}')), char_indices.next());

 // note the 3 here - the previous character took up two bytes
assert_eq!(Some((3, 'e')), char_indices.next());
assert_eq!(Some((4, 's')), char_indices.next());

assert_eq!(None, char_indices.next());

In my code, I try to find the first char that differs between two strings and print part of each string to show the difference (a few chars before, if possible, and a few chars after).
But how do I get the entire char?
I want to get the entire char 'y̆', for example, knowing just its index...
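A minimal std-only sketch of the comparison described above (the helper names `first_difference` and `context` are illustrative, not from any crate). Note that the context window snaps to char boundaries, not grapheme boundaries, so it can still split a cluster like y̆, which is exactly the problem discussed below:

```rust
// Walk both strings codepoint-by-codepoint and return the byte index
// of the first difference, if any.
fn first_difference(a: &str, b: &str) -> Option<usize> {
    a.char_indices()
        .zip(b.char_indices())
        .find(|((_, ca), (_, cb))| ca != cb)
        .map(|((i, _), _)| i)
        // If one string is a prefix of the other, they differ at the
        // end of the shorter one.
        .or_else(|| (a.len() != b.len()).then(|| a.len().min(b.len())))
}

// Slice roughly `radius` bytes around `at`, nudging the edges so the
// slice never falls inside a multi-byte character (which would panic).
fn context(s: &str, at: usize, radius: usize) -> &str {
    let mut start = at.saturating_sub(radius);
    while !s.is_char_boundary(start) {
        start -= 1;
    }
    let mut end = (at + radius).min(s.len());
    while !s.is_char_boundary(end) {
        end += 1;
    }
    &s[start..end]
}

fn main() {
    let a = "y̆es";
    let b = "y̆as";
    if let Some(i) = first_difference(a, b) {
        println!("differ at byte {}: {:?} vs {:?}", i, context(a, i, 2), context(b, i, 2));
    }
}
```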

You are probably looking for graphemes.

2 Likes

That's not possible, because Unicode doesn't work that way: y̆ is a combination of two codepoints.

There are crates that can help you get more human-like units out of a string, like unicode_segmentation.

But you really need to read some kind of documentation about Unicode if you want to go that way. Believe me, that rabbit hole is pretty damn deep.

5 Likes

Thank you very much. I thought there were tools for this in String. I will read up on that. Thank you for mentioning these features.

Yeah, the issue you are having is actually about graphemes: one grapheme like y̆ can be more than one codepoint (also often called a char). In this case it's y plus \u{0306}, which is a combining modifier.
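This split is visible with std alone; a quick demonstration:

```rust
fn main() {
    let g = "y̆"; // one grapheme...
    let cps: Vec<char> = g.chars().collect();
    assert_eq!(cps, ['y', '\u{0306}']); // ...two codepoints: base letter + combining breve
    assert_eq!(g.len(), 3); // ...and three bytes in UTF-8
}
```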

2 Likes

Basically: the story there is that many operations that you may want, and that feel "simple", become very tricky and convoluted with Unicode.

Every single damn thing that you may imagine becomes a pile of tables that you need to consult and apply.

It's also why UTF-8 is a proper representation for Unicode strings: the advantage that other representations have, direct access to codepoints by index, has zero applications to real-world algorithms if you want to support flags, emoji, and the whole zoo that humanity invented.

Thus arguments against UTF-8 sound more like "give us UTF-32 or UTF-16 so it would be easier to write broken and incorrect code". When phrased like that, the choice Rust made becomes obvious: basic operations on codepoints are in std, but operations that require tables tens of megabytes in size (I'm not joking!) don't belong in std; you need to decide what to do about them on a case-by-case basis.

3 Likes

The Rust standard library includes tools for working with text data byte-by-byte, or codepoint-by-codepoint. Those can be implemented easily and with minimal cultural data requirements because they're "just numbers." Those approaches work great for dealing with data storage, retrieval, and transmission, and for converting between encodings.
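The byte-level and codepoint-level views the standard library provides can be seen side by side on the thread's test string:

```rust
fn main() {
    let yes = "y̆es";

    // Byte-by-byte: five UTF-8 bytes (the combining breve takes two).
    assert_eq!(yes.bytes().count(), 5);

    // Codepoint-by-codepoint: four chars.
    assert_eq!(yes.chars().count(), 4);

    // Converting between encodings, e.g. to UTF-16 code units.
    let utf16: Vec<u16> = yes.encode_utf16().collect();
    assert_eq!(utf16.len(), 4);
}
```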

The problem you're trying to solve, of identifying the thing a person reading the text would consider to be a single letter, is not well-supported by the Rust standard library. Frankly, that's probably a reasonable choice: recognizing character divisions requires some extensive and frequently-updated data, which would make it a poor match for Rust's release cadence. Instead, that capability lives in crates like this one.

I would also encourage you to read up on Unicode normalization forms. (A small correction: the combining mark in your test string is a breve, U+0306, not a tilde, and y̆ happens to have no single precomposed codepoint. But many accented letters do: é, for instance, can be written either as the single codepoint U+00E9 or as e followed by the combining acute accent U+0301, and comparison results will depend on which representation each string uses.)
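The accented letter é, for example, has two codepoint sequences that render identically; std alone can demonstrate why naive comparison trips over this:

```rust
fn main() {
    // NFC form: a single precomposed codepoint.
    let composed = "\u{00e9}"; // "é"
    // NFD form: base letter plus combining acute accent.
    let decomposed = "e\u{0301}"; // also renders as "é"

    // They look identical but compare unequal byte-for-byte:
    assert_ne!(composed, decomposed);
    assert_eq!(composed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);
}
```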

6 Likes