Iterating over non-ASCII ranges

Iterating over a range of characters is easy; e.g. A-Z can be generated using:

let chars: Vec<_> = ('A'..='Z').collect();

However, doing the same over ranges that are potentially non-ASCII Unicode is not so easy, due to the way Unicode is defined.

Is there currently a solution for this, so that e.g. a range ä-ë would work in addition to ASCII-only ones?

If the range contains no invalid Unicode code points, then I suppose you can just iterate over u32 and use char::from_u32? If it does contain invalid code points, then I have no idea.
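For illustration, here is a minimal sketch of that idea. The helper name chars_between is made up for this example; it assumes that silently skipping invalid code points (the surrogates, and anything above U+10FFFF) is acceptable, which may or may not be what you want.

```rust
/// Collects every Unicode scalar value whose code point lies in `start..=end`.
/// `char::from_u32` returns `None` for surrogates (U+D800..=U+DFFF) and for
/// values above U+10FFFF, so `filter_map` simply drops those.
/// (`chars_between` is a hypothetical helper, not a standard library function.)
fn chars_between(start: u32, end: u32) -> Vec<char> {
    (start..=end).filter_map(char::from_u32).collect()
}
```

Calling chars_between(0xE4, 0xEB) yields the eight chars from ä to ë, and a range that straddles the surrogate block just comes out shorter than the numeric span would suggest.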

No, that's the issue: multi-code-point grapheme clusters, e.g. ë and 🙂, have to be supported too, and they can differ in length. So simply incrementing a counter won't work.

I'm not sure what exactly you mean by "work", since clearly, the equivalent compiles and runs. Is there something specific that you find incorrect in that output?

How do you define a grapheme range though? Like what's expected result of range between some random graphemes?

It's not so much that it's incorrect as that the example was perhaps a poor one, in retrospect.

I think a better example might be useful here. Consider the following ranges, defined using hex notation:

[#x20-#xD7FF] 
[#xE000-#xFFFD] 
[#x10000-#x10FFFF]

For each of these 3 ranges, I'd like each grapheme cluster ("character") contained within.
Generally the non-ASCII ranges will use this notation, so it's important to support it.
How can I accomplish this?

You can collect a range of chars into a string and then iterate the grapheme clusters: Rust Playground

Those look like ranges of scalar values (a subset of code points). Grapheme clusters can consist of multiple code points.

The encoding of a scalar value is variable-width in UTF-8, but can be fixed-width in other encodings. Rust chars are Unicode scalar values, stored directly as their 32-bit value, which makes char a fixed-width encoding of scalar values.
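A small sketch to make the width difference concrete, using only the standard library. The function names are invented for this example.

```rust
use std::mem::size_of;

/// Size of a `char` in memory; this is 4 bytes regardless of which scalar
/// value it holds.
fn char_size_in_bytes() -> usize {
    size_of::<char>()
}

/// Number of bytes each scalar value of `s` occupies when encoded as UTF-8
/// (between 1 and 4, depending on the code point).
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}
```

For the string "A\u{EB}\u{20AC}\u{1F642}" (A, ë, €, 🙂), utf8_widths returns 1, 2, 3 and 4 bytes respectively, while each of those values takes 4 bytes as a char.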

Grapheme clusters are variable width in any encoding, as you can combine arbitrarily many (combining) code points.
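As a rough illustration, the same visible "ë" can be one scalar value (precomposed U+00EB) or two (e followed by combining diaeresis U+0308), and further combining marks can be stacked on top. The sketch below only counts scalar values; actually segmenting text into grapheme clusters is not in the standard library and would need an external crate. The function name is made up for this example.

```rust
/// Counts the Unicode scalar values (`char`s) in `s`.
/// This is NOT a grapheme cluster count: one user-perceived character
/// may be made of several scalar values.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}
```

Here scalar_count("\u{EB}") is 1, scalar_count("e\u{308}") is 2, and scalar_count("e\u{308}\u{301}") is 3, even though each of these strings would typically be displayed as a single character.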

It's unclear what practical goal you are trying to accomplish (XY problem), but there's a decent chance that iterating over scalar values is not it. But, you can do that; it's what the char..char examples do. The values you get out will tend to be somewhat related to their neighbors, but there's no guarantee they carry any inherent meaning. The values in sequence won't define any meaningful grapheme clusters except by coincidence, for example.
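If iterating over scalar values does turn out to be what is needed, the three hex ranges from the question can be written directly as char ranges. This is a sketch only; the function name is hypothetical, and it yields scalar values, not grapheme clusters.

```rust
/// Iterates over every `char` in [#x20-#xD7FF], [#xE000-#xFFFD] and
/// [#x10000-#x10FFFF], in that order. The three ranges already avoid the
/// surrogate block, so every value in them is a valid `char`.
fn hex_range_chars() -> impl Iterator<Item = char> {
    ('\u{20}'..='\u{D7FF}')
        .chain('\u{E000}'..='\u{FFFD}')
        .chain('\u{10000}'..='\u{10FFFF}')
}
```

Collecting this into a Vec<char> gives a little over a million entries, so iterating lazily is probably preferable to collecting.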

Unicode is complicated and there may not be a straightforward way to achieve your practical goal in a way that works for all languages / scripts.

I think this hints at the real answer: you don't actually want "ranges", you want some property.

For example, maybe you never actually wanted 'A'..='Z', you wanted "uppercase letters", which is a general category in Unicode: https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values.

So, for example, with the regex crate (https://docs.rs/regex/latest/regex/#matching-one-character) you might use \p{Uppercase_Letter} instead of [A-Z].
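To get a feel for selecting by property without pulling in the regex crate, here is a sketch using the standard library's char::is_uppercase. Note the assumption: that method tests the Unicode Uppercase property, which is close to but not identical with the Uppercase_Letter general category, so treat it as an approximation of \p{Uppercase_Letter}. The helper name is made up.

```rust
use std::ops::RangeInclusive;

/// Keeps only the chars in `range` that have the Unicode `Uppercase`
/// property, as reported by `char::is_uppercase`.
/// (Approximation: `Uppercase` is not exactly the `Uppercase_Letter`
/// general category that the regex class `\p{Uppercase_Letter}` matches.)
fn uppercase_in(range: RangeInclusive<char>) -> Vec<char> {
    range.filter(|c| c.is_uppercase()).collect()
}
```

With this, uppercase_in('A'..='z') gives exactly A through Z, and uppercase_in('\u{C0}'..='\u{FF}') picks up letters such as Ä while leaving out the multiplication sign × that sits in the middle of that block.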
