Iterating over non-ASCII ranges

jjpe · May 10, 2023, 11:03am

Iterating over a range of characters is easy; for example, A-Z can be generated using:

let chars: Vec<_> = ('A'..='Z').collect();

However, doing the same over ranges that are potentially non-ASCII Unicode is not so easy, due to the way Unicode is defined.

Is there currently a solution for this, so that e.g. a range ä-ë would work in addition to ASCII-only ones?

zirconium-n · May 10, 2023, 11:09am

If the range contains no invalid Unicode code points, then I suppose you can just iterate over u32 values and use char::from_u32? If it does contain invalid code points, then I have no idea.
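A minimal sketch of that suggestion, assuming the range bounds are given as raw `u32` code points (the helper name `scalar_values` is hypothetical). `char::from_u32` returns `None` for surrogates and for values above U+10FFFF, so `filter_map` can simply skip them:

```rust
/// Collects every Unicode scalar value in `start..=end`.
/// Code points that are not valid `char`s (the surrogates
/// U+D800..=U+DFFF and anything above U+10FFFF) are skipped,
/// because `char::from_u32` returns `None` for them.
fn scalar_values(start: u32, end: u32) -> Vec<char> {
    (start..=end).filter_map(char::from_u32).collect()
}
```

Calling `scalar_values(0xE4, 0xEB)` should yield the eight characters from ä through ë. Note that this walks scalar values only; it says nothing about grapheme clusters.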

jjpe · May 10, 2023, 11:12am

No, that's the issue: multi-codepoint grapheme clusters, e.g. ë and 🙂, have to be supported too, and they can differ in length. So simply incrementing a counter won't work.

H2CO3 · May 10, 2023, 11:12am

I'm not sure what exactly you mean by "work", since clearly, the equivalent compiles and runs. Is there something specific that you find incorrect in that output?

zirconium-n · May 10, 2023, 11:15am

How do you define a grapheme range, though? For example, what is the expected result of a range between two arbitrary graphemes?

jjpe · May 10, 2023, 6:07pm

It's not so much that it's incorrect as that the example was perhaps a poor one, in retrospect.

I think a better example might be useful here. Consider the following ranges, defined using hex notation:

[#x20-#xD7FF] 
[#xE000-#xFFFD] 
[#x10000-#x10FFFF]

For each of these 3 ranges, I'd like each grapheme cluster ("character") contained within.
Generally the non-ASCII ranges will use this notation, so it's important to support it.
How can I accomplish this?

jendrikw · May 10, 2023, 6:27pm

You can collect a range of chars into a string and then iterate the grapheme clusters: Rust Playground

quinedot · May 10, 2023, 6:39pm

Those look like ranges of scalar values (a subset of code points). Grapheme clusters can consist of multiple code points.

The encoding of a scalar value is variable-width in UTF-8, but can be fixed-width in other encodings. Rust chars are Unicode scalar values, represented directly as their value in 32-bit form, which is a fixed-width encoding of scalar values.
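To make the width difference concrete, here is a small sketch using only the standard library (the helper name `encoded_widths` is made up for illustration). It reports how many bytes a `char` needs in UTF-8, how many 16-bit units it needs in UTF-16, and the in-memory size of a Rust `char`:

```rust
/// Returns `(UTF-8 bytes, UTF-16 code units, size of a Rust char in bytes)`.
/// The first two vary per character; the last is the same for every `char`.
fn encoded_widths(c: char) -> (usize, usize, usize) {
    (c.len_utf8(), c.len_utf16(), std::mem::size_of::<char>())
}
```

For instance, `encoded_widths('A')` gives `(1, 1, 4)` while `encoded_widths('🙂')` gives `(4, 2, 4)`: the UTF-8 and UTF-16 lengths change, the `char` size does not.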

Grapheme clusters are variable width in any encoding, as you can combine arbitrarily many (combining) code points.
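A short illustration with the standard library alone (the helper name `scalar_count` is hypothetical). The same visible "ë" can be one precomposed scalar value or a base letter followed by a combining mark, and further marks can be stacked on top:

```rust
/// Counts the Unicode scalar values (`char`s) in `s`.
/// This is NOT a grapheme-cluster count: the standard library has no
/// grapheme segmentation, so a third-party crate would be needed for that.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}
```

Here `scalar_count("\u{EB}")` is 1 (precomposed ë), `scalar_count("e\u{308}")` is 2 (e plus combining diaeresis), and `scalar_count("e\u{308}\u{301}")` is 3, even though each string would typically be rendered as a single user-perceived character.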

It's unclear what practical goal you are trying to accomplish (XY problem), but there's a decent chance that iterating over scalar values is not it. But, you can do that; it's what the char..char examples do. The values you get out will tend to be somewhat related to their neighbors, but there's no guarantee they carry any inherent meaning. The values in sequence won't define any meaningful grapheme clusters except by coincidence, for example.

Unicode is complicated, and there may not be a straightforward way to achieve your practical goal in a way that works for all languages and scripts.
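For the three hex ranges from the question, iterating over scalar values can be sketched with plain `char` ranges. The helper names `count_scalars` and `collect_range` are invented for this example; the only assumption is that "each character" means "each scalar value", not each grapheme cluster:

```rust
/// Number of Unicode scalar values in the inclusive range `start..=end`.
/// Ranges of `char` skip the surrogate gap (U+D800..=U+DFFF) on their own.
fn count_scalars(start: char, end: char) -> usize {
    (start..=end).count()
}

/// Collects the scalar values of `start..=end` into a `String`.
fn collect_range(start: char, end: char) -> String {
    (start..=end).collect()
}
```

With these, `collect_range('ä', 'ë')` produces `"äåæçèéêë"`, and `count_scalars('\u{20}', '\u{D7FF}')`, `count_scalars('\u{E000}', '\u{FFFD}')` and `count_scalars('\u{10000}', '\u{10FFFF}')` cover the three ranges. Many of the values produced this way are unassigned or are combining marks, so they are not meaningful "characters" on their own.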

scottmcm · May 10, 2023, 8:44pm

I think this hints at the real answer: you don't actually want "ranges", you want some property.

For example, maybe you never actually wanted 'A'..='Z', you wanted "uppercase letters", which is a general category in Unicode: https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values.

So for example in regex https://docs.rs/regex/latest/regex/#matching-one-character you might use \p{Uppercase_Letter} as a regex instead of [A-Z].
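The same idea can be sketched without the regex crate by filtering on a property check from the standard library. One caveat: `char::is_uppercase` tests the Unicode `Uppercase` derived property, which is close to but not identical to the `Uppercase_Letter` general category, so treat this as an approximation. The helper name `uppercase_in` is hypothetical:

```rust
/// Returns the scalar values in `start..=end` that the standard library
/// considers uppercase (Unicode `Uppercase` property, not exactly the
/// `Uppercase_Letter` general category).
fn uppercase_in(start: char, end: char) -> Vec<char> {
    (start..=end).filter(|c| c.is_uppercase()).collect()
}
```

For example, `uppercase_in('\u{C0}', '\u{FF}')` keeps letters such as Ä and Þ but drops × and ß, which a plain range over the same code points would include.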

system · August 8, 2023, 8:45pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.
