Is there another way of indexing a String rather than converting it to bytes?

Since you provided this example code and there’s a lot of discussion about Unicode here now... In case you’re wondering: UTF-8 is designed so that every byte equal to b'H' really does stand for the codepoint 'H' in your String. This means that this code “works” in the sense that it won’t misdetect other codepoints. It is also probably slightly more efficient than searching by char. Having an index in bytes allows the index to be used for slicing or indexing back into the string in constant time. If you wrote the same with chars() and enumerate(), you’d get rather useless char indices. There’s also the char_indices() method to iterate through all chars in a String / str together with their byte indices. (So you might use that if you ever wanted to write the same code for something other than 'H' that is not in the ASCII range.)
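To make the difference concrete, here is a small sketch of both approaches. The helper names `byte_positions` and `char_positions` are made up for illustration and are not from the original code; both return byte offsets, which can be used directly for slicing.

```rust
/// Byte offsets of every occurrence of the ASCII byte `needle` in `s`.
fn byte_positions(s: &str, needle: u8) -> Vec<usize> {
    s.bytes()
        .enumerate()
        .filter(|&(_, b)| b == needle)
        .map(|(i, _)| i)
        .collect()
}

/// Byte offsets of every occurrence of `needle`, which may be non-ASCII.
fn char_positions(s: &str, needle: char) -> Vec<usize> {
    s.char_indices()
        .filter(|&(_, c)| c == needle)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let s = "éH, Hello";

    // 'é' takes two bytes in UTF-8, so the first 'H' sits at byte offset 2.
    let hits = byte_positions(s, b'H');
    println!("{:?}", hits);

    // Byte offsets are valid slice boundaries, so slicing is cheap.
    for &i in &hits {
        println!("{}", &s[i..]);
    }

    // For a non-ASCII needle, char_indices() gives byte offsets as well.
    println!("{:?}", char_positions(s, 'é'));
}
```

With chars().enumerate() the first 'H' would get index 1, and `&s[1..]` would panic because byte 1 is in the middle of 'é'.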

Speaking about codepoints, note that your code can also find “characters” that are not “H”. Letters with accents for example can be represented either by special code-points if they exist, like “Ḧ” (U+1E26), which your code won’t find, or with combining accents, like “Ḧ” (U+0048 “H” followed by U+0308 “◌̈”; i.e. the letter H followed by a combining diaeresis), consisting of two code-points where your code would find the first one. One problem is: these two ways of representing 'Ḧ' look identical and are also supposed to be treated identically (someone else already mentioned unicode equivalence above). For example my browser (Google Chrome) does highlight both versions of Ḧ above when I CTRL+F search for “H”.
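A quick sketch of the two representations, written with escapes so the difference is visible in the source. `contains_ascii_h` is a hypothetical helper standing in for the byte search; it uses only the standard library, which does not do Unicode normalization.

```rust
/// True if any byte of `s` equals b'H' (stand-in for the byte search).
fn contains_ascii_h(s: &str) -> bool {
    s.bytes().any(|b| b == b'H')
}

fn main() {
    let precomposed = "\u{1E26}"; // single codepoint for H with diaeresis
    let combining = "H\u{0308}"; // 'H' followed by a combining diaeresis

    // The precomposed form contains no 0x48 byte, so the search misses it.
    println!("{}", contains_ascii_h(precomposed)); // false
    println!("{}", contains_ascii_h(combining)); // true

    // One codepoint versus two.
    println!("{}", precomposed.chars().count()); // 1
    println!("{}", combining.chars().count()); // 2

    // Without normalization, the two strings do not compare equal.
    println!("{}", precomposed == combining); // false
}
```

Treating the two as equal would need a normalization step first, which as far as I know means pulling in an external crate.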

2 Likes

This is frighteningly true. I'm sure that for years now you have seen web pages all over the place where the text is corrupted, with all kinds of rectangles and junk printed where readable characters are expected.

Why? Because of missing or erroneous Unicode support.

If you want your text to be uncorrupted as it travels around the net, in and out of all kinds of format conversions and so on, do not use String. Use ASCII in Vec<u8>.
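A minimal sketch of what that could look like, assuming the ASCII text is kept as plain bytes in a `Vec<u8>` (that element type is my assumption). `to_ascii_bytes` is a made-up helper name; it simply refuses any input that is not pure ASCII.

```rust
/// Returns the bytes of `s` if it is pure ASCII, otherwise None.
fn to_ascii_bytes(s: &str) -> Option<Vec<u8>> {
    if s.is_ascii() {
        Some(s.as_bytes().to_vec())
    } else {
        None
    }
}

fn main() {
    // Pure ASCII input is accepted, one byte per character.
    println!("{:?}", to_ascii_bytes("Hello"));

    // Anything outside ASCII is rejected rather than silently mangled.
    println!("{:?}", to_ascii_bytes("naïve"));
}
```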

2 Likes
s.chars().collect();

Does not do what I might want. For example:

    let zalgo = "Ḧ̷̛̰̦̞̟̳̫́͋̏̏͛͑́̈͊͆̀͗̎͝͝e̵͍͔̟͈̖͙͙͉̲̓̊̈́ĺ̴̮͕̩͙̼̈́͊l̴̨̹̭̠̲̱̹̩̟̫̟̊̀̓͌͘ǫ̴͚̫̰̥͉̙̝͔̣̘̽̔͐̿͊͊͋̍̐̎̄̽̓̏̀ ̴̼̼͚̐̀̔̽́̄͘̚͝ẁ̷͔̒ơ̵̧̢̛̥̞̼̰̠̘͈̯̫̫̦̣͙͇̆̈́̈́̾̄̾̇͊̆̚r̶̺͇̱̟̻̭͍̠͉̬͈̙̰͍̭͍͕͎͂̋̋̋͊̽̂̚ļ̷̢̨̗̘͎̰̙͎͔̜̫̱̻͓̫̝̍͂̒̂̽̆̇̆̊͠d̷͇̼̞̝̯̤̝͕̟͉̎͂̈́͝͠!̴̗̮͚̗͎͗̊̂";


    let s: String = zalgo.to_string();
    println!("{}", s);

    let v: Vec<char> = s.chars().collect();

    for c in v {
        println!("{}", c);
    }

Outputs:

H
̷
̈́
͋
͝
̏
̏
...

So I did not end up with the 12 characters I might have expected.

1 Like

This strategy only works for English text, which is why most programming languages can get away with it for their source code. Other Latin-alphabet languages are probably intelligible without their diacritic marks, but this strategy is wholly unworkable for any other language.

2 Likes

Some also have official rules for spelling in situations where diacritics are not available. For example, German can be written without diacritics and special letters. You replace ä with ae, ö with oe, etc., Ä with Ae or AE (depending on whether the whole word is in capitals or just the first letter), etc., and ß with ss.

In the case of German the reason for such rules’ existence is that the diacritics and special characters are all historically ligatures (but nowadays they are considered distinct letters), and their replacements are just the parts the ligatures are made up from.
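As a rough sketch, the replacement rules could be written like this. `transliterate_german` is a hypothetical helper, and it simplifies two things: uppercase umlauts always become "Ae"/"Oe"/"Ue" (it does not check for all-caps words), and it only recognizes the precomposed letters, not a base letter followed by a combining diaeresis.

```rust
/// Replaces German umlauts and ß with ASCII spellings.
/// Simplified: no all-caps handling, precomposed letters only.
fn transliterate_german(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            'ä' => out.push_str("ae"),
            'ö' => out.push_str("oe"),
            'ü' => out.push_str("ue"),
            'Ä' => out.push_str("Ae"),
            'Ö' => out.push_str("Oe"),
            'Ü' => out.push_str("Ue"),
            'ß' => out.push_str("ss"),
            other => out.push(other),
        }
    }
    out
}

fn main() {
    println!("{}", transliterate_german("Österreich"));
    println!("{}", transliterate_german("Straße"));
}
```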

1 Like

Right; because char represents a codepoint and not a grapheme (unfortunate historical naming, here). You’ll need some external crate to find the grapheme boundaries, and then you’d probably store them in a Vec<&str> or something. That’s assuming, of course, that the Unicode grapheme definitions have some reasonable semantic meaning for the language the text is written in.
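To show the shape of the result without an external crate, here is a std-only approximation. It is not real Unicode grapheme segmentation: `is_combining_mark` and `approx_clusters` are made-up helpers that only attach codepoints from the Combining Diacritical Marks block (U+0300 to U+036F) to the preceding char. A proper implementation should come from a crate that follows the Unicode segmentation rules.

```rust
/// Rough check: only the Combining Diacritical Marks block.
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Splits `s` into slices, each one char plus the combining marks after it.
/// An approximation only, not Unicode grapheme cluster segmentation.
fn approx_clusters(s: &str) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut start = 0;
    for (i, c) in s.char_indices() {
        // A non-combining char starts a new cluster (except at offset 0).
        if i > start && !is_combining_mark(c) {
            clusters.push(&s[start..i]);
            start = i;
        }
    }
    if start < s.len() {
        clusters.push(&s[start..]);
    }
    clusters
}

fn main() {
    // 'H' + combining diaeresis, 'e' + combining acute, then 'y'.
    let s = "H\u{0308}e\u{0301}y";

    println!("{}", s.chars().count()); // 5 codepoints
    let clusters = approx_clusters(s);
    println!("{}", clusters.len()); // 3 clusters
    for cluster in &clusters {
        println!("{}", cluster);
    }
}
```

The slices borrow from the original string, so the result fits the Vec<&str> idea without copying any text.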

Very true. But I did specify ASCII, so only American text then :slight_smile:

Even English has some oddities in it. See:

$ tail -n118 /usr/share/dict/british-english-insane
Österreich
Österreich's
Übermensch
Übermensch's
Übermenschen
Übermenschen's
åsar
ébauche
éboulement
éboulement's
éboulements
ébrillade
ébrillade's
ébrillades
ébéniste
ébénistes
écarté
écarté's
écartés
échappé
échappé's
échappés
éclair
éclair's
éclaircissement
éclaircissement's
éclairs
éclat
éclat's
éclats
écorché
écorché's
écorchés
écossaise
écossaise's
écossaises
écraseur
écraseur's
écraseurs
écritoire
écritoires
écuelle
écuelle's
écuelles
écurie
écurie's
écuries
égarement
élan
élan's
éloge
éloge's
éloges
émeute
émeute's
émeutes
émigré
émigré's
émigrés
éolienne
épatant
éperdu
éperdue
épicier
épicier's
épiciers
épris
éprise
éprouvette
éprouvette's
éprouvettes
épuisé
épuisée
épée
épée's
épées
équipe
équipe's
équipes
étage
étage's
étages
étagère
étagère's
étagères
étalage
étalages
étape
étape's
étapes
état
étoile
étoile's
étoiles
étourderie
étourdi
étourdie
étrangèr
étrangèr's
étrangère
étrangère's
étrangères
étrangèrs
étrennes
étrenness
étrier
étrier's
étriers
étude
étude's
études
étui
étui's
étuis
évolué
évolués
événement
événements
1 Like

ASCII is not even sufficient for American English, unless you are in control of the input. It would be naïve to assume so. :wink:

12 Likes

Out of curiosity I made a playground to print the grapheme clusters of zalgo.

EDIT: I copied the zalgo from the forum into the playground on Android, and it also looks wrong in my source code, which won't help.

It isn't pretty! :frowning:

I am on mobile though, let me know if you also see a mess of rectangles (or maybe this is to be expected?)

1 Like

They are all getting printed on new lines and overlapping vertically into a mess.

You can put them all on one line with some space between: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=8e979c34a6335a8bce4501772f7dc1a3
