Since you provided this example code and there’s a lot of discussion about Unicode here now... In case you’re wondering: UTF-8 is designed so that every byte equal to b'H' really does stand for the codepoint 'H' in your String, because bytes in the ASCII range never appear inside a multi-byte sequence. This means that this code “works” in the sense that it won’t misdetect other codepoints. It is also probably slightly more efficient than searching by char. Having an index in bytes allows the index to be used for slicing or indexing back into the string in constant time. If you wrote the same with chars() and enumerate(), you’d get rather useless char indices. There’s also the char_indices() method, which iterates through all chars in a String / str together with their byte indices. (So you might use that if you ever wanted to write the same code for something other than 'H' that is not in the ASCII range.)
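To make the difference between the three approaches concrete, here is a minimal sketch. I don't know exactly what your code looks like, so the function names (first_h_byte_index, first_char_byte_index, first_char_position) and the sample string are just made up for illustration; only the std methods bytes(), chars(), char_indices() and position() are real.

```rust
/// Byte index of the first b'H', found by scanning the raw UTF-8 bytes.
/// (Hypothetical helper, standing in for the byte-based search discussed above.)
fn first_h_byte_index(s: &str) -> Option<usize> {
    s.bytes().position(|b| b == b'H')
}

/// Byte index of the first occurrence of `target`, via char_indices().
/// Unlike the byte scan, this also works for non-ASCII targets.
fn first_char_byte_index(s: &str, target: char) -> Option<usize> {
    s.char_indices().find(|&(_, c)| c == target).map(|(i, _)| i)
}

/// Position counted in chars (chars() + position(), equivalent to
/// chars().enumerate()). This is NOT a valid slicing index in general.
fn first_char_position(s: &str, target: char) -> Option<usize> {
    s.chars().position(|c| c == target)
}

fn main() {
    // 'ä', 'ö' and 'ü' take 2 bytes each in UTF-8, the space takes 1.
    let s = "äöü Hello";

    let byte_idx = first_h_byte_index(s).unwrap();
    assert_eq!(byte_idx, 7);
    // The byte index slices straight back into the string.
    assert_eq!(&s[byte_idx..], "Hello");

    // char_indices() yields the same byte index...
    assert_eq!(first_char_byte_index(s, 'H'), Some(7));
    // ...and also handles targets outside the ASCII range.
    assert_eq!(first_char_byte_index(s, 'ü'), Some(4));

    // The char position is 4, which as a slice index points at the wrong place.
    assert_eq!(first_char_position(s, 'H'), Some(4));
    assert_eq!(&s[4..], "ü Hello");
}
```

Note that slicing with the char position only happens to not panic here because byte 4 is a char boundary; with other inputs it could land in the middle of a multi-byte sequence and panic.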
Speaking of codepoints, note that your code can also find “characters” that are not “H”. Accented letters, for example, can be represented either by a dedicated codepoint if one exists, like “Ḧ” (U+1E26), which your code won’t find, or with combining accents, like “Ḧ” (U+0048 “H” followed by U+0308 “◌̈”; i.e. the letter H followed by a combining diaeresis), which consists of two codepoints, the first of which your code would find. One problem is that these two ways of representing 'Ḧ' look identical and are also supposed to be treated identically (someone else already mentioned Unicode equivalence above). For example, my browser (Google Chrome) highlights both versions of Ḧ above when I Ctrl+F search for “H”.
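Here is a small sketch of the two representations, using only std. The helper first_h_byte_index is again a made-up stand-in for your byte search, and the strings are written with escapes so the codepoints are unambiguous.

```rust
/// Hypothetical stand-in for the byte-based search for 'H'.
fn first_h_byte_index(s: &str) -> Option<usize> {
    s.bytes().position(|b| b == b'H')
}

fn main() {
    // Precomposed form: the single codepoint U+1E26.
    let precomposed = "\u{1E26}";
    // Decomposed form: U+0048 'H' followed by U+0308 (combining diaeresis).
    let decomposed = "H\u{0308}";

    // They render the same, but str comparison is byte-wise,
    // so Rust considers them different strings.
    assert_ne!(precomposed, decomposed);

    // One codepoint versus two.
    assert_eq!(precomposed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);

    // The byte search misses the precomposed form entirely...
    assert_eq!(first_h_byte_index(precomposed), None);
    // ...but matches the base letter of the decomposed form.
    assert_eq!(first_h_byte_index(decomposed), Some(0));
}
```

If you wanted both forms to be treated the same, you would have to normalize the strings first. The standard library doesn’t do that for you, so it would need an external crate.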