Is there another way of indexing a String rather than converting it to bytes?

I am on this page: The Slice Type - The Rust Programming Language

And this is my code:

fn main()
{
    let x = String::from("He");
    let y = x.as_bytes();

    for (i, &items) in y.iter().enumerate()
    {
        if items == b'H'
        {
            println!("Found H at index {}", i);
        }
    }
}

In variable y I have converted x to bytes. In my understanding, as_bytes() is a function that converts the String into an array of characters and then converts those into u8. Am I correct?

So is there another way to just index the String without having to convert it to anything else? And additionally is it possible to instead convert it to a character?

String is just a wrapper around Vec<u8>, i.e. there is no conversion when calling as_bytes(). If you need the characters, call chars() instead.

To iterate over the bytes of a String, rather than going the route of calling as_bytes() and iter() and dereferencing the items, you can directly call str::bytes() to get an iterator of u8 values. Note in particular that any method on str can also be called on String due to the auto-deref steps in method resolution.
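As a rough sketch of both suggestions, here is the search from the question written with bytes() and with char_indices(). The helper names byte_positions and char_positions are made up for this illustration; only bytes(), char_indices() and the iterator adapters come from the standard library.

```rust
/// Byte offsets at which the byte `target` occurs in `s`.
fn byte_positions(s: &str, target: u8) -> Vec<usize> {
    s.bytes()
        .enumerate()
        .filter(|&(_, b)| b == target)
        .map(|(i, _)| i)
        .collect()
}

/// Byte offsets at which the char `target` starts in `s`.
fn char_positions(s: &str, target: char) -> Vec<usize> {
    s.char_indices()
        .filter(|&(_, c)| c == target)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let x = String::from("He");
    // &String derefs to &str, so both helpers accept &x.
    println!("Found H at byte offsets {:?}", byte_positions(&x, b'H'));
    println!("Found H at char offsets {:?}", char_positions(&x, 'H'));
}
```

Note that char_indices() still reports byte offsets, not "the Nth character", so the two helpers agree whenever the target is an ASCII character.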

Oh then in that case what does as_bytes() do?

Short answer: No.

Long answer: What do you expect to get when you index a String?

Be aware that Strings are a sequence of bytes that represent Unicode characters in UTF-8. As such, characters are of varying length, and therefore it is not easy to find the Nth character in a String.

Also be aware that in Unicode, what we might think of as a single character, as printed on the page, may be composed of many Unicode code points. This happens, for example, when composing characters in many languages other than English. That makes it even harder to find where a particular character may be in a String.

Perhaps the unicode-segmentation crate will do what you want. It chops a String up into individual characters or words correctly according to the Unicode rules. https://crates.io/crates/unicode-segmentation.

Or, take the easy way out. Assume your program will only ever work in English and the ASCII char set is enough. Then you can index the string as bytes.
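If you do take the ASCII route, it may be worth checking the assumption instead of trusting it. A minimal sketch, where ascii_char_at is a hypothetical helper and not something the standard library provides:

```rust
/// Returns the ASCII character at byte index `i`, or `None` if `s`
/// contains any non-ASCII byte or `i` is out of range.
fn ascii_char_at(s: &str, i: usize) -> Option<char> {
    if !s.is_ascii() {
        // The "one byte per character" assumption does not hold.
        return None;
    }
    s.as_bytes().get(i).map(|&b| b as char)
}

fn main() {
    println!("{:?}", ascii_char_at("He", 1));
    println!("{:?}", ascii_char_at("He", 2));
    println!("{:?}", ascii_char_at("H\u{e9}", 0));
}
```

Returning an Option here is just one possible design. The point is that byte indexing is only meaningful once is_ascii() has confirmed the input.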

So you are saying some character takes up more than one byte? I thought utf-8 only takes up 8 bits?

UTF-8 is a variable-width encoding that works with 8-bit chunks. It’s backwards-compatible with 7-bit ASCII, and uses the 8th bit to signal the presence of a multibyte character that isn’t present in ASCII.
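One way to see the high bit at work is to print the encoded bytes of a few characters in binary. This is only a sketch: utf8_bits is a name invented for this example, while char::encode_utf8 is the standard library method it relies on.

```rust
/// Encodes `c` as UTF-8 and renders each byte as an 8-digit binary string.
fn utf8_bits(c: char) -> Vec<String> {
    let mut buf = [0u8; 4]; // a char never needs more than 4 bytes in UTF-8
    let encoded: &str = c.encode_utf8(&mut buf);
    encoded.bytes().map(|b| format!("{:08b}", b)).collect()
}

fn main() {
    // 'A' is ASCII, U+00E9 is e with acute accent, U+20AC is the euro sign.
    for c in ['A', '\u{e9}', '\u{20ac}'] {
        println!("U+{:04X}: {:?}", c as u32, utf8_bits(c));
    }
}
```

The ASCII byte starts with 0, the first byte of a multibyte sequence starts with 110 or 1110 (or 11110 for four bytes), and every continuation byte starts with 10.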

It simply returns the underlying buffer as a slice.

It gives immutable access to the underlying u8 buffer. This is as much access as a String could reasonably give to its buffer in a safe way: mutation through a &mut [u8] would be off-limits because the user of such a mutable reference could break the invariant that a String always contains valid UTF-8.
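To illustrate both points, the sketch below checks that as_bytes() hands back the very same memory, and that building a String from arbitrary bytes is refused when they are not valid UTF-8. The helpers shares_buffer and is_valid_string are hypothetical names for this example.

```rust
/// True if `as_bytes()` exposes the same memory the `str` already points at.
fn shares_buffer(s: &str) -> bool {
    let bytes: &[u8] = s.as_bytes();
    std::ptr::eq(s.as_ptr(), bytes.as_ptr()) && s.len() == bytes.len()
}

/// True if `bytes` can become a `String`, i.e. they are valid UTF-8.
fn is_valid_string(bytes: Vec<u8>) -> bool {
    String::from_utf8(bytes).is_ok()
}

fn main() {
    let x = String::from("He");
    println!("same buffer: {}", shares_buffer(&x));
    println!("[0x48, 0x65] valid: {}", is_valid_string(vec![0x48, 0x65]));
    // 0xff can never appear in well-formed UTF-8.
    println!("[0xff] valid: {}", is_valid_string(vec![0xff]));
}
```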

Yep, likely most characters in most languages take up more than one byte in UTF-8.

UTF-8 is one means of encoding Unicode. It does it in a cunning way, such that original ASCII characters can be represented in a single byte. Other characters in other languages will require 2, 3 or even 4 bytes. UTF-8 - Wikipedia

Other Unicode encodings use 16- or 32-bit units, which is not efficient for storing the many languages that are basically extended ASCII.

Even a 16-bit encoding will require two or more 16-bit words for many characters.

That's not even what he is talking about – UTF-8 is a variable-width encoding on several levels.

– What the user might see as a single "character" is called an (extended) grapheme cluster. A grapheme cluster is potentially composed of more than one code point. Grapheme clusters, because they have potentially unbounded length, need to be represented as full-fledged strings themselves. The unicode_segmentation crate allows one to iterate over extended grapheme clusters.
– A code point is a single thing that is uniquely represented by its number, e.g. U+000A is the newline \n. A code point is an abstract entity just like a natural number. It can be represented in memory in several possible ways. Incidentally, what Rust calls char is a code point. This name might be misleading, but the rationale behind this is that char needs to be a primitive, fixed-sized, simple type.
– UTF-8 is a format for encoding code points into bytes. It uses 1 to 4 bytes (called code units), depending on the numerical value of the code point, to encode the U+… number. Thus, one code point can be represented by 1 or more bytes. The str::as_bytes() method returns a view over a buffer of a sequence of such bytes.
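The two lower levels can be counted with the standard library alone. The following is a sketch with a made-up helper, byte_and_char_counts; counting grapheme clusters would need an external crate and is left out here.

```rust
/// Returns (number of UTF-8 bytes, number of code points) for `s`.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // A Devanagari word spelled out as code points:
    // U+0928 U+092E U+0938 U+094D U+0924 U+0947.
    let word = "\u{928}\u{92e}\u{938}\u{94d}\u{924}\u{947}";
    let (bytes, chars) = byte_and_char_counts(word);
    println!("{} bytes, {} code points", bytes, chars);

    let (bytes, chars) = byte_and_char_counts("He");
    println!("{} bytes, {} code points", bytes, chars);
}
```

Each of those Devanagari code points takes 3 bytes, so the byte count is three times the char count, and a reader would still see fewer "letters" than there are chars.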

There’s never more than two 16-bit words in UTF-16.

String holds an array of u8 internally and enforces some additional structure over the contents of that array (that it is a valid UTF-8 sequence). That array isn’t converted to chars until they’re needed for printing or some other operation, because a Vec<char> uses 4 bytes per element and so can require up to 4 times as much memory as the encoded version.

If you really want one, you can get one of those by calling let v: Vec<char> = s.chars().collect(), but it’s almost always cheaper to work with the operations provided by String and &str.
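For a feel of the cost, this sketch compares the UTF-8 length with the payload of the collected Vec<char>. storage_sizes is a hypothetical helper, and it ignores the Vec's own bookkeeping and any spare capacity.

```rust
use std::mem::size_of;

/// Returns (bytes used by the UTF-8 text, bytes used by the same text
/// stored as `char` elements).
fn storage_sizes(s: &str) -> (usize, usize) {
    let as_chars: Vec<char> = s.chars().collect();
    (s.len(), as_chars.len() * size_of::<char>())
}

fn main() {
    println!("{:?}", storage_sizes("hello"));
    // U+1F600 already needs 4 bytes in UTF-8, so nothing is lost here.
    println!("{:?}", storage_sizes("\u{1F600}"));
}
```

ASCII text is the worst case for Vec<char>, while text made of 4-byte code points breaks even.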

Are you sure?

How would you represent this character: "ơ̶̢̡͍͎̮̱̬̰̣̞̭̟̞̈̎̓͂̃̇̈̈́̍̇̇͒̈́̽͠͝͝͠" in only two 16 bit words?
.
.
In utf-8 it is the byte sequence:

c3, b6, cc, b6, cc, 8e, cd, 83, cd, 82, cc, 83, cc, 87, cc, 88,
cd, 84, cc, 8d, cc, 87, cd, a0, cc, 87, cd, 9d, cd, 92, cd, 9d,
cc, 9b, cd, 84, cd, a0, cc, bd, cd, 8d, cd, 8e, cc, ae, cc, b1,
cc, ac, cc, b0, cc, a3, cc, 9e, cc, a2, cc, ad, cc, 9f, cc, a1,
cc, 9e

In hex of course.
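The same effect shows up in UTF-16, which can be checked with encode_utf16(). This is a small sketch: utf16_units is a name invented here, and the sample string uses only the first three combining marks from a listing like the one above (cc b6, cc 8e and cd 83 decode to U+0336, U+030E and U+0343) on a plain "o".

```rust
/// Number of 16-bit code units needed to encode `s` as UTF-16.
fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

fn main() {
    // A single code point outside the Basic Multilingual Plane:
    // one surrogate pair, i.e. two 16-bit words.
    println!("{}", utf16_units("\u{1F600}"));

    // One visible "letter" built from a base letter plus three
    // combining marks: one 16-bit word per code point.
    println!("{}", utf16_units("o\u{336}\u{30e}\u{343}"));
}
```

So the limit of two 16-bit words holds per code point, but not per thing the reader sees as one character.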

Others have provided details, just to sum up:

  • you can iterate over bytes of a String which is encoded as utf-8
  • you can decode utf-8 bytes into unicode codepoints that will give you chars (e.g. U+0123)
  • one or more chars/codepoints can result in what we might call a "letter" that might be printed on a screen. The same letter can be composed using different combinations of codepoints, so it's important to use the same normalisation if you want to compare strings using bytes/codepoints. See Unicode equivalence for details.
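The last point can be seen with two spellings of the same accented letter. In this sketch code_points is a hypothetical helper; actual normalisation is not part of the standard library and would need an external crate.

```rust
/// Lists the code points of `s` in the usual U+XXXX notation.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    let precomposed = "\u{e9}"; // e with acute accent as one code point
    let combined = "e\u{301}"; // plain e followed by a combining acute accent

    println!("{:?} {:?}", code_points(precomposed), precomposed.as_bytes());
    println!("{:?} {:?}", code_points(combined), combined.as_bytes());

    // Both usually render as the same letter, yet they do not compare equal.
    println!("equal: {}", precomposed == combined);
}
```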

Ah, damn you used the c-word*.

* “character”

Because of examples like this, I tend to think of strings more as an extremely-lossy image format than any sort of collection. At best, they’re a sequence of images separated by well-defined values, which is the view you get from the split_*() family of methods.

How did you do this?

https://lingojam.com/ZalgoText

You can use as many combining characters as you like to "stack up" diacritical marks, etc.