Is there another way of indexing a String rather than converting it to bytes?

I am on this page https://doc.rust-lang.org/book/ch04-03-slices.html

And this is my code:

fn main() {
    let x = String::from("He");
    let y = x.as_bytes();

    for (i, &items) in y.iter().enumerate() {
        if items == b'H' {
            println!("Found H at index {}", i);
        }
    }
}

In variable y I have converted x to bytes. My understanding is that as_bytes() is a function that converts the String into an array of characters and then converts those into u8. Am I correct?

So is there another way to just index the String without having to convert it to anything else? And is it possible to convert it to characters instead?

String is just a wrapper around Vec<u8>, so calling as_bytes() involves no conversion. If you need the characters, call chars() instead.
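A minimal sketch of the chars() route, assuming you want the position of each match. char_positions is a hypothetical helper name for illustration, not something from std. Note that char_indices() reports byte offsets, not character counts.

```rust
/// Hypothetical helper: byte offset of every occurrence of `target` in `s`.
fn char_positions(s: &str, target: char) -> Vec<usize> {
    s.char_indices()
        .filter(|&(_, c)| c == target)
        .map(|(offset, _)| offset)
        .collect()
}

fn main() {
    let x = String::from("He");
    for offset in char_positions(&x, 'H') {
        println!("Found H at byte offset {}", offset);
    }
    // `chars().nth(n)` walks the string from the start to reach the n-th char,
    // so it is O(n) rather than a constant-time index.
    println!("{:?}", x.chars().nth(1));
}
```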

1 Like

To iterate over the bytes of a String, rather than going the route of calling as_bytes() and iter() and dereferencing the items, you can directly call str::bytes() to get an iterator of u8 values. Note in particular that any method on str can also be called on String due to the auto-deref steps in method resolution.
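For illustration, the loop from the question could be written with bytes() roughly like this. byte_positions is a hypothetical helper name, added only so the result can be checked.

```rust
/// Hypothetical helper: index of every byte in `s` equal to `target`.
fn byte_positions(s: &str, target: u8) -> Vec<usize> {
    s.bytes()
        .enumerate()
        .filter(|&(_, b)| b == target)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let x = String::from("He");
    // `bytes()` is defined on `str`; calling it on a `String` works
    // through auto-deref. It yields `u8` values, so no `&` pattern is needed.
    for (i, b) in x.bytes().enumerate() {
        if b == b'H' {
            println!("Found H at index {}", i);
        }
    }
    println!("{:?}", byte_positions(&x, b'H'));
}
```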

2 Likes

Oh then in that case what does as_bytes() do?

Short answer: No.

Long answer: What do you expect to get when you index a String?

Be aware that a String is a sequence of bytes that represents Unicode characters in UTF-8. As such, characters are of varying length, and therefore it is not easy to find the Nth character in a String.

Also be aware that in Unicode, what we might think of as a single character, as printed on the page, may be composed of many Unicode code points. This happens, for example, when composing characters in many languages other than English. That makes it even harder to find where a particular character may be in a String.

Perhaps the unicode-segmentation crate will do what you want. It chops a String up into individual characters or words correctly according to the Unicode rules. https://crates.io/crates/unicode-segmentation.

Or, take the easy way out. Assume your program will only ever work in English and the ASCII char set is enough. Then you can index the string as bytes.
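A rough sketch of that "easy way out", assuming you want it to fail safely instead of returning garbage on non-ASCII input. ascii_char_at is a hypothetical helper, not a std function.

```rust
/// Hypothetical helper: the n-th character of `s`, but only when `s` is
/// pure ASCII, where one byte is exactly one character.
fn ascii_char_at(s: &str, n: usize) -> Option<char> {
    if !s.is_ascii() {
        return None; // byte index and character index would not line up
    }
    s.as_bytes().get(n).map(|&b| b as char)
}

fn main() {
    let x = String::from("He");
    println!("{:?}", ascii_char_at(&x, 1));
    println!("{:?}", ascii_char_at("Hé", 1));
}
```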

3 Likes

So you are saying some characters take up more than one byte? I thought UTF-8 only takes up 8 bits?

UTF-8 is a variable-width encoding that works with 8-bit chunks. It’s backwards-compatible with 7-bit ASCII, and uses the 8th bit to signal the presence of a multibyte character that isn’t present in ASCII.
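To see this for yourself, here is a small sketch that prints the bits of a few characters. utf8_bytes is a hypothetical helper; the sample characters are my own choice. ASCII bytes start with a 0 bit, and every byte of a multibyte sequence has the top bit set.

```rust
/// Hypothetical helper: the UTF-8 bytes of a single `char`.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // a char never needs more than 4 bytes in UTF-8
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    for c in "Hé€😀".chars() {
        let bytes = utf8_bytes(c);
        let bits: Vec<String> = bytes.iter().map(|b| format!("{:08b}", b)).collect();
        println!("U+{:04X} {} byte(s): {}", c as u32, bytes.len(), bits.join(" "));
    }
}
```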

5 Likes

It simply returns the underlying buffer as a slice.

It gives immutable access to the underlying u8 buffer. This is as much access as a String could reasonably give to its buffer in a safe way: mutation through a &mut [u8] would be off-limits because the user of such a mutable reference could break the invariant that a String always contains valid UTF-8.
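One way to convince yourself that nothing is copied or converted is to compare pointers. This is only a sketch, and shares_buffer is a hypothetical helper name.

```rust
/// Hypothetical helper: true if `as_bytes()` points at the very same
/// memory as the string itself, with the same length.
fn shares_buffer(s: &str) -> bool {
    let bytes: &[u8] = s.as_bytes();
    bytes.as_ptr() == s.as_ptr() && bytes.len() == s.len()
}

fn main() {
    let x = String::from("He");
    println!("same buffer: {}", shares_buffer(&x));
    // The slice is read-only; `x.as_bytes()[0] = b'h';` would not compile.
    println!("{:?}", x.as_bytes());
}
```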

3 Likes

Yep, likely most characters in most languages take up more than one byte in UTF-8.

UTF-8 is one means of encoding Unicode. It does it in a cunning way, such that original ASCII characters can be represented in a single byte. Other characters in other languages will require 2, 3 or even 4 bytes. https://en.wikipedia.org/wiki/UTF-8

Other Unicode encodings use 16 or 32 bits which is not efficient for storage for a lot of languages that are basically extended ASCII.

Even a 16 bit encoding will require two or more 16 bit words for many characters.

That's not even what he is talking about – UTF-8 is a variable-width encoding on several levels.

– What the user might see as a single "character" is called an (extended) grapheme cluster. A grapheme cluster is potentially composed of more than one code point. Grapheme clusters, because they have potentially unbounded length, need to be represented as full-fledged strings themselves. The unicode_segmentation crate allows one to iterate over extended grapheme clusters.
– A code point is a single thing that is uniquely represented by its number, e.g. U+000A is the newline \n. A code point is an abstract entity just like a natural number. It can be represented in memory in several possible ways. Incidentally, what Rust calls char is a code point. This name might be misleading, but the rationale behind this is that char needs to be a primitive, fixed-sized, simple type.
– UTF-8 is a format for encoding code points into bytes. It uses 1 to 4 bytes (called code units), depending on the numerical value of the code point, to encode the U+… number. Thus, one code point can be represented by 1 or more bytes. The str::as_bytes() method returns a view over a buffer of a sequence of such bytes.
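The bottom two levels can be inspected with std alone. Here is a sketch using a sample string of my own choosing; byte_and_char_counts is a hypothetical helper. Counting grapheme clusters is not shown because it needs an external crate such as unicode-segmentation.

```rust
/// Hypothetical helper: (number of UTF-8 bytes, number of code points) in `s`.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // The letter 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // it renders as a single grapheme cluster, but it is two code points.
    let s = "e\u{301}";
    let (bytes, code_points) = byte_and_char_counts(s);
    println!("{} bytes, {} code points", bytes, code_points);

    for c in s.chars() {
        println!("U+{:04X} takes {} byte(s) in UTF-8", c as u32, c.len_utf8());
    }
}
```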

5 Likes

There’s never more than two 16-bit words in UTF-16.
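A quick check, reading that limit as "per code point", which is my assumption about what is meant. utf16_units is a hypothetical helper. A whole string, including one that renders as a single glyph, can of course need more units than that.

```rust
/// Hypothetical helper: number of 16-bit code units `s` needs in UTF-16.
fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

fn main() {
    // Per code point the answer is always 1 or 2.
    for c in "Hé€😀".chars() {
        println!("U+{:04X}: {} UTF-16 unit(s)", c as u32, c.len_utf16());
    }
    // Two code points that render as one accented letter.
    println!("{} units", utf16_units("e\u{301}"));
}
```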

String holds an array of u8 internally and enforces some additional structure over the contents of that array (that it is a valid UTF-8 sequence). That array isn’t converted to chars until they’re needed for printing or some other operation, because a Vec<char> uses 4 bytes per character and so can require up to 4 times as much memory as the encoded version.

If you really want one, you can get one of those by calling let v: Vec<char> = s.chars().collect();, but it’s almost always cheaper to work with the operations provided by String and &str.
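A small sketch of the memory difference, counting only the element storage and ignoring spare capacity. sizes is a hypothetical helper name.

```rust
/// Hypothetical helper: (bytes used by the UTF-8 text,
/// bytes used by the elements of the equivalent Vec<char>).
fn sizes(s: &str) -> (usize, usize) {
    let v: Vec<char> = s.chars().collect();
    (s.len(), v.len() * std::mem::size_of::<char>())
}

fn main() {
    for s in ["He", "héllo", "😀"].iter() {
        let (utf8, as_chars) = sizes(s);
        println!("{:?}: {} bytes as UTF-8, {} bytes as Vec<char>", s, utf8, as_chars);
    }
}
```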

2 Likes

Are you sure?

How would you represent this character: "ơ̶̢̡͍͎̮̱̬̰̣̞̭̟̞̈̎̓͂̃̇̈̈́̍̇̇͒̈́̽͠͝͝͠" in only two 16 bit words?
In UTF-8 it is the byte sequence:

c3 b6 cc b6 cc 8e cd 83  cd 82 cc 83 cc 87 cc 88
cd 84 cc 8d cc 87 cd a0  cc 87 cd 9d cd 92 cd 9d
cc 9b cd 84 cd a0 cc bd  cd 8d cd 8e cc ae cc b1
cc ac cc b0 cc a3 cc 9e  cc a2 cc ad cc 9f cc a1
cc 9e

In hex of course.

1 Like

Others have provided details, just to sum up:

  • you can iterate over the bytes of a String, which is encoded as UTF-8
  • you can decode UTF-8 bytes into Unicode code points, which gives you chars (e.g. U+0123)
  • one or more chars/codepoints can result in what we might call a "letter" that might be printed on a screen. The same letter can be composed using different combinations of codepoints, so it's important to use the same normalisation if you want to compare strings using bytes/codepoints. See Unicode equivalence for details.
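
To illustrate the last point, here is a sketch with two spellings of "é" that I picked as an example. same_code_points is a hypothetical helper. std does not normalise strings, so you would need a crate such as unicode-normalization before comparing.

```rust
/// Hypothetical helper: true if both strings are the same sequence of code points.
fn same_code_points(a: &str, b: &str) -> bool {
    a.chars().eq(b.chars())
}

fn main() {
    let precomposed = "\u{e9}"; // é as a single code point
    let decomposed = "e\u{301}"; // 'e' followed by a combining acute accent

    // Both print as the same letter, but they are different bytes and
    // different code points, so a plain comparison says they differ.
    println!("{} vs {}", precomposed, decomposed);
    println!("equal as strings: {}", precomposed == decomposed);
    println!("equal as code points: {}", same_code_points(precomposed, decomposed));
}
```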
1 Like

Ah, damn you used the c-word*.

* “character”

1 Like

Because of examples like this, I tend to think of strings more as an extremely-lossy image format than any sort of collection. At best, they’re a sequence of images separated by well-defined values, which is the view you get from the split_*() family of methods.

2 Likes

How did you do this?

2 Likes

https://lingojam.com/ZalgoText

3 Likes

You can use as many combining characters as you like to "stack up" diacritical marks, etc.
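
A rough sketch of how such text can be built by hand. stack_marks is a hypothetical helper, and U+0301 is just one combining mark picked as an example. It is still a single "letter" on screen, but the char count grows with every mark.

```rust
/// Hypothetical helper: `base` followed by `n` copies of a combining `mark`.
fn stack_marks(base: char, mark: char, n: usize) -> String {
    let mut s = String::new();
    s.push(base);
    for _ in 0..n {
        s.push(mark);
    }
    s
}

fn main() {
    let s = stack_marks('o', '\u{301}', 5);
    println!("{}", s);
    println!("{} chars, {} bytes", s.chars().count(), s.len());
}
```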

2 Likes