Confusion about strings


#1

In the rust programming language book, in section 4.17, Strings, in the subsection on Slicing, it says:

But note that these are byte offsets, not character offsets. So this will fail at runtime:
let dog = "忠犬ハチ公";
let hachi = &dog[0..2];

Just to make sure I’m keeping my facts straight, and so someone can easily point out my specific misunderstandings, I’m falling back on a little formal logic. So, here’s what I already know about Unicode that is true regardless of what programming language is involved:

  • All valid utf-8 encoded strings are also sequences of bytes.
  • Some utf-8 encoded strings have multibyte codepoints.
  • Some byte sequences taken from utf-8 encoded strings are not aligned with boundaries between codepoints.
  • All byte sequences that begin at points not aligned with boundaries between codepoints are not valid utf-8 encoded strings.
  • Some byte sequences taken from a valid utf-8 encoded string are not themselves also valid utf-8 encoded strings.
  • Some byte sequences are valid utf-8 encoded strings.

If I got those wrong, something about my understanding of utf-8 is wrong…

Now, here’s what I’ve come to understand from this part of the rust book:

  • All slices of Strings are byte sequences.
  • Some slices of Strings are not valid utf-8 encoded strings.
  • All Strings are required to be valid utf-8 encoded strings.
  • Some slices of Strings are not Strings
  • Assigning a slice of a String to a variable creates a value of type String.
  • Some slices of strings cannot be assigned to variables.
  • Some concatenations of Strings with slices of strings are invalid Strings.

So, it really almost sounds like slices are treated as bytes up until you make an assignment with it, like let x = &y[n..m]; at which point it’s expected to represent a valid utf-8 encoding?
I think the understanding I’ve come to from reading this part of the book is rather implausible. It just wouldn’t make any sense for slices to work that way, so I suspect I’m wrong about something. Please tell me where I’m misunderstanding rust.

Thanks, and happy $holiday!


#2

I’m not an expert on Rust strings, but here we go.

All your points about utf-8 are correct. Regarding Rust strings, there is some confusion (for me, they were harder than the borrow checker :smile:)

A slice of a String (which has type str) is a UTF-8 encoded Unicode string. That is, str is a sequence of bytes (like [u8]) with the additional contract that it is valid UTF-8. You can create a str by slicing a String at byte indices. The slice operation checks at creation time that the result is valid UTF-8, which it does by verifying that both indices fall on code point boundaries. If they do not, a panic occurs.

So, both String and str are required to be valid UTF-8 and, if you try to violate this invariant by slicing, you get a panic at runtime. If you want to work with bytes instead of code points, you should look into the Vec<u8> and [u8] types.
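To make the byte view concrete, here is a minimal sketch using only the standard library. The helper name byte_range_is_utf8 is made up for illustration; as_bytes and std::str::from_utf8 are the real std calls it relies on.

```rust
use std::str;

/// Hypothetical helper: is the byte range `start..end` of `s` valid UTF-8 on its own?
fn byte_range_is_utf8(s: &str, start: usize, end: usize) -> bool {
    // `as_bytes` exposes the raw `&[u8]`; slicing bytes never checks UTF-8,
    // so this only panics if the range is out of bounds.
    let bytes: &[u8] = &s.as_bytes()[start..end];
    // `from_utf8` performs the validation that `str` requires.
    str::from_utf8(bytes).is_ok()
}

fn main() {
    let dog = "忠犬ハチ公";
    // Each of these five characters takes three bytes in UTF-8.
    println!("{} bytes", dog.len());
    // The first two bytes are only part of the first character.
    println!("{}", byte_range_is_utf8(dog, 0, 2)); // false
    // The first three bytes are exactly the first character.
    println!("{}", byte_range_is_utf8(dog, 0, 3)); // true
}
```

Slicing the [u8] at 0..2 is perfectly fine; it is only the conversion back to str that is rejected.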

Assignment has nothing to do with it. That is, "忠犬ハチ公"[0..2] will panic without any assignment. Assigning a slice to a variable does not create a String either: in let x = &y[n..m]; the type of x is &str.
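If you want to see the check without triggering a panic, something like the following should work. The wrapper try_slice is a hypothetical name; it just forwards to the standard str::get, which returns None where the indexing syntax would panic.

```rust
/// Hypothetical wrapper: like `&s[start..end]`, but returns `None` instead of panicking.
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let dog = "忠犬ハチ公";

    // Byte 2 is in the middle of the first character; byte 3 is where the second one starts.
    assert!(!dog.is_char_boundary(2));
    assert!(dog.is_char_boundary(3));

    println!("{:?}", try_slice(dog, 0, 2)); // None
    println!("{:?}", try_slice(dog, 0, 3)); // Some("忠")

    // The annotation shows that a slice bound to a variable is a `&str`, not a `String`.
    let first: &str = &dog[0..3];
    println!("{}", first);
}
```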

Here are reference docs for
String: https://doc.rust-lang.org/std/string/struct.String.html
str: https://doc.rust-lang.org/std/primitive.str.html


#3

Unfortunately, some parts of your perceived understanding are false.

It’s not the assignment part that’s not allowed; the error checking happens during the slice operation itself. For an expression &my_string[m .. n], the slice operation will panic if the index m or n does not fall on a UTF-8 code point boundary (or if the range is out of bounds). Thus, in safe Rust, it is not (or at least should not be) possible to create a String or a slice of a String which is not valid UTF-8.

The other part you’re missing is the difference between String and str. The former is an owned string; it wraps a Vec<u8> and is responsible for deallocating it when it falls out of scope. The latter, usually seen as &str, is merely a view into the former; it’s just a pointer and a length. It has a lifetime and cannot outlive the original value, or else it would become a dangling reference, which is not allowed in safe Rust. A slice does not become an owned String without an explicit conversion, like calling the .to_string() or .to_owned() methods, or calling .into() with an inferred target of type String.
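A small sketch of the owned/borrowed distinction, in case it helps. The function to_owned_string is a made-up name for illustration; the three conversions in main are the ones mentioned above.

```rust
/// Hypothetical helper: copies a borrowed slice into a new, independently owned `String`.
fn to_owned_string(slice: &str) -> String {
    slice.to_owned()
}

fn main() {
    let owner: String = String::from("忠犬ハチ公");

    // `view` borrows from `owner`; no bytes are copied.
    let view: &str = &owner[0..6];

    // Three explicit ways to turn the borrowed view into an owned String.
    let a: String = view.to_string();
    let b: String = view.to_owned();
    let c: String = view.into();
    assert_eq!(a, b);
    assert_eq!(b, c);

    let copy = to_owned_string(view);

    // `view` is not used past this point, so `owner` may be dropped.
    drop(owner);

    // The copy owns its own buffer and is still usable.
    println!("{}", copy);
}
```

Using view after drop(owner) would be rejected at compile time, which is the "cannot outlive the original value" rule in action.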

Thus,

  • All slices of Strings are of the str type. Due to how dynamically sized types (DSTs) work in Rust, you can’t have a bare str; it has to be behind a pointer type such as Box or &, though the latter is far more common.
  • Any slice that can be created in safe Rust without panicking is a valid slice.
  • Any concatenation of slices or Strings created without panicking in safe Rust is a new valid String.
  • Any slice or String created in safe Rust without panicking can be assigned to any lvalue, be it a variable or a struct field or what-have-you (with the exception of const or static lvalues and String, since String can only be created at runtime).
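As a rough illustration of the concatenation point, assuming a hypothetical helper named concat (the same result could be had with format! or the + operator):

```rust
/// Hypothetical helper: joins two slices into a newly allocated `String`.
fn concat(a: &str, b: &str) -> String {
    let mut out = String::with_capacity(a.len() + b.len());
    // `push_str` only accepts `&str`, so both inputs are already valid UTF-8
    // and the result is too.
    out.push_str(a);
    out.push_str(b);
    out
}

fn main() {
    let dog = "忠犬ハチ公";
    // Both ranges fall on character boundaries, so neither slice panics.
    let joined = concat(&dog[0..6], &dog[12..]);
    println!("{}", joined); // 忠犬公
}
```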

#4

Thanks for your responses, this helps a lot. Let’s see if I’m understanding it better now…

In order to create a slice (which is required to be valid utf-8 just as much as a String is), I need to know ahead of time where the byte offsets are of the relevant codepoints, or I could get a panic by creating an invalid slice. Is there an API for mapping codepoints to bytes, then? Is there something like a slice, but that I can use to access a substring by the offset of codepoints instead of bytes? I guess now I don’t really see how slices would be very useful. I think it would be nice if "忠犬ハチ公"[0..2] would access codepoints 0…2 instead of bytes 0…2. Otherwise, using it correctly would require a more verbose wrapper or something.

DroidLogician, I think you cleared a lot up for me by describing a &str as a view into another string. That makes a lot of sense in the context of knowing that the original value is the owner of the data, so the &str can’t outlive it.

So I think something else that confused me is that str is the type of the String. I think I somehow thought the convention is that types are capitalized. Maybe I understood it backwards.

Thanks again for helping me understand rust’s strings.


#5

The char_indices method on str will give you an iterator over codepoints along with their byte indices.
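For example, a codepoint-indexed substring could be sketched on top of char_indices roughly like this. The function slice_by_chars is a hypothetical helper, not something in std, and it walks the string from the start each time, so it is O(n) rather than O(1) like byte slicing.

```rust
/// Hypothetical helper: the substring covering code points `start..end`
/// (end exclusive), or `None` if the range is reversed or out of bounds.
fn slice_by_chars(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of the n-th code point; `s.len()` is appended so that
    // n == number of code points maps to the end of the string.
    let byte_offset = |n: usize| {
        s.char_indices()
            .map(|(i, _)| i)
            .chain(std::iter::once(s.len()))
            .nth(n)
    };
    let from = byte_offset(start)?;
    let to = byte_offset(end)?;
    s.get(from..to)
}

fn main() {
    let dog = "忠犬ハチ公";
    println!("{:?}", slice_by_chars(dog, 0, 2)); // Some("忠犬")
    println!("{:?}", slice_by_chars(dog, 3, 5)); // Some("チ公")
    println!("{:?}", slice_by_chars(dog, 0, 6)); // None
}
```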

Are codepoints really what you want to be indexing by? A single visual “character” can consist of an arbitrary number of codepoints. The unicode_segmentation crate provides a grapheme_indices iterator which behaves like char_indices but iterates over graphemes, which map more closely to user-perceived “characters”.
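A std-only sketch of why codepoints and perceived characters differ (it deliberately does not use the external crate; char_count is a made-up helper name):

```rust
/// Hypothetical helper: counts Unicode scalar values (Rust's `char`) in `s`.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Two encodings of what a reader sees as the same single character "é".
    let precomposed = "\u{e9}"; // one code point: LATIN SMALL LETTER E WITH ACUTE
    let decomposed = "e\u{301}"; // two code points: 'e' + COMBINING ACUTE ACCENT

    println!("{} code point(s), {} byte(s)", char_count(precomposed), precomposed.len());
    println!("{} code point(s), {} byte(s)", char_count(decomposed), decomposed.len());

    // Byte 1 is a valid code point boundary, so this slice does not panic,
    // but it separates the accent from its base letter.
    let base = &decomposed[0..1];
    println!("{}", base); // e
}
```

So even codepoint-accurate slicing can cut a perceived character in half, which is the case grapheme iteration is meant to handle.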