An explanation of &str vs String

Hey guys,

I'm just starting to learn Rust and have spent some time on str vs String vs &str and so on. I know this has been discussed a million times, but despite all the forum and Stack Overflow posts, I never felt that I understood it completely, until I recently figured out two points that had been missing (for me). Here I'd like to describe my understanding and kindly ask you to check whether it's correct.

So, let's say we have some type T other than str. Then

  • [T] is a type that represents a contiguous sequence of an unspecified number of elements of type T in memory
  • &[T] is a type that represents a reference to such a sequence
  • Vec<T> is a growable, heap-allocated sequence of T
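To make the three bullets concrete, here is a minimal sketch (the helper name sum is made up for illustration). It shows that a function taking &[i16] accepts a borrowed array, a borrowed Vec, or a sub-range of either, since all of them are just "some i16 values in a row".

```rust
/// Sums any contiguous run of i16 values, wherever it lives.
/// (`sum` is a hypothetical helper, not a standard-library function.)
fn sum(values: &[i16]) -> i32 {
    values.iter().map(|&v| i32::from(v)).sum()
}

fn main() {
    let array = [1i16, 2, 3];          // fixed-size array, length known at compile time
    let mut vector = vec![1i16, 2, 3]; // growable, heap-allocated
    vector.push(4);

    // All three arguments coerce to &[i16].
    println!("{}", sum(&array));        // prints 6
    println!("{}", sum(&vector));       // prints 10
    println!("{}", sum(&vector[1..3])); // prints 5 (elements 2 and 3)
}
```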

Now with strings there is a catch: we want them to be valid UTF-8, but UTF-8 characters can occupy different numbers of bytes (one to four) in memory. Rust has a primitive type char, but it is always 4 bytes in size. Let's imagine there is a magical type uchar that represents a valid UTF-8 character using the minimal number of bytes. Then

  • str would be [uchar]
  • &str would be &[uchar], i.e. a sequence of uchar somewhere in memory
  • String would be Vec<uchar>
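The size mismatch between char and UTF-8 text can be checked directly. This is only a small sketch; utf8_widths is a made-up helper name, and the string is written with escapes so the exact code points are unambiguous.

```rust
use std::mem::size_of;

/// Number of bytes each character of `s` occupies when encoded as UTF-8.
/// (`utf8_widths` is a hypothetical helper for this illustration.)
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(|c| c.len_utf8()).collect()
}

fn main() {
    // A char is always 4 bytes, no matter which character it holds.
    println!("{}", size_of::<char>()); // prints 4

    // 'a', U+00E9 (e with acute accent), U+20AC (euro sign)
    let s = "a\u{00E9}\u{20AC}";
    println!("{:?}", utf8_widths(s));  // prints [1, 2, 3]
    println!("{}", s.len());           // prints 6: len() counts bytes
    println!("{}", s.chars().count()); // prints 3: number of code points
}
```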

This was the first point that I was missing: why we need a separate set of types for strings to begin with. Now the second point: there's a difference in how string literals are handled.

When we write

let x = 1i16;
let y = 1i16;

what happens is that x and y are each represented in memory as 2 bytes, and each 2-byte sequence encodes 1, so x and y store the value 1 independently:

...[0][1]...[0][1]...
   ------   ------
     x         y

But when we write a similar expression for strings

let x = "Hi";
let y = "Hi";

the situation is quite different even though the syntax is the same. The sequence of bytes representing "Hi" is most likely hardcoded in the binary itself and loaded into memory when the program starts, so x and y are only references to that location in memory, which is why they are of type &str. Both x and y probably reference the same location in memory (I'm not sure about this), so only a single instance of "Hi" is actually stored.

    
...['H']['i']...
   ----------
        |
x ------|
y ------|

If we want to have them stored (and possibly updated) separately, we should be using String instead.
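A rough sketch of the difference, assuming nothing beyond the standard library (shout is a made-up helper name). Whether two identical literals end up at the same address is left to the compiler and linker, so the sketch only prints those addresses instead of asserting anything about them.

```rust
/// Makes an owned, independently mutable copy of `s` and appends '!'.
/// (`shout` is a hypothetical helper name for this sketch.)
fn shout(s: &str) -> String {
    let mut owned = String::from(s); // copies the bytes into a new heap buffer
    owned.push('!');
    owned
}

fn main() {
    let x = "Hi";
    let y = "Hi";
    // Both point into the program's static data. Whether the two addresses
    // match is not guaranteed by the language, so just print them.
    println!("{:p} {:p}", x.as_ptr(), y.as_ptr());

    let a = String::from("Hi");
    let b = String::from("Hi");
    // Two live, non-empty Strings each own a separate heap buffer.
    println!("{}", a.as_ptr() != b.as_ptr()); // prints true

    // Modifying the owned copy leaves the literal untouched.
    let louder = shout(x);
    println!("{} {}", x, louder); // prints Hi Hi!
}
```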

Please search the forum using the search tool at the top before asking questions. This question has been asked before.

Here is my answer from the last time it was asked.


As an ex-monad-tutorial-writer, I completely agree that everyone should have their own str-vs-String tutorial.


Yes, I searched the forum and read your post. There were still some questions where I wanted to make sure I had understood things correctly, which is why I created this topic. Sorry for bothering you.

Let me rephrase my questions here:

  1. When we want to create an array of, say, i16, we don't use a special name for it, we just use [i16]. But for an array of UTF-8 characters we do have a special name: str. Why is that? Is it because there is no analog of i16 to represent a single UTF-8 character, and char is fixed-size so it isn't suitable here?

  2. When we have an expression like let x = 1, it means that the compiler should reserve the required number of bytes of memory and put 1 into it. But when we have let x = "Hi", the meaning is totally different: the compiler should put "Hi" somewhere in the binary, and x only holds a reference. I understand that this is most likely because "Hi" is of type &'static str. Am I correct? Would the compiler do the same trick in the case of, say, let x: &'static [i16] = [1,2,3]?

  3. Related to question 2: if there are many variables that are initialized to the same string, let x = "Hi"; let y = "Hi", will they actually reference the same location in memory? So "Hi" is only stored once?

The uchar type you are suggesting would have to be ?Sized. Therefore, [uchar] cannot exist for the same reason that [[u8]] cannot exist:

  • When the compiler tries to allocate a Box<[T]>, it computes the size as effectively n * size_of::<T>(). But size_of::<uchar>() is not defined.
  • The stride between elements in a slice must be fixed, as it is supposed to be possible to index in O(1) time. The only possible way that impl Index for [uchar] could work is that s[i] would have to do something similar to s.chars().nth(i).unwrap(); i.e. it would have to validate all of the UTF-8 up to the ith code-point just to find where it begins.
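The indexing point can be sketched like this (nth_char is a made-up helper name). Byte access is a constant-time offset, while finding the i-th code point has to scan from the start, and slicing in the middle of a multi-byte character is rejected.

```rust
/// Returns the i-th code point of `s` by scanning from the start: O(n).
/// (`nth_char` is a hypothetical helper for this illustration.)
fn nth_char(s: &str, i: usize) -> Option<char> {
    s.chars().nth(i)
}

fn main() {
    // "h", U+00E9 (2 bytes in UTF-8), "l", "l", "o": 5 code points, 6 bytes
    let s = "h\u{00E9}llo";

    // Bytes have a fixed stride of 1, so this is a plain O(1) lookup.
    println!("{:#x}", s.as_bytes()[1]); // prints 0xc3, first byte of U+00E9

    // Code points have no fixed stride, so this walks the string.
    println!("{:?}", nth_char(s, 2) == Some('l')); // prints true

    // Byte ranges must fall on character boundaries.
    println!("{:?}", s.get(1..2).is_none()); // prints true: cuts U+00E9 in half
    println!("{:?}", s.get(1..3).is_some()); // prints true: exactly U+00E9
}
```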

Yes. This is called static promotion (more fully, rvalue static promotion). You are allowed to take a reference to a constant expression, and the compiler will automatically create a static variable initialized to that value.
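A small sketch of what that looks like in practice (primes and greeting are made-up function names). The array literal is a constant expression, so a reference to it can be handed out with the 'static lifetime, just like a string literal.

```rust
/// A reference to a constant array expression is promoted to a static,
/// so it can outlive the function call.
/// (`primes` is a hypothetical name for this sketch.)
fn primes() -> &'static [i16] {
    &[2, 3, 5]
}

/// String literals already have type &'static str.
/// (`greeting` is likewise hypothetical.)
fn greeting() -> &'static str {
    "Hi"
}

fn main() {
    // The explicit annotation is optional; it is spelled out here only
    // to show that the compiler accepts it.
    let x: &'static [i16] = &[1, 2, 3];
    println!("{:?} {:?} {}", x, primes(), greeting());
    // prints [1, 2, 3] [2, 3, 5] Hi
}
```

The same would not compile for a value built at run time, for example a reference to a local Vec, because there is no constant for the compiler to promote.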


Yes, you are more or less right. String is internally just a Vec<u8>, and str is just [u8], with some protections (as long as you stay in safe code, it is guaranteed to contain valid UTF-8) and a fancy API that lets you work with it as if it contained char elements. There are even methods as_bytes and as_bytes_mut (the latter is unsafe) on str which give you access to the underlying buffer, and into_bytes on String which gives you back the whole buffer as a Vec<u8>.
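Here is a minimal sketch of moving between the string types and raw bytes (round_trip is a made-up helper name). Going from String to bytes is free of checks, while going back has to validate the UTF-8.

```rust
use std::string::FromUtf8Error;

/// Turns a String into its raw byte buffer and back again.
/// (`round_trip` is a hypothetical helper for this sketch.)
fn round_trip(s: String) -> Result<String, FromUtf8Error> {
    let bytes: Vec<u8> = s.into_bytes(); // no copy, just hands over the buffer
    String::from_utf8(bytes)             // validates that the bytes are UTF-8
}

fn main() {
    let s = String::from("h\u{00E9}"); // "h" followed by U+00E9

    // Borrow the underlying bytes without copying.
    println!("{:x?}", s.as_bytes()); // prints [68, c3, a9]

    println!("{}", round_trip(s).is_ok()); // prints true

    // 0xFF can never appear in valid UTF-8, so the safe constructor refuses it.
    println!("{}", String::from_utf8(vec![0xFF]).is_err()); // prints true
}
```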

Also you are right about the optimization, with one exception: not let a: &'static [u32] = [1, 2, 3], because [1, 2, 3] is an array of type [T; 3], not a reference &[T]. You need to write let a = &[1, 2, 3]; the explicit lifetime annotation is unnecessary.

