What are String and str?

I am trying to read the Rust book and do the exercism exercises...

Currently, I am trying to figure out what the different string types are, how they are converted, and when to use which one.

let s: () = "hello" suggests that a string literal has type &'static str. So that is a reference to a str with static lifetime, I guess.

"hello".to_string() gives us a std::string::String the great high level type of wonderfulness. Which has an append function that is for some reason called push_str() instead of append() :roll_eyes: but I digress.

Then there is a slice "hello".to_string()[3..5] which has type str and the interesting property that it cannot be printed because "the size is not known at compile time". What is going on?

I can print it with println!("{:?}", &"hello".to_string()[3..5]); which I suppose turns it into a &str, but I don't understand why that helps. I guess something magical is happening inside the println! macro.

Then this part of the book suggests that all functions which work on strings should take &str, if I understand it correctly.
Now if I define such a function and I try to call it with a std::string::String it obviously fails. Adding a reference to a string gives me a &std::string::String as expected. What is surprising is that I can pass a &std::string::String to something that wants a &str.

So...since Strings are kind of an important data type I am pretty surprised by this "mess".
Can somebody shed some light on how I should interpret these things to get a more coherent picture?

Edit: I forgot to mention that it seems I can use String::from("hello") or "hello".to_string() to achieve the same, is that correct?

Edit 2: If it helps, I am aware that the distinction is probably similar to the one in C++ but I am still confused. :slight_smile:


Definitions

Let's start with definitions

String is a growable buffer that owns its data.
&String is a shared borrow of a growable buffer; this borrow does not own any data.
&mut String is a unique borrow of a growable buffer; this borrow does not own any data.

str is a view into some other string that is already in memory.
&str is a shared borrow of a string view.
&mut str is a unique borrow of a string view.

String literals

String literals, like "hello world" or "bob", are stored directly in the program binary, and we interact with them through a shared borrow of a string view into that data. If we look at our definitions above, that is &str. The only thing missing is the lifetime: because the literals are stored in the program binary, they have a 'static lifetime. Put all of this together and we get &'static str.

Creating a String

You can create a String using some traits that str and String implement: ToString::to_string or From<&str>. ToString is implemented for every type that implements Display, so that you can easily create a String from those types. Because str and String implement Display, you get to use ToString. String also implements From<&str> because it can be losslessly converted from a &str in an obvious way.

Less commonly, you may see people use ToOwned to create a String. str: ToOwned<Owned = String>, so you can use string_view.to_owned() to get a String.

Because String: From<&str>, you also get Into<String> for &str for free, so you can do string_view.into() to get a String. But this isn't all that great because Rust has a hard time doing type inference with the Into trait, so I would stick to to_string, to_owned, or String::from.
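
To make the options above concrete, here is a minimal sketch. The helper name make_strings is made up for illustration and is not from the thread; it simply builds the same String four ways.

```rust
// Hypothetical helper (not from the thread): builds the same String
// through each of the conversions discussed above.
fn make_strings(view: &str) -> [String; 4] {
    let a: String = view.to_string();   // ToString, available because str: Display
    let b: String = String::from(view); // From<&str> for String
    let c: String = view.to_owned();    // ToOwned<Owned = String> for str
    let d: String = view.into();        // Into<String>; relies on the type annotation
    [a, b, c, d]
}

fn main() {
    for s in make_strings("hello").iter() {
        println!("{}", s);
    }
}
```

All four produce equal, independently owned strings; which one to use is mostly a matter of taste, with into() being the one that most often needs a type annotation.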

Indexing

String and str can be indexed with ranges to produce a string view into their contents. Like so:

let string: String = String::from("Hello World");

let view: str = string[..];

Oh no, this doesn't compile, why?

Well, if we look at our definitions from before we can see that str is a view into some other data. But how big is that view? Can't be known until run-time. So str does not implement Sized. This means that it must always be behind some indirection, like a borrow (& or &mut).


let string: String = String::from("Hello World");

let view: &str = &string[..]; // fixed

Note about your println example, you added a & on your own, so it works. Not magic on the println macro's end.

Deref Coercions

Before we dive into this with String and str, let's look at a simpler example.


use std::ops::Deref;

struct Foo;
struct Bar(Foo);

impl Deref for Bar {
    type Target = Foo;
    
    fn deref(&self) -> &Foo {
        &self.0
    }
}

fn main() {
    let bar        :  Bar = Bar(Foo);
    let bar_borrow : &Bar = &bar;
    let foo_borrow : &Foo = bar_borrow;
    let foo_borrow : &Foo = &bar;
}

Witchcraft. How are we converting between types? This has to do with the so-called deref coercions. Basically, if Rust sees mismatched borrow types, it will try to apply a special coercion, which depends on the Deref and DerefMut traits. This coercion converts the borrow into a borrow of Deref::Target.
String implements Deref<Target = str>, so it can participate in these coercions.


fn main() {
    let string        :  String = String::from("Hello World");
    let string_borrow : &String = &string;
    let view_borrow   : &str    = string_borrow;
    let view_borrow   : &str    = &string;
}

Book Suggestion

The book suggests always taking &str for function arguments because it gives you better ergonomics and flexibility. I will add only one caveat: if you are going to be storing a String, then take a String to avoid unnecessary allocations.
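
A small sketch of that guideline, with made-up names (shout, Greeting) that are only for illustration: the function that merely reads the text borrows a &str, while the constructor that stores the text takes a String.

```rust
// All names here are hypothetical; none of them come from the thread.

// Only reads the text, so it borrows a string view.
fn shout(text: &str) -> String {
    text.to_uppercase()
}

struct Greeting {
    name: String,
}

impl Greeting {
    // Stores the text, so it takes ownership. A caller that already has
    // a String moves it in without a second allocation.
    fn new(name: String) -> Self {
        Greeting { name }
    }
}

fn main() {
    let owned = String::from("hello");

    println!("{}", shout("hello"));      // a literal is already a &str
    println!("{}", shout(&owned));       // &String coerces to &str
    println!("{}", shout(&owned[1..3])); // so does a sub-slice

    let greeting = Greeting::new(owned); // moved, not copied
    println!("{}", greeting.name);
}
```

If Greeting::new took a &str instead, it would have to allocate a fresh String internally even when the caller had one to give away.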

Cow<'a, str>

On the topic of functions, if you are unsure that you will allocate (due to conditional allocations or something similar), and you would like to avoid unnecessary allocations you can use the Cow<'a, str> type. This type can represent both a string view and an owned string.
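
As a rough illustration of the Cow idea, here is a sketch with a hypothetical function, underscore_spaces, that only allocates when the input actually needs to change.

```rust
use std::borrow::Cow;

// Hypothetical function (not from the thread): replaces spaces with
// underscores, allocating a new String only when there is a space.
fn underscore_spaces(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        // Something has to change, so build an owned String.
        Cow::Owned(input.replace(' ', "_"))
    } else {
        // Nothing to change: hand back the original view, no allocation.
        Cow::Borrowed(input)
    }
}

fn main() {
    let untouched = underscore_spaces("hello");
    let changed = underscore_spaces("hello world");

    // Cow<str> derefs to str, so both can be used like a &str.
    println!("{} {}", untouched, changed);
}
```

Callers that need ownership can call into_owned() on the result, which only allocates in the borrowed case.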

C++

This is similar to the distinction between string_view and string in C++, which correspond to &str and String in Rust respectively.


The reason this works is explained much later in the book:
https://doc.rust-lang.org/book/ch15-02-deref.html#implicit-deref-coercions-with-functions-and-methods


Everything @RustyYato said is great.

An addendum:

  • str is a fundamental type in the language. It has to be, because pointers to str must be fat pointers. (&str, &mut str, Box<str> etc. are all internally represented as (pointer, len) instead of just a pointer).

  • String is just a standard library type. Think of String as a "heap-allocated, growable buffer for a str". (as opposed to e.g. Box<str> which is not growable, and &str which is not directly responsible for managing a heap allocation)

  • The same distinction exists between &[T] and Vec<T>.
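
If you want to see the fat pointers for yourself, a quick sketch like the following measures pointer sizes in machine words. The helper name words is made up for this example.

```rust
use std::mem::size_of;

// Hypothetical helper: size of T measured in pointer-sized words.
fn words<T>() -> usize {
    size_of::<T>() / size_of::<usize>()
}

fn main() {
    // Thin pointer: just an address.
    println!("&u8      = {} word(s)", words::<&u8>());

    // Fat pointers: an address plus a length.
    println!("&str     = {} word(s)", words::<&str>());
    println!("&[u8]    = {} word(s)", words::<&[u8]>());
    println!("Box<str> = {} word(s)", words::<Box<str>>());

    // String is an ordinary library struct. In current implementations
    // it holds a pointer, a length and a capacity, but its exact layout
    // is not something the language promises.
    println!("String   = {} word(s)", words::<String>());
}
```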


Thanks for all the explanations!

A small follow-up question about Deref:
Aren't these implicit coercions opaque and a potential source of bugs?

Yes they are implicit, yes they can cause bugs. But they generally don't cause bugs, because people tend to be responsible about them. Using them to emulate sub-typing is a bug though, and one some beginners fall into.

One example of a potential bug that was pointed out was blocking field access!

But this requires that the person implementing Deref be malicious, so it's not seen in practice.

I am also very interested in this Sized trait

Am I correctly assuming that it basically means that a type without it cannot be on the stack because its size cannot be determined?

Also, why can't the size of a String slice be determined at compile time? The slice size is constant in my example.

You are correct that, in theory, Deref enables arbitrary code to execute "invisibly."

However, the signature is very restrictive:

pub trait Deref {
    // ?Sized allows unsized targets such as str.
    type Target: ?Sized;

    fn deref(&self) -> &Self::Target;
}

Because Deref::deref takes &self, it is impossible for the function to mutate the members of self. Basically all that most Deref impls can possibly do is to return a pointer to existing data on self.


Now, yes, it could access globals, or try to cheat by using interior mutability types like RefCell or Mutex. In all of my time using Rust I've never written or seen a type that does this.


Exactly

let x = String::from("Hello World");

let y: &str = &x[..];
let z: &str = &x[2..];

y and z both have the same type, but they clearly don't have the same length. This is why str does not implement Sized.


To implement it you would need something like:

  • A type that represents a str on the stack. Like, str_array<7> for a 7-byte str.
    • It would either need to be a language type like [T; n], or it would have to wait for const generics to become stable so that it can be generic over the size.
  • The compiler would need to somehow special case str[3..5] to produce a str_array. But both str[3..5] and str[some_function()..b] are supplying the same argument type std::ops::Range<usize>, so they have to return the same output type.

Since you mentioned C++:

Rust's String is roughly equivalent to C++'s std::string type. They own a (possibly dynamically allocated) buffer which stores the characters of your string.

Rust's &str is roughly equivalent to C++'s std::string_view. You can think of them as non-owning pointers to a slice of memory. They are not pointers to a single character. They point to the whole string's sequence of bytes. Under the hood this is achieved by storing a pointer to the beginning of the buffer as well as a length (or two pointers for beginning and end). C++ doesn't have native support for such a "slice pointer" in the core language, which is why it provides a user-defined type in the standard library that emulates a "slice pointer"'s behaviour.

Rust kind of generalizes what a pointee type can be:

// C++
char*       // pointer to char
char(*)[5]  // pointer to char[5]
char(*)[]   // ILLEGAL

// Rust
&u8       // reference to a u8
&[u8; 5]  // reference to a [u8; 5]
&[u8]     // reference to a [u8]

Rust makes this work by turning &[u8] into a "fat pointer" that stores both a pointer and a length. But there are still limitations involving [T]. It's a so-called "dynamically sized type" and doesn't "work" unless it's behind a pointer or reference. You cannot hold a [T] directly:

fn does_not_work(s: &[i32]) {
    let x = *s; // <-- invalid
}

The compiler likes to know at compile-time how much space to reserve on the stack for x. But the length of that slice is a runtime property and part of the fat pointer s. So, the length is not statically known.

Why did I bring up a type like [T]? Because str is similar. You can think of str as [u8] with the additional guarantee that it's a valid UTF-8 encoding.

On top of this, you should know that a &String can be coerced into a &str which makes this work:

fn show(s: &str) {}
fn main() {
    let s: String = "hello".into();
    show(&s);
}

This is because String implements Deref<Target = str>.


This, imho, should be the start point. What are Vec<u8> and [u8]?

  • [u8] represents the type for sequences of bytes of any length (that is, the length is a runtime property). In Rust parlance, this is called a slice.

    • Since the length is only known at runtime (!Sized), a slice cannot be inlined into the stack, since stack memory is managed with compile-time (and thus fixed) parameters. This is what prevents us from using !Sized stuff directly.

      We can circumvent this restriction with indirection: any sequence of bytes, whatever its length may be, once in memory, can be referred to by a reference / pointer to the first element and a second field with the number of elements (we call this a fat pointer). This is the case of, for instance:

      • shared reference to a slice, &[u8] (or more generally, &[T]),

      • unique reference to a slice, &mut [u8] (or more generally, &mut [T]),

      • and owning references / pointers, such as Box<[u8]>, Rc<[u8]>, Arc<[u8]>.

  • One way to create an element with variable length (dynamic allocation) is by using the heap. This works in multiple steps:

  1. We ask the heap-allocator for a chunk of memory able to hold capacity elements;

  2. If the allocator succeeds, we get back a pointer to the heap, to the beginning of the allocated (but uninitialised) memory;

  3. We can then initialise any len number of elements, as long as len <= capacity (else a reallocation is needed).

  • That's why such a heap-managed structure must have at least these three fields (ptr, len, capacity), and this is exactly what a Vec<u8> (or more generally, a Vec<T>) is.

    • a corollary of that is that from a ptr, len, capacity tuple, we can choose to keep ptr, len only. ptr... len... This rings a bell... Oh, right, we have successfully managed to have a reference to a slice!
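
The (ptr, len, capacity) picture above can be poked at with a short sketch. The helper name view_of is hypothetical; it just drops the capacity and hands back the (ptr, len) view.

```rust
// Hypothetical helper: borrow a Vec as a slice, i.e. keep (ptr, len)
// and forget about the capacity. &Vec<u8> is used on purpose here to
// make the conversion visible.
fn view_of(v: &Vec<u8>) -> &[u8] {
    v.as_slice()
}

fn main() {
    // Step 1 and 2: ask for room for 10 bytes; nothing is initialised yet.
    let mut v: Vec<u8> = Vec::with_capacity(10);
    println!("len = {}, capacity >= 10: {}", v.len(), v.capacity() >= 10);

    // Step 3: initialise three elements; len grows, no reallocation needed.
    v.extend_from_slice(b"abc");

    // The slice shares the Vec's pointer and length.
    let s = view_of(&v);
    println!("same ptr: {}, slice len = {}", s.as_ptr() == v.as_ptr(), s.len());
}
```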

Ok, ok, but the OP asked about String and str, so what does any of this have to do with it?

Very simple:

String / str is exactly like Vec<u8> / [u8], except that the sequence of bytes must uphold a property / invariant: it must be valid UTF-8.

That's why there are trivial conversions (casts) from the former to the latter (<Vec<u8> as From<String>>::from and str::as_bytes), whereas the other way around requires a runtime-checked cast.
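
Here is a minimal sketch of both directions. The helper names round_trip and is_valid are made up for illustration; the standard library calls they wrap (Vec::from, String::from_utf8, std::str::from_utf8) are the real checked and unchecked-cost conversions.

```rust
use std::string::FromUtf8Error;

// Hypothetical helper: String -> Vec<u8> is free (the buffer is reused),
// Vec<u8> -> String has to validate the bytes at runtime.
fn round_trip(s: String) -> Result<String, FromUtf8Error> {
    let bytes: Vec<u8> = Vec::from(s); // same as s.into_bytes()
    String::from_utf8(bytes)
}

// Hypothetical helper: the borrowed version of the same check.
fn is_valid(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    let text = "héllo";

    // str -> [u8]: no check needed, every str is a valid byte sequence.
    println!("{} bytes", text.as_bytes().len());

    // [u8] -> str: checked, and it can fail.
    println!("valid text:    {}", is_valid(text.as_bytes()));
    println!("invalid bytes: {}", is_valid(&[0xff, 0xfe]));

    println!("{:?}", round_trip(String::from(text)));
}
```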


cough lazy_static cough :stuck_out_tongue:


Oops! You got me there. (especially after I wrote this... tsk!)

