Noob question: why "&str"?

You talk about all the unique behaviors of &str compared to e.g. &i32 but they are actually quite similar.

I mean, for one they both reference some sort of object in memory, and are borrows of that object with lifetimes to ensure it lives long enough. Sure one of them is larger than the other, but this also happens for trait objects such as &dyn Trait that need to keep a vtable around. Arguably this isn't a pointer either as it has extra information, but we still call it a fat pointer (even in C++).
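
To make the size comparison concrete, here is a minimal sketch that measures the three kinds of reference. The helper name reference_sizes is made up for illustration, and the 1x/2x word sizes are what current compilers produce in practice rather than something I'd quote as a hard language guarantee.

```rust
use std::fmt::Debug;
use std::mem::size_of;

/// Hypothetical helper: sizes in bytes of (&i32, &str, &dyn Debug)
/// on whatever target this is compiled for.
fn reference_sizes() -> (usize, usize, usize) {
    (size_of::<&i32>(), size_of::<&str>(), size_of::<&dyn Debug>())
}

fn main() {
    let (thin, str_ref, dyn_ref) = reference_sizes();
    // &i32 is one machine word; &str adds a length; &dyn Debug adds a vtable pointer.
    println!("&i32: {thin} bytes, &str: {str_ref} bytes, &dyn Debug: {dyn_ref} bytes");
}
```

So &str and &dyn Trait are "fat" in exactly the same way, just with different metadata in the second word.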

3 Likes

I'm new to Rust so many things confuse me including strings. But then strings are confusing in pretty much all languages, especially C++.

The way I see it is we want to be able to write things like:

let hello_world = "Hello, world!";

So what type is 'hello_world'?

We clearly want it to be some kind of string. OK, let's call it 'str': that is nice and succinct, and a well-known abbreviation.

But we don't want to pass 'str' around as a value, since that could be a lot of bytes. So it had better be some kind of reference, a pointer thing pointing to the actual bytes:

let hello_world: &str = "Hello, world!";

Of course in Rust we have lifetimes to think about so the full type we need is:

let hello_world: &'static str = "Hello, world!";

As explained here:
https://doc.rust-lang.org/std/str/

Where it clearly states that &str is a string slice: presumably a pointer and a length, or perhaps a begin and end pointer. Either way, it's a 'fat pointer' as noted above.

All in all strings in Rust end up being pretty sensible compared to C++. Especially if Unicode is involved.

4 Likes

Not a "noob" question at all: This is a really interesting and non-trivial language design trade-off!

This is all correct, although "not really a reference/pointer" depends on interpretation. As far as the borrow checker is concerned, &str really is a shared reference to a str: dereferencing it gives you a str, borrowing a str gives you a &str, and so on. But it's also true that &str is a "fat pointer", with a larger memory footprint than a thin pointer.

The basic idea is that Rust references provide more safety and correctness guarantees than C++ references, and providing those guarantees often requires carrying around some metadata in addition to the familiar pointer. For view types, that metadata is just a length, and for trait objects that metadata is a pointer to a vtable, but the principle is the same.

The details include but are not limited to:

  • "object slicing" is impossible in Rust. The equivalent syntax simply won't compile, because Rust expects you to go through trait objects if you really want runtime polymorphism.
  • The weirdly subtle rule for when a destructor has to be virtual doesn't exist in Rust, because trait objects are a separate type provided by the Rust language. You don't need to and can't declare or implement a trait object's drop() method yourself, and thus you can't possibly get it wrong.
  • The whole virtual inheritance thing is basically an ad hoc solution to "how do we prevent multiple inheritance from slicing the base object?". Again, Rust trait objects don't have this problem because they neither allow nor require you to muck with their implementation.
  • Like in C++, view types such as &[T] and &str keep the pointer and the length together, which makes it possible to do basic operations like indexing safely.
  • Unlike in C++, view types such as &[T] and &str obey all of the usual borrow checking rules in Rust, so the equivalent of a dangling/invalidated string_view is guaranteed to be a compile-time error.
  • String literals are a safe &str in Rust, rather than an unsafe const char* in C++.

AFAIK, the fact that C++ implemented string_view as a library rather than a core language feature is not because they believe a library solution is better, but because backwards compatibility rules out any good core language solutions. For example, C++ can never change string literals to be string_views.

The only thing that's actually unique about the &str type in particular is not that it's a reference instead of a library type, or a fat pointer instead of a thin pointer, but that it's a distinct type from &[u8]. And that's "only" because &str guarantees that it's valid UTF-8. In this respect, C++ will actually be adding its own equivalents in the future (the proposals I've seen call them std::text and std::text_view).
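
A small sketch of that &str vs &[u8] distinction, using std::str::from_utf8 from the standard library. The helper name is_valid_str is made up for this example.

```rust
use std::str;

/// Hypothetical helper: true if `bytes` can be viewed as a &str,
/// i.e. if the bytes are valid UTF-8.
fn is_valid_str(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

fn main() {
    let good: &[u8] = b"hello";
    // The byte 0xFF can never appear in well-formed UTF-8.
    let bad: &[u8] = &[0x68, 0x69, 0xFF];
    println!("good: {}, bad: {}", is_valid_str(good), is_valid_str(bad));
}
```

Any byte sequence is a fine &[u8], but safe code can only get a &str out of it by passing the validity check.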

9 Likes

In terms of naming, I think calling it a "string view" would be super helpful for novice users. Lots of noobs see that "hello world" is a &str, assume that's something like a String* type, and get lost in a borrow checker swamp.

As to why it's actually called like that, you can probably find the answer somewhere in old issues or commits on GitHub. I wouldn't be surprised if it was just an accident. Rust 0.x used to have super short keywords like ret for return, and it had ~[T] for owned and &[T] for borrowed types, so &str may have survived from that era, unlike other syntax and names.

2 Likes

I recall seeing several Rust team members lamenting that String and str should have been called StringBuf and String (like PathBuf and Path) with the benefit of hindsight.

7 Likes

Then there's also the issue of Deref. If slices and string slices weren't pointers, how would you impl Deref<Target=???> for types like Vec and String? In general, how would you deal with &T in a generic manner if you had to always special-case when T == [U] or T == str?
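
Roughly what that buys you in practice, as a sketch. byte_len and sum are hypothetical helpers, not anything from std; the point is only that one signature taking &str or &[T] accepts the owning types through deref coercion.

```rust
/// Hypothetical helper that only needs a string view.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Hypothetical helper that only needs a slice view.
fn sum(xs: &[i32]) -> i32 {
    xs.iter().sum()
}

fn main() {
    let owned = String::from("hello");
    let numbers = vec![1, 2, 3];
    // &String coerces to &str because String implements Deref<Target = str>.
    println!("{}", byte_len(&owned));
    // &Vec<i32> coerces to &[i32] because Vec<T> implements Deref<Target = [T]>.
    println!("{}", sum(&numbers));
}
```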

1 Like

Just for completeness: The official C++ answer is probably type traits. Behold std::pointer_traits.

Hopefully it's already clear why I think Rust's fat pointers are a better answer.

1 Like

Fair enough. As usual, the C++ solution is not to make stuff uniform; instead, it is to add even more non-uniform stuff to handle bad insights of the past.

2 Likes

Thank you all for the excellent explanations! Now I'm starting to get it. I think the stubborn C++ mindset in me is that "a reference/borrow has to be a (thin) pointer, and that's it". This view is not general enough to understand borrows in Rust. With all the trait object/fat pointer explanations, I can now see that "&str" is not a unique thing: it follows all the borrow checker rules just the same as every other borrow.

My takeaways:

  1. A borrow is not a C++ reference / (thin) pointer.
  2. A borrow is more of a semantic concept. Focusing on the memory layout means falling back on old C habits, and that limits your view.
10 Likes

To add a little more color, let's also consider that &str is not actually a reference to a String. It's a reference to a contiguous sequence of bytes that are guaranteed (in safe Rust) to be valid UTF-8. That may be a String, which allocates its data on the heap, but it could also be a type from a third-party crate like smallvec, or bytes compiled into your binary, as in the case of str literals, or even (with the help of a little unsafe Rust) a fixed-size array.

So there's no real way to look at &str as a view into some other particular owning type.
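
As a rough illustration using only std types: the same function can take a &str borrowed from several unrelated owners. shout is a hypothetical helper invented for this example.

```rust
use std::rc::Rc;

/// Hypothetical helper: it only sees a &str and cannot tell who owns the bytes.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

fn main() {
    let literal: &'static str = "abc"; // bytes compiled into the binary
    let owned: String = String::from("abc"); // growable heap buffer
    let boxed: Box<str> = "abc".into(); // fixed-size heap allocation
    let shared: Rc<str> = Rc::from("abc"); // reference-counted allocation

    println!("{}", shout(literal));
    println!("{}", shout(&owned));
    println!("{}", shout(&boxed));
    println!("{}", shout(&shared));
}
```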

5 Likes

No unsafe code required for stack allocated &str

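use std::io::Read; // brings the `read` method into scope
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut buf = [0u8; 1024]; // fixed-size buffer on the stack
    let len = std::io::stdin().read(&mut buf)?; // number of bytes actually read
    let line = std::str::from_utf8(&buf[..len])?; // checked &str view into buf, no copy
    println!("{}", line);
    Ok(())
}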
7 Likes

Right. My mistake.

1 Like

Passing-by question: so Rust does not have a C-string, the one that is purely a pointer to char? Does this mean Rust's native strings are optimized for iteration (no branch on a terminator) but not for passing around (the reference is twice as heavy)?

Rust also has raw pointers such as *const c_char, which are plain pointers, but they can only be dereferenced in unsafe contexts, which of course is effectively also true of C strings. There are also CString and CStr, which can be used in safe code.

1 Like

(Note that &CStr is not compatible with const char *; similarly, CString is not safe to send directly across FFI.)

From docs of CStr

Note that this structure is not repr(C) and is not recommended to be placed in the signatures of FFI functions. Instead, safe wrappers of FFI functions may leverage the unsafe from_ptr constructor to provide a safe interface to other consumers.

From docs of CString

CString implements an as_ptr method through the Deref trait. This method will give you a *const c_char which you can feed directly to extern functions that expect a nul-terminated string, like C's strdup(). Notice that as_ptr returns a read-only pointer; if the C code writes to it, that causes undefined behavior.
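
A minimal sketch of the pattern those two doc excerpts describe, kept entirely in Rust so it runs without a C library. round_trip_len is a hypothetical helper; CString::new, as_ptr, CStr::from_ptr and to_bytes are the real std::ffi calls.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical helper: builds a nul-terminated copy of `s`, takes the thin
/// pointer a C function would receive, and reads it back through CStr.
fn round_trip_len(s: &str) -> usize {
    // Fails if `s` contains an interior nul byte.
    let owned = CString::new(s).expect("no interior nul bytes");
    // This thin pointer is what would actually cross the FFI boundary.
    let ptr: *const c_char = owned.as_ptr();
    // SAFETY: `ptr` comes from `owned`, which is alive and nul-terminated.
    let borrowed: &CStr = unsafe { CStr::from_ptr(ptr) };
    borrowed.to_bytes().len()
}

fn main() {
    println!("{}", round_trip_len("hello"));
}
```

Note that owned has to outlive every use of ptr; the raw pointer itself carries no lifetime.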

3 Likes

You can pass around references to arrays of u8, i8 or c_char if you like. Just as you can pass pointers in C. Rust strings are a string type to which there's no real equivalent in the C library. People in C usually use another library for Unicode strings or create their own ad-hoc implementation of UTF-8 strings using char arrays and C library functions. The latter isn't ideal.

1 Like

Rust does have C-strings, but they are not native Rust strings and most of Rust's primitives can't work with them without a potentially-recoding conversion.

Rust strings are not self-delimiting: there is no terminating 0x00. Thus they are defined by both a starting address and a byte count, which is why unsafe pointers and safe references to Rust strings require a "fat pointer".

C-strings were designed to encode the very limited character set of common US English. As such they cannot encode most of the languages of the world, nor even full US English which uses diaeresis (e.g., the accented i in naïve). Only 36% of the world uses a Roman character set, and much of that use requires accented vowels and consonants that are not ASCII and thus can't be represented by C-string's char.

Rust strings are UTF-8, which can encode virtually all the written languages of the world. That means that most characters in world languages encode in more than one byte, and that a pointer to a specific byte of a Rust string might be pointing into the middle of the multi-byte encoding of a single char.
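
To illustrate with the naïve example from above, here is a short sketch. prefix is a hypothetical helper built on str::get, which returns None rather than handing out a slice that splits a character. The ï is written as the escape \u{ef} so the byte counts don't depend on how the source file is normalized.

```rust
/// Hypothetical helper: the first `end` bytes of `s` as a &str, or None if
/// `end` is out of range or falls inside a multi-byte character.
fn prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..end)
}

fn main() {
    let word = "na\u{ef}ve"; // "naïve" with a precomposed ï (two bytes in UTF-8)
    println!("{} chars, {} bytes", word.chars().count(), word.len());
    println!("{:?}", prefix(word, 2)); // ends just before the ï
    println!("{:?}", prefix(word, 3)); // would split the ï
    println!("{}", word.is_char_boundary(3));
}
```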

2 Likes

It also means that constructing substrings and querying the length become constant time, instead of linear time as in C.

4 Likes

And on top of that (or underneath that?) creating substrings is zero-copy.
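
A quick way to convince yourself of both points, as a sketch. is_view_into is a hypothetical helper that just compares addresses.

```rust
/// Hypothetical helper: true if `part` lies entirely inside the memory of `whole`.
fn is_view_into(whole: &str, part: &str) -> bool {
    let start = whole.as_ptr() as usize;
    let end = start + whole.len();
    let p = part.as_ptr() as usize;
    p >= start && p + part.len() <= end
}

fn main() {
    let text = String::from("Hello, world!");
    // Slicing only computes a new pointer and length; no bytes are copied.
    let world = &text[7..12];
    // len() just reads the stored length; there is no scan for a terminator.
    println!("{} ({} bytes)", world, world.len());
    println!("shares memory: {}", is_view_into(&text, world));
}
```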

3 Likes