Noob question: why "&str"?

Noob question (from C++ background):

I'm quite confused with the type &str, my understanding is that str is an unsized type, and we can only interact with it by the "reference form", i.e. &str. In addition, the type &str is not really a reference/pointer (or borrow, in rust's word), but rather a "view". std::mem::size_of::<&str>() gives 16 rather than 8 (for most borrow/ref types) as it has a pointer as well as a length stored inside.
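
For reference, the sizes mentioned above can be checked directly. This is only a minimal sketch: `ref_sizes` is a hypothetical helper name, and the 16 versus 8 figures assume a 64-bit target, so the checks compare against `size_of::<usize>()` instead of hard-coding them.

```rust
use std::mem::size_of;

/// Returns (size of `&i32`, size of `&str`) in bytes.
/// `ref_sizes` is a hypothetical helper used only for this sketch.
fn ref_sizes() -> (usize, usize) {
    (size_of::<&i32>(), size_of::<&str>())
}

fn main() {
    let (thin, fat) = ref_sizes();
    // A reference to a sized type is one pointer wide.
    assert_eq!(thin, size_of::<usize>());
    // A reference to `str` carries a pointer plus a length.
    assert_eq!(fat, 2 * size_of::<usize>());
    println!("&i32: {} bytes, &str: {} bytes", thin, fat);
}
```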

My rant: why "&str"? Why not "str_view"? All the unique behaviors of "&str" compared to some general borrow type (e.g. "&i32") indicate that it is really not a borrow/reference, but a "view" type. The size_of==16 thing clearly says that... Using "&str" as the type name keeps me from seeing the rationale behind Rust's primitive types.

I guess the array slice/view &[T] is similar, but to keep things simple, I'll skip my rant over that.

Could you help explain the rationale behind this? I find it hard to get over... I suspect it's due to some stubborn C++ mindset, but it's really hard for me to pin down what exactly it is...

1 Like

Isn't it better to have &str, &mut str, Box<str>, Cow<'_, str>, *mut str, *const str instead of str_view, str_mut_view, str_boxed, str_cow, str_mut_ptr, str_ptr, and then duplicate all of them for slices ([T])?

In fact, the str itself really is a type. But as its size is dynamically determined, we need additional size information to represent its pointer. Note that size_of::<*mut str>() == 2 * size_of::<usize>()
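
As a rough illustration of that point, the sketch below checks that several different pointer kinds to `str` all have the same two-word size, and that they can all refer to the same kind of data. The helper name `str_pointer_sizes` is made up for this example.

```rust
use std::borrow::Cow;
use std::mem::size_of;

/// Sizes of a few different pointer kinds that all point at `str`.
/// `str_pointer_sizes` is a hypothetical helper for this sketch.
fn str_pointer_sizes() -> [usize; 4] {
    [
        size_of::<&str>(),
        size_of::<&mut str>(),
        size_of::<Box<str>>(),
        size_of::<*const str>(),
    ]
}

fn main() {
    // Every pointer to the unsized `str` stores an address and a length.
    let two_words = 2 * size_of::<usize>();
    for size in str_pointer_sizes() {
        assert_eq!(size, two_words);
    }

    // The same `str` type sits behind each pointer kind.
    let boxed: Box<str> = "hello".into(); // owning pointer
    let borrowed: &str = &boxed; // shared borrow of the boxed str
    let cow: Cow<'_, str> = Cow::Borrowed(borrowed); // borrowed-or-owned
    assert_eq!(&*cow, "hello");
}
```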

4 Likes

&str doesn't really have unique behaviour, all references and pointers to unsized values include extra information (either a length for slice-like unsized types or a method table for trait-object-like types). It is written as a reference because it is a reference, and it follows all the usual rules for borrowing. It would be somewhat surprising if string_view behaved like a reference, and writing &string_view just gets back to a slightly longer form of &str (str is already a view of a string - unsized types are just views of sized types that erase some type information). Also, having the reference written explicitly allows for other forms of pointer to str (which @Hyeonu listed). All of these pointers to str (which each give different guarantees) include its length because str is unsized. The slice type [T] behaves similarly.

If you want to read more on the details of unsized types (more detail than is required for most safe Rust programming), you might find this section of the nomicon interesting.

2 Likes

You talk about all the unique behaviors of &str compared to e.g. &i32 but they are actually quite similar.

I mean, for one they both reference some sort of object in memory, and are borrows of that object with lifetimes to ensure it lives long enough. Sure one of them is larger than the other, but this also happens for trait objects such as &dyn Trait that need to keep a vtable around. Arguably this isn't a pointer either as it has extra information, but we still call it a fat pointer (even in C++).
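
A small sketch of the trait-object case, assuming nothing beyond the standard library (`describe` is a hypothetical function name): `&dyn Display` is two words wide just like `&str` and `&[u8]`, yet it is used like any other shared reference.

```rust
use std::fmt::Display;
use std::mem::size_of;

/// Formats any displayable value through a trait-object reference.
/// `describe` is a hypothetical name used only for illustration.
fn describe(value: &dyn Display) -> String {
    format!("value = {}", value)
}

fn main() {
    // Pointer + vtable for trait objects, pointer + length for slices.
    assert_eq!(size_of::<&dyn Display>(), 2 * size_of::<usize>());
    assert_eq!(size_of::<&[u8]>(), 2 * size_of::<usize>());

    // Ordinary borrows coerce to the fat trait-object reference.
    let n = 42;
    assert_eq!(describe(&n), "value = 42");
    assert_eq!(describe(&"text"), "value = text");
}
```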

3 Likes

I'm new to Rust so many things confuse me including strings. But then strings are confusing in pretty much all languages, especially C++.

The way I see it is we want to be able to write things like:

let hello_world = "Hello, world!";

So what type is 'hello_world'?

We clearly want it to be some kind of string. OK let's call it 'str' that is nice and succinct and a well known abbreviation.

But we don't want to pass 'str' around as a value, that could be a lot of bytes, so it had better be some kind of reference, a pointer thing pointing to the actual bytes:

let hello_world: &str = "Hello, world!";

Of course in Rust we have lifetimes to think about so the full type we need is:

let hello_world: &'static str = "Hello, world!";

As explained here:
https://doc.rust-lang.org/std/str/

Where it clearly states that &str is a slice, presumably holding a pointer and a length, or perhaps a begin and an end pointer; either way, a 'fat pointer' as noted above.
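
To make the pointer-plus-length picture a bit more concrete, here is a minimal sketch (the `greeting` function is hypothetical). It only uses the safe `as_ptr` and `len` accessors, and shows that a sub-slice is just another address and length into the same bytes.

```rust
/// A string literal lives for the whole program, hence `&'static str`.
/// `greeting` is a hypothetical function for this sketch.
fn greeting() -> &'static str {
    "Hello, world!"
}

fn main() {
    let hello_world: &'static str = greeting();
    // The length is stored in the reference itself, not in the bytes.
    assert_eq!(hello_world.len(), 13);

    // Slicing yields a new (pointer, length) pair into the same memory.
    let world: &str = &hello_world[7..12];
    assert_eq!(world, "world");
    assert_eq!(world.len(), 5);
    assert_eq!(world.as_ptr() as usize, hello_world.as_ptr() as usize + 7);
}
```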

All in all strings in Rust end up being pretty sensible compared to C++. Especially if Unicode is involved.

4 Likes

Not a "noob" question at all: This is a really interesting and non-trivial language design trade-off!

This is all correct, although "not really a reference/pointer" depends on interpretation. &str is really a shared reference to a str in terms of the borrow checker, and the fact that dereferencing it produces a str value, and the fact that borrowing a str value produces a &str value, and so on. But it's also true that &str is a "fat pointer" / has a larger memory footprint than a raw pointer.

The basic idea is that Rust references provide more safety and correctness guarantees than C++ references, and providing those guarantees often requires carrying around some metadata in addition to the familiar pointer. For view types, that metadata is just a length, and for trait objects that metadata is a vtable, but the principle is the same.

The details include but are not limited to:

  • "object slicing" is impossible in Rust. The equivalent syntax simply won't compile, because Rust expects you to go through trait objects if you really want runtime polymorphism.
  • The weirdly subtle rule for when a destructor has to be virtual doesn't exist in Rust, because trait objects are a separate type provided by the Rust language. You don't need to and can't declare or implement a trait object's drop() method yourself, and thus you can't possibly get it wrong.
  • The whole virtual inheritance thing is basically an ad hoc solution to "how do we prevent multiple inheritance from slicing the base object?". Again, Rust trait objects don't have this problem because they neither allow nor require you to muck with their implementation.
  • Like in C++, view types such as &[T] and &str keep the pointer and the length together, which makes it possible to do basic operations like indexing safely.
  • Unlike in C++, view types such as &[T] and &str obey all of the usual borrow checking rules in Rust, so the equivalent of a dangling/invalidated string_view is guaranteed to be a compile-time error.
  • String literals are a safe &str in Rust, rather than an unsafe const char* in C++.
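
The safe-indexing point from the list above can be sketched as follows. This is only an illustration with a hypothetical helper `first_n`; it relies on `str::get` and the slice `get` method, which return `None` instead of reading out of bounds.

```rust
/// Returns the first `n` bytes of `s` as a string slice, if that is valid.
/// `first_n` is a hypothetical helper for this sketch.
fn first_n(s: &str, n: usize) -> Option<&str> {
    // `get` checks both the length and the UTF-8 character boundaries.
    s.get(..n)
}

fn main() {
    let literal: &'static str = "héllo"; // 'é' takes two bytes in UTF-8
    assert_eq!(first_n(literal, 1), Some("h"));
    assert_eq!(first_n(literal, 2), None); // would split 'é' in half
    assert_eq!(first_n(literal, 3), Some("hé"));
    assert_eq!(first_n(literal, 100), None); // past the end

    // Slices of other element types get the same bounds checking.
    let numbers = [1, 2, 3];
    assert_eq!(numbers.get(1), Some(&2));
    assert_eq!(numbers.get(5), None);
}
```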

AFAIK, the fact that C++ implemented string_view as a library rather than a core language feature is not because they believe a library solution is better, but because backwards compatibility rules out any good core language solutions. For example, C++ can never change string literals to be string_views.

The only thing that's actually unique about the &str type in particular is not that it's a reference instead of a library type, or a fat pointer instead of a thin pointer, but that it's a distinct type from &[u8]. And that's "only" because &str guarantees that it's valid UTF-8. In this respect, C++ will actually be adding its own equivalents in the future (the proposals I've seen call them std::text and std::text_view).
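
A minimal sketch of that difference, using the standard `std::str::from_utf8` check (the wrapper name `as_text` is hypothetical): any `&str` can be viewed as bytes for free, but going from `&[u8]` to `&str` requires validation.

```rust
/// Views bytes as text only if they are valid UTF-8.
/// `as_text` is a hypothetical wrapper used for illustration.
fn as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    // Valid UTF-8 bytes can be borrowed as a `&str` without copying.
    assert_eq!(as_text(b"hello"), Some("hello"));
    // 0xff can never appear in UTF-8, so this is rejected.
    assert_eq!(as_text(&[0xff_u8, 0xfe]), None);
    // The other direction always succeeds.
    assert_eq!("hello".as_bytes(), b"hello");
}
```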

10 Likes

In terms of naming I think calling it a "string view" would be super helpful for novice users. Lots of noobs see "hello world" is a &str and assume that's something like a String* type and get lost in a borrow checker swamp.

As to why it's actually called like that, you can probably find the answer somewhere in old issues or commits on GitHub. I wouldn't be surprised if it was just an accident. Rust 0.x used to have super short keywords like ret for return, and it had ~[T] for owned and &[T] for borrowed types, so &str may have survived from that era, unlike other syntax and names.

2 Likes

I recall seeing several Rust team members lamenting that String and str should have been called StringBuf and String (like PathBuf and Path) with the benefit of hindsight.

5 Likes

Then there's also the issue of Deref. If slices and string slices weren't pointers, how would you impl Deref<Target=???> for types like Vec and String? In general, how would you deal with &T in a generic manner if you had to always special-case when T == [U] or T == str?
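
A rough sketch of what this uniformity buys (both function names are hypothetical): one generic function works for every pointer type that derefs to `str`, and another handles `&T` for sized and unsized `T` alike.

```rust
use std::mem::size_of_val;
use std::ops::Deref;

/// Works for anything that derefs to `str`: `String`, `&str`, `Box<str>`.
/// `shout` is a hypothetical name for this sketch.
fn shout<P: Deref<Target = str>>(p: P) -> String {
    p.deref().to_uppercase()
}

/// Handles `&T` uniformly, with no special case for `str` or `[U]`.
/// `bytes_behind` is likewise a hypothetical name.
fn bytes_behind<T: ?Sized>(r: &T) -> usize {
    size_of_val(r)
}

fn main() {
    assert_eq!(shout(String::from("abc")), "ABC");
    assert_eq!(shout("abc"), "ABC");
    assert_eq!(shout(Box::<str>::from("abc")), "ABC");

    assert_eq!(bytes_behind("abc"), 3); // T = str
    assert_eq!(bytes_behind(&[1u8, 2][..]), 2); // T = [u8]
    assert_eq!(bytes_behind(&7i32), 4); // T = i32
}
```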

1 Like

Just for completeness: The official C++ answer is probably type traits. Behold std::pointer_traits.

Hopefully it's already clear why I think Rust's fat pointers are a better answer.

1 Like

Fair enough. As usual, the C++ solution is not to make stuff uniform; instead, it is to add even more non-uniform stuff to handle bad insights of the past.

2 Likes

Thank you all for the excellent explanations! Now I'm starting to get it. I think the stubborn C++ mindset in me is that "a reference/borrow has to be a (thin) pointer, and that's it". This view is not general enough to understand borrows in Rust. With all the trait/fat pointer explanations, I can now see that "&str" is not a unique thing: it follows all the borrow checker rules just the same as every other borrow.

My takeaways:

  1. Borrow is not a C++ reference / (thin) pointer.
  2. Borrow is more of a semantic concept. Focusing on the memory layout falls back on old C habits, and that narrows the view.

8 Likes

To add a little more color, let's also consider that &str is not actually a reference to a String. It's a reference to a contiguous sequence of bytes that are guaranteed (in safe Rust) to be valid UTF-8. That may be a String, which allocates its data on the heap, but it could also be a type from a third-party crate like smallvec, or bytes compiled into your binary, as in the case of str literals, or even (with the help of a little unsafe Rust) a fixed-size array.

So there's no real way to look at &str as a view into some other particular owning type.
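
As a small sketch of that idea, the function below (with the hypothetical name `word_count`) takes a `&str` and neither knows nor cares whether the bytes live in a `String` on the heap, in the binary as a literal, or in part of another string.

```rust
/// Counts whitespace-separated words in any string slice.
/// `word_count` is a hypothetical function for this sketch.
fn word_count(s: &str) -> usize {
    s.split_whitespace().count()
}

fn main() {
    // Bytes compiled into the binary.
    let literal: &'static str = "one two three";
    // Bytes owned by a heap-allocated String.
    let owned: String = String::from("four five");
    // A sub-range of the String's bytes.
    let part: &str = &owned[..4];

    assert_eq!(word_count(literal), 3);
    assert_eq!(word_count(&owned), 2);
    assert_eq!(word_count(part), 1);
}
```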

4 Likes

No unsafe code required for stack allocated &str

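use std::io::Read; // brings `read` into scope for Stdin
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut buf = [0; 1024]; // fixed-size buffer on the stack
    let len = std::io::stdin().read(&mut buf)?;
    let line = std::str::from_utf8(&buf[..len])?; // borrows buf, no allocation
    println!("{}", line);
    Ok(())
}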
7 Likes

Right. My mistake.

1 Like

Drive-by question: so Rust does not have a C string, the kind that is purely a pointer to char? Does this mean Rust's native strings are optimized for iteration (no branch) but not for passing around (twice as heavy)?

Rust also has raw pointers such as *const u8, which are plain pointers, but they can only be dereferenced in unsafe contexts, which of course is also true of C strings. There are also CString and CStr, which can be used in safe code.

1 Like

(note that &CStr is not compatible with const char *, similarly CString is not safe to directly send across FFI)

From docs of CStr

Note that this structure is not repr(C) and is not recommended to be placed in the signatures of FFI functions. Instead, safe wrappers of FFI functions may leverage the unsafe from_ptr constructor to provide a safe interface to other consumers.

From docs of CString

CString implements an as_ptr method through the Deref trait. This method will give you a *const c_char which you can feed directly to extern functions that expect a nul-terminated string, like C's strdup(). Notice that as_ptr returns a read-only pointer; if the C code writes to it, that causes undefined behavior.
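
For illustration only, here is a sketch of how these two types fit together without calling any real C function (`round_trip` is a hypothetical helper). `CString::as_ptr` yields a thin `*const c_char`, and the unsafe `CStr::from_ptr` constructor borrows it back on the Rust side.

```rust
use std::ffi::{CStr, CString};
use std::mem::size_of;
use std::os::raw::c_char;

/// Converts text to a NUL-terminated C string and back again.
/// `round_trip` is a hypothetical helper for this sketch.
fn round_trip(text: &str) -> String {
    let owned = CString::new(text).expect("no interior NUL bytes");
    // A thin pointer, suitable for passing to a C function.
    let ptr: *const c_char = owned.as_ptr();
    // SAFETY: `ptr` comes from a live CString, so it is NUL-terminated
    // and stays valid for as long as `owned` is alive.
    let borrowed: &CStr = unsafe { CStr::from_ptr(ptr) };
    borrowed.to_str().expect("valid UTF-8").to_owned()
}

fn main() {
    assert_eq!(round_trip("hello"), "hello");
    // Interior NUL bytes cannot be represented in a C string.
    assert!(CString::new("a\0b").is_err());
    // The terminator is stored, but it is not part of the text.
    assert_eq!(CString::new("hi").unwrap().as_bytes_with_nul(), b"hi\0");
    // The raw C pointer is a single word: it carries no length.
    assert_eq!(size_of::<*const c_char>(), size_of::<usize>());
}
```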

3 Likes

You can pass around references to arrays of u8, i8 or c_char if you like. Just as you can pass pointers in C. Rust strings are a string type to which there's no real equivalent in the C library. People in C usually use another library for Unicode strings or create their own ad-hoc implementation of UTF-8 strings using char arrays and C library functions. The latter isn't ideal.

1 Like