What is so special about string slices?

If a string slice is a slice, why is &[ char ] not its type?

Why does this code not compile?

fn main()
{ 
  assert_eq!( is_slice( &[ 1 ,2 ,3 ] ), true );
  assert_eq!( is_slice( "abc" ), true );
}

fn is_slice< T >( _ : &[ T ] ) -> bool
{
  true
}


str is essentially a newtype over [u8] -- it's UTF-8, very intentionally not [char].
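To make that concrete, here is a minimal sketch that reuses the is_slice function from the question. The string literal is only accepted once you explicitly ask for its bytes or collect its chars; the variable name chars is just for illustration.

```rust
// Same generic function as in the question.
fn is_slice<T>(_: &[T]) -> bool {
    true
}

fn main() {
    // A reference to an array coerces to a slice, so T is inferred as i32.
    assert!(is_slice(&[1, 2, 3]));

    // `&str` is not `&[T]` for any T, so `is_slice("abc")` is rejected.
    // Asking for the UTF-8 bytes yields a `&[u8]`, which is accepted.
    assert!(is_slice("abc".as_bytes()));

    // Collecting the chars into a Vec gives a real slice of `char`.
    let chars: Vec<char> = "abc".chars().collect();
    assert!(is_slice(&chars[..]));
}
```

Note that as_bytes() is a free reinterpretation of the same memory, while collecting into a Vec<char> allocates and re-encodes every character.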


But why?
Is it correct that str is really &[ char ], but for some reason the compiler treats them as distinct?

No. They are two different things.

A &str is a slice of bytes (u8) that use the variable-length UTF-8 encoding (i.e. each character may take up 1, 2, 3, or 4 bytes).

A &[char] is a slice of chars, where each char is a 4-byte Unicode scalar value (a "unicode code point", excluding surrogates).
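A small sketch of the size difference, assuming a test string with one character of each UTF-8 length. The helper names byte_len and char_count are made up for this example.

```rust
use std::mem::size_of;

// Number of bytes in the UTF-8 encoding of `s`.
fn byte_len(s: &str) -> usize {
    s.len()
}

// Number of `char`s (Unicode scalar values) in `s`.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Every `char` occupies 4 bytes, whatever character it holds.
    assert_eq!(size_of::<char>(), 4);

    // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes).
    let s = "a\u{e9}\u{20ac}\u{1f600}";
    assert_eq!(byte_len(s), 10);
    assert_eq!(char_count(s), 4);

    // The same text stored as a slice of chars needs 4 * 4 = 16 bytes.
    let chars: Vec<char> = s.chars().collect();
    assert_eq!(chars.len() * size_of::<char>(), 16);
}
```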

Despite what most common programming languages would have you believe, a string isn't just an array of characters. To properly explain the distinction would require going down a very deep rabbit hole and I'll probably butcher it completely. So instead, I'll refer you to this excellent explanation from Tom Scott:

You can't even assume that a single code point correlates to a single glyph (very loosely - the "letter" you would draw) because you've got things like accents and skin tone modifiers which can be added after another code point to alter it.
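As a rough illustration of the accent case, the same visible letter can be written as one code point or as two. The helper code_points is hypothetical; the standard library has no grapheme segmentation, so this sketch only counts chars.

```rust
// Collects the Unicode scalar values of `s` into a Vec.
fn code_points(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    // Precomposed "e with acute accent": a single code point.
    let precomposed = "\u{e9}";
    // Plain 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points.
    let combined = "e\u{301}";

    assert_eq!(code_points(precomposed).len(), 1);
    assert_eq!(code_points(combined).len(), 2);

    // They usually render the same, but they are different strings
    // with different byte lengths.
    assert_ne!(precomposed, combined);
    assert_eq!(precomposed.len(), 2);
    assert_eq!(combined.len(), 3);
}
```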


Thanks for the answer. I was not aware that char is 4 bytes long. That's a discovery for me.
I am aware of "Unicode code points" and the logic behind them.

At its core, &str is &[ u8 ], and the only reason an explicit conversion between the two otherwise equivalent types is needed is the Unicode logic that str has to follow.

Is that correct? Or is there something else?

Even if they have identical binary representations, there are a few reasons why &str and &[u8] are two different types:

  1. A bunch of bytes and a string are two different logical concepts, and making them different types means you can use the type system to ensure they don't get accidentally mixed up
  2. You can give the str type methods specific to a string (e.g. lines() and trim()) and implement the Display trait
  3. The str type relies on unsafe code internally that assumes valid UTF-8, so if you provide direct mutable access to the underlying bytes, users may accidentally break str's assumptions and cause UB (e.g. you modified the last byte to look like the start of a multi-byte character and now some unsafe code will read past the end of the string)
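The three points above can be sketched with the checked conversion from bytes to str. The function parse_text is a made-up helper; std::str::from_utf8 is the standard validation call it wraps.

```rust
use std::str;

// Returns the bytes viewed as a `&str` only if they are valid UTF-8.
// The check is what lets `str` code assume well-formed input later on.
fn parse_text(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    // Valid UTF-8 passes validation and gains string-specific methods.
    let text = parse_text(b"  hello\nworld  ").unwrap();
    assert_eq!(text.trim(), "hello\nworld");
    assert_eq!(text.lines().count(), 2);

    // `str` implements Display; a `[u8]` does not.
    println!("{}", text.trim());

    // 0xE2 announces a 3-byte sequence, but the input ends right after it,
    // so these bytes are rejected instead of becoming a broken `str`.
    assert_eq!(parse_text(&[b'a', 0xE2]), None);
}
```

Going the other way, as_bytes() hands out read-only bytes safely, while getting mutable bytes out of a str requires an unsafe block, which is where the responsibility for keeping the UTF-8 invariant shifts to the caller.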

There is similar logic for why std::path::Path is a different type to str, even though most languages are happy to treat strings and file paths as the same thing.
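A short sketch of the Path case, assuming a hypothetical helper file_ext. Path gets path-specific methods, and turning it back into a str is a fallible step because a path need not be valid UTF-8.

```rust
use std::path::Path;

// Returns the file extension as text, if there is one and it is valid UTF-8.
fn file_ext(path: &Path) -> Option<&str> {
    path.extension().and_then(|ext| ext.to_str())
}

fn main() {
    // A `&str` has to be converted explicitly; passing it directly
    // to `file_ext` would be a type error.
    let p = Path::new("src/main.rs");
    assert_eq!(file_ext(p), Some("rs"));
    assert_eq!(file_ext(Path::new("Makefile")), None);

    // Path -> str returns an Option because the conversion can fail.
    assert_eq!(p.to_str(), Some("src/main.rs"));
}
```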


Thanks for the detailed explanation, @Michael-F-Bryan

