Why can String be sliced with a usize index?

Rust disallows random access to strings, but allows slicing strings using usize ranges, which I think is an inconsistent design.

let string = String::from("🌊🌊🌊");

println!("{}", string[1]);
// Compile time error:
// `String` cannot be indexed by `{integer}`

println!("{}", string[1..].chars().next().unwrap());
// Runtime error:
// panicked at 'byte index 1 is not a char boundary; it is inside '🌊' (bytes 0..4) of `🌊🌊🌊`'
  • Rust rejects accessing a string with a single usize at the language level (type system); but
  • Rust only checks string slices at runtime.

I don't understand this design; it doesn't seem well thought out. I often see people using this feature as disguised random access to strings, even in the rustc parser itself.

In Swift, strings have their own index type String.Index, which is a safe wrapper of the int index.

let string = "🌊🌊🌊"

print(string[1])
// Compile time error:
// 'subscript(_:)' is unavailable: cannot subscript String with an Int, use a String.Index instead.

print(string[1...].first)
// Compile time error:
// 'subscript(_:)' is unavailable: cannot subscript String with an integer range, use a String.Index range instead.

let one: String.Index = string.index(after: string.startIndex)
print(string[one]) // => 🌊
print(string[one...].first!) // => 🌊

Rust's String doesn't have its own storable index type, so we have to use some idioms:

  • When we need to store a String.Index, we use a std::str::Chars;
  • When we need a slice, we use a usize, which is not safe;

It's really a bit confusing. I need a reasonable explanation from the designers' point of view.

&str is a slice of UTF-8 and indexed by bytes.

Because not all byte indices are char boundaries, and slicing within a multi-byte codepoint would result in invalid utf-8, this restriction is checked at runtime. An invalid index results in a panic, which is safe.
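For anyone who wants to see the boundary check without triggering the panic, here is a minimal sketch. `tail_from` and `boundaries` are hypothetical helper names made up for this post; `str::get` and `str::is_char_boundary` are the std methods doing the work.

```rust
/// Returns the tail of `s` starting at byte offset `n`, or `None` when `n`
/// is out of range or falls inside a multi-byte code point.
/// (`tail_from` is a hypothetical helper name, not a std API.)
fn tail_from(s: &str, n: usize) -> Option<&str> {
    // `str::get` applies the same boundary check as `&s[n..]`,
    // but reports failure as `None` instead of panicking.
    s.get(n..)
}

/// Lists every byte offset of `s` at which slicing is allowed.
/// (`boundaries` is also a hypothetical helper name.)
fn boundaries(s: &str) -> Vec<usize> {
    (0..=s.len()).filter(|&i| s.is_char_boundary(i)).collect()
}
```

With the string from the question, `tail_from("🌊🌊🌊", 1)` is `None`, while offsets 0, 4, 8 and 12 are the valid cut points.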

Indexing a &str does not make sense because it is not an array. The atoms are char and &str is not a sequence of chars.


I need to describe my problem more clearly.

My question is not "Why random access to strings is not allowed?", and I'm really familiar with the Unicode standard. My point is the "inconsistency".

  • string[n] issues are reported at compile time;
  • string[n..m] issues are reported at runtime;

Do you see the problem? The same thing, completely different behavior.

What is the difference between string[n] and string[n..].chars().next().unwrap(), and why should the former be opposed and the latter be advocated?

My question is:

  • If the designers believe that the runtime check is good enough, why don't we have string[n] with runtime checks?
  • If the designers believe the compile-time check is important, why would they allow string[n..m]?

string[n] does not work because it is not implemented. It could be implemented to mean "decode the codepoint starting at n and return it as a char". But it isn't.

string[n..m] is implemented and requires runtime checking to ensure soundness.


Actually the Index trait does not allow returning a char, so that isn't even possible.


string[n] cannot be implemented to return char. An index expression is a place expression (an lvalue, if you are used to that term), which means it must be possible to take a reference to it. A str doesn't hold chars, which are Unicode scalar values, as-is.

Swift has one of the most interesting approaches in this category. A Character in Swift literally is a struct with a String field, as it represents a single extended grapheme cluster.


No I don't. There's no inconsistency. For starters, you assume these two are "the same thing". They aren't.

A string may have substrings, which are also strings. Therefore, str[start..end], when valid, yields the same type with the same (equally valid) meaning. You might supply an invalid range, but the operation makes sense in general.

In contrast, str[single_index] is not meaningful. A string is defined to be UTF-8, therefore its semantic building blocks aren't merely single bytes, they are variable-length grapheme clusters, which are themselves strings. Therefore, there is no good answer to return when indexing with a single integer:

  • if the character at that index happens to be encoded in a single byte, then what you really assume is that the string is ASCII and you should be operating on the underlying [u8] instead.
  • if the character is multi-byte, then indexing could either panic (which is useless since it would then always panic as we have established that you shouldn't use it for accessing ASCII byte strings), or it could return a substring of more than one byte, which would be counter-intuitive since you only asked for "one" element.

I can assure you strings have received an enormous amount of thought from the language team.


This is because ranges give strings, and there are useful use-cases for this (like splitting into substrings, working with char_indices).

Indexing could only give UTF-8 code units, which are generally useless, and exposing this has huge potential for misleading users into thinking these are "characters" (they very much aren't). Operations that are actually fine with using bytes, and don't want to work with characters are better done via .as_bytes().
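As a rough illustration of the `.as_bytes()` route, here is a small sketch. `byte_at` and `count_ascii_digits` are hypothetical helper names, not std APIs.

```rust
/// Returns the raw UTF-8 code unit at offset `n`, with no claim that it is
/// a "character". (`byte_at` is a hypothetical helper name.)
fn byte_at(s: &str, n: usize) -> Option<u8> {
    s.as_bytes().get(n).copied()
}

/// Counts ASCII digits by scanning raw bytes. This is fine on UTF-8 because
/// bytes below 0x80 never occur inside a multi-byte sequence.
/// (`count_ascii_digits` is a hypothetical helper name.)
fn count_ascii_digits(s: &str) -> usize {
    s.as_bytes().iter().filter(|b| b.is_ascii_digit()).count()
}
```

The point is that the byte view has to be requested explicitly: `byte_at("🌊", 0)` gives `Some(0xF0)`, which is clearly a code unit and not the wave emoji.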

Basically str[n] is not a correct way to read a "character", and Rust is intentionally designed to break code that makes such a bad assumption.


Forgive my horrible Swift but what about

let emoji = "🥺👉👈"
let ascii = "bottom"
let ix = ascii.index(after: ascii.index(after: ascii.startIndex))
print(emoji[ix])
print(ascii[ix])

This is perfectly allowed, doesn't crash, and prints

🥺
t

String.Index is a nice bit of salt to make people think about what they really want (and grapheme clusters by default is nice for a UX focused language). But it's not perfect (and it can't be!) as the above should show.

Swift tries to do the most correct thing it can by default.

Rust refuses to let you do anything unsafe, but it's perfectly happy to require you to know what you actually want and actually do that.


Well, I agree with part of your point:

  • string[n] has no meaning, we don't need it;
  • A string may have substrings, we need a function to slice it;

But, does the parameter of slicing really have to be the range of usize?

I insist that string[n] and string[n..m] are "the same thing", they have a strong connection.

  • They both treat the string as a Vec<u8>;
  • They can easily be implemented with each other, for example string[n..].chars().next().unwrap() or string[n..].bytes().next().unwrap();
  • They are very common APIs. If they are not safe, beginners may abuse them;

The Swift example I gave above seems to me to be more consistent.

Operation                            Rust   Swift   May cut inside a code point / cluster?
string[usize]                        No     No      Yes
string[usize..usize]                 Yes    No      Yes
string[String.Index]                 -      Yes     No
string[String.Index..String.Index]   -      Yes     No

I wish Rust had a mechanism like String.Index.

Just to be fair (not relevant to the discussion), Swift's String.Index has its own problems: it doesn't know exactly which string it is indexing.

let s1 = "abcdef"
let s2 = "🌊🌊🌊"

let index = s1.index(after: s1.startIndex)
print(s2[index]) // => \237 (meaningless output)

Note that "safe" has a very particular meaning in Rust. All of these things would be safe under the Rust definition of the word.

So please elaborate more on what you mean by that column, for a post here.


You're right, I changed it to "May cut inside a code point / cluster?".


But then it's not true for Swift String.Index, either… in the very example you showed, it cuts the "tsunami" code point in half. It just outputs garbage instead of panicking, which is (IMPO) strictly worse than panicking.


string[usize..usize] in Rust can't chop up characters (it panics instead). To do otherwise would be unsafe, because strs must be valid UTF-8.

This is usually not really a concern because byte indices are not usually arbitrary numbers, they come from earlier processing of the string and are thus "known good". But if you make a mistake, you get a panic instead of a corrupted string.
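To make "known good" indices concrete, here is a minimal sketch where every offset passed to the slicing operator comes from `find` or `char_indices`, so it is a char boundary by construction. `split_at_first` and `take_chars` are hypothetical helper names.

```rust
/// Splits `s` around the first occurrence of `delim`.
/// (`split_at_first` is a hypothetical helper name.)
fn split_at_first(s: &str, delim: char) -> Option<(&str, &str)> {
    // `find` returns the byte offset where `delim` starts, which is always
    // a char boundary, so the range slices below cannot panic.
    let start = s.find(delim)?;
    let end = start + delim.len_utf8();
    Some((&s[..start], &s[end..]))
}

/// Returns the prefix of `s` holding its first `n` chars (all of `s` if it
/// is shorter). (`take_chars` is a hypothetical helper name.)
fn take_chars(s: &str, n: usize) -> &str {
    // `char_indices` yields the byte offset at which each char starts.
    match s.char_indices().nth(n) {
        Some((offset, _)) => &s[..offset],
        None => s,
    }
}
```

Neither helper ever invents a number; the offsets are by-products of scanning the string, which is the usual way byte ranges get used in practice.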


When I saw this topic was still active, I thought "the fact I still don't even know exactly what they wish existed is a good argument for string[n] being non-intuitive and thus not in std." Then I thought "no, it's probably you who weren't paying enough attention," so I went back and read everything and... I'm afraid I really don't know what you wish existed! These aren't the same for example:

(Also, the following isn't true, or you wouldn't have to care about panics.)

  • [string[n..m] in Rust] treat[s] the string as a Vec<u8>;

If you can be precise about what you want, it may be possible via a new type. Or perhaps you're looking for the unicode-segmentation crate, from which you can create a Vec of grapheme strs. Or a combination thereof.


Hmm, now I'm puzzled, though, since those are different things. For example, Range<usize> indexing cannot cut inside a code point (it panics instead) but is happy to cut inside a grapheme so long as it's a codepoint boundary. And doesn't your \237 (meaningless output) example mean that Swift does allow cutting inside codepoints? (And thus inside grapheme clusters too.)
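A small sketch of the code point versus grapheme distinction, using "e" followed by U+0301 (combining acute accent) as the test string. `decomposed_e_acute` and `prefix` are hypothetical helper names.

```rust
/// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points (three
/// bytes) that render as a single user-perceived character.
/// (`decomposed_e_acute` is a hypothetical helper name.)
fn decomposed_e_acute() -> &'static str {
    "e\u{301}"
}

/// Non-panicking version of `&s[..end]`.
/// (`prefix` is a hypothetical helper name.)
fn prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..end)
}
```

Here `prefix(s, 1)` succeeds and returns a bare "e", splitting the grapheme cluster, while `prefix(s, 2)` is `None` because offset 2 is in the middle of the accent's two-byte encoding.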


I don't think the OP is asking for anything in particular. Rather, they're pointing out what is, in a narrow sense, a contradiction.

On the one hand, the language allows you to index raw bytes of a str for slices. It's expected that you have already verified the new slice will be meaningful, otherwise you will receive a panic. Yet the same is not true for single byte access. Even if you had already checked that the str contained only ASCII, you're still forbidden from indexing it as such.

I think the OP is suggesting that it would be more consistent to do one of two things. Either allow direct indexing of individual bytes, also with a panic on an invalid index. Or instead have an API that forces more checking at compile time. For example, perhaps you could only index str with some newtype wrapper that can only be derived by finding valid boundaries within the str first.
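One possible shape for such a newtype, purely as a sketch: `Boundary`, `next_in` and `slice` are hypothetical names, and nothing like this exists in std. Note that the index does not remember which string it was derived from, so the slicing step still has to re-check it.

```rust
/// A byte offset that was a char boundary of some string when it was
/// created. Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Boundary(usize);

impl Boundary {
    /// Validates a raw byte offset against `s`.
    fn new(s: &str, offset: usize) -> Option<Boundary> {
        if s.is_char_boundary(offset) {
            Some(Boundary(offset))
        } else {
            None
        }
    }

    /// Boundary just past the char that starts at `self`, if there is one.
    fn next_in(self, s: &str) -> Option<Boundary> {
        let c = s.get(self.0..)?.chars().next()?;
        Some(Boundary(self.0 + c.len_utf8()))
    }
}

/// Slices between two boundaries. A `Boundary` taken from a different
/// string may be invalid here, so this returns `None` rather than trusting it.
fn slice(s: &str, from: Boundary, to: Boundary) -> Option<&str> {
    s.get(from.0..to.0)
}
```

This keeps raw integers out of the call sites, but it does not remove the runtime check; it only moves it to where the index is created and where it is used.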

Others have already summarized arguments for the current behavior. I'm just trying to clarify what I think is OP's intent.


I don't personally see inconsistency in the design for Rust here. Rust's standard library simply doesn't impose any default way of segmenting strings. That's why only ranges work for slicing. The restriction for ranges to panic when UTF-8 encodings of individual unicode scalar values would be split is merely for soundness (because a str must not begin / end in such a place).

One secondary reason why string[n] doesn't work is: suppose we did want to (by default) include the whole scalar value after the given byte index. Then people might expect to get a char from such an operation (as you'd get from string[n..].chars().next().unwrap()). But since indexing in Rust must refer to an existing place, all that could be returned would be a str containing a single char.
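To show what that would look like, here is a sketch of a wrapper type whose `[n]` yields the one-char sub-slice. `ByScalar` is a hypothetical type; the only real API involved is the `std::ops::Index` trait, whose `index` method has to return `&Self::Output`.

```rust
use std::ops::Index;

/// Hypothetical wrapper whose `[n]` means "the code point that starts at
/// byte offset `n`".
struct ByScalar<'a>(&'a str);

impl<'a> Index<usize> for ByScalar<'a> {
    // `index` returns a reference to something that already exists in
    // memory. No `char` is stored in the UTF-8 buffer, so the output here
    // is a one-char `str` sub-slice.
    type Output = str;

    fn index(&self, n: usize) -> &str {
        // Panics if `n` is not a char boundary, or if no char starts at `n`.
        let tail = &self.0[n..];
        let c = tail.chars().next().expect("no char starts at this offset");
        &tail[..c.len_utf8()]
    }
}
```

So `&ByScalar("a🌊b")[1]` is the `&str` "🌊", not the `char` '🌊'; getting the char still takes a `.chars().next()` on top.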

There could be an argument for disallowing string[n..m], too: you might not want to impose a default way of indexing (i.e. by bytes vs by scalar value vs by grapheme cluster, etc.). However, choosing a default here (indexing by bytes) makes sense because it's the only thing that can be done efficiently.


I don't have a good picture of what this Swift API that's discussed here does, but it feels like there's no problem in implementing such a string-index type in a crate for Rust, too, with similar API. As far as I understood, it involves splitting by grapheme clusters, which is something that Rust standard library doesn't support, so an external crate seems appropriate.


This would be nice. But it seems hard to make work with borrow checker rules.

I've run into the same issue in other contexts: if the validated index stores a reference to the String, you can't pass it to a mutating function because that requires you to take a mutable reference to the String before calling the function that consumes the index.
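A minimal sketch of that situation, assuming a hypothetical `CheckedIndex` type that borrows the string it was validated against. The workaround shown (copying the raw offset out) compiles, but it throws away exactly the guarantee the index was supposed to carry.

```rust
/// Hypothetical index that borrows the string it was validated against.
struct CheckedIndex<'a> {
    owner: &'a str,
    offset: usize,
}

impl<'a> CheckedIndex<'a> {
    /// Index just past the first char of `owner` (0 for an empty string).
    fn after_first_char(owner: &'a str) -> Self {
        let offset = owner.chars().next().map_or(0, char::len_utf8);
        CheckedIndex { owner, offset }
    }
}

/// Inserts `extra` after the first char of `text`.
fn insert_after_first_char(text: &mut String, extra: &str) {
    let idx = CheckedIndex::after_first_char(text.as_str());
    debug_assert!(idx.owner.is_char_boundary(idx.offset));
    // Handing `idx` itself to a function that also takes `&mut String`
    // is rejected by the borrow checker, because `idx.owner` still holds a
    // shared borrow of `text`. Copying the offset out ends that borrow.
    let raw = idx.offset;
    text.insert_str(raw, extra);
}
```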

It seems like there could be a way to loosen borrow checker rules somehow to allow this sort of thing.

This is actually quite terrifying. Instant UB right there - example:

let s1 = "abcdef"
let s2 = "🌊🌊🌊"

let index = s1.index(s1.startIndex, offsetBy: 4)
let s3 = String(s2[index])
print(s3.utf16.count)

results in:
