Why can String be sliced with a usize index?

Rust disallows random access to strings, but allows slicing strings using usize ranges, which I think is an inconsistent design.

let string = String::from("🌊🌊🌊");

println!("{}", string[1]);
// Compile time error:
// `String` cannot be indexed by `{integer}`

println!("{}", string[1..].chars().next().unwrap())
// Runtime error:
// panicked at 'byte index 1 is not a char boundary; it is inside '🌊' (bytes 0..4) of `🌊🌊🌊`'
  • Rust guarantees at the language level (via the type system) that you cannot index a string with a usize; but
  • Rust only checks string slicing at runtime.
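For reference, the runtime check on slicing has non-panicking counterparts in the standard library; a minimal std-only sketch:

```rust
fn main() {
    let string = String::from("🌊🌊🌊"); // each 🌊 is 4 bytes in UTF-8

    // `get` returns None instead of panicking on a non-boundary index.
    assert!(string.get(1..).is_none()); // 1 is inside the first '🌊'
    assert!(string.get(4..).is_some()); // 4 is a char boundary

    // `is_char_boundary` lets you check an index up front.
    assert!(!string.is_char_boundary(1));
    assert!(string.is_char_boundary(4));

    println!("{}", &string[4..8]); // the second 🌊
}
```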

I don't understand this design; it doesn't seem well thought out. I often see people using this feature as disguised random access to strings, even in the rustc parser itself.

In Swift, strings have their own index type, String.Index, which is a safe wrapper around an integer offset.

let string = "🌊🌊🌊"

// Compile time error:
// 'subscript(_:)' is unavailable: cannot subscript String with an Int, use a String.Index instead.

// Compile time error:
// 'subscript(_:)' is unavailable: cannot subscript String with an integer range, use a String.Index range instead.

let one: String.Index = string.index(after: string.startIndex)
print(string[one]) // => 🌊
print(string[one...].first!) // => 🌊

Rust's String doesn't have its own, storable index type, so we had to use some idioms:

  • When we need to store a String.Index, we use a std::str::Chars;
  • When we need a slice, we use the usize, which is not safe;

It's really a bit confusing, I need a reasonable explanation from the designer's point of view.

&str is a slice of UTF-8 and indexed by bytes.

Because not all byte indices are char boundaries, and slicing within a multi-byte code point would produce invalid UTF-8, this restriction is checked at runtime. An invalid index results in a panic, which is safe.

Indexing a &str does not make sense because it is not an array. Its atoms are chars, but a &str is not stored as a sequence of chars.
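To make the byte-vs-char distinction concrete, a small std-only sketch:

```rust
fn main() {
    let s = "🌊🌊🌊";

    // The slice is indexed by UTF-8 bytes, not by chars.
    assert_eq!(s.len(), 12);          // 3 code points × 4 bytes each
    assert_eq!(s.chars().count(), 3); // 3 Unicode scalar values

    // The only valid slice boundaries are byte offsets 0, 4, 8, and 12.
    for (byte_index, ch) in s.char_indices() {
        println!("{byte_index}: {ch}");
    }
}
```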


I need to describe my problem more clearly.

My question is not "Why is random access to strings not allowed?"; I'm quite familiar with the Unicode standard. My point is the inconsistency:

  • string[n] issues are reported at compile time;
  • string[n..m] issues are reported at runtime;

Do you see the problem? The same thing, completely different behavior.

What is the difference between string[n] and string[n..].chars().next().unwrap(), and why should the former be opposed and the latter be advocated?

My question is:

  • If the designers believe that the runtime check is good enough, why don't we have string[n] with runtime checks?
  • If the designers believe the compile-time check is important, why do they allow string[n..m]?

string[n] does not work because it is not implemented. It could be implemented to mean "decode the code point starting at n and return it as a char". But it isn't.

string[n..m] is implemented and requires runtime checking to ensure soundness.


Actually the Index trait does not allow returning a char, so that isn't even possible.


string[n] cannot be implemented to return a char. An index expression is a place expression (an lvalue, if you're used to that term), which means it must be possible to take a reference to it. A str doesn't hold chars, which are Unicode scalar values, as-is.

Swift has one of the most interesting approaches in this category. A Character in Swift is literally a struct with a String field, as it represents a single extended grapheme cluster.


No I don't. There's no inconsistency. For starters, you assume these two are "the same thing". They aren't.

A string may have substrings, which are also strings. Therefore, str[start..end], when valid, yields the same type with the same (equally valid) meaning. You might supply an invalid range, but the operation makes sense in general.

In contrast, str[single_index] is not meaningful. A string is defined to be UTF-8, therefore its semantic building blocks aren't merely single bytes, they are variable-length grapheme clusters, which are themselves strings. Therefore, there is no good answer to return when indexing with a single integer:

  • if the character at that index happens to be encoded in a single byte, then what you really assume is that the string is ASCII and you should be operating on the underlying [u8] instead.
  • if the character is multi-byte, then indexing could either panic (which is useless since it would then always panic as we have established that you shouldn't use it for accessing ASCII byte strings), or it could return a substring of more than one byte, which would be counter-intuitive since you only asked for "one" element.

I can assure you strings have received an enormous amount of thought from the language team.


This is because ranges give strings, and there are useful use-cases for this (like splitting into substrings, working with char_indices).

Indexing could only give UTF-8 code units, which are generally useless, and exposing this has huge potential for misleading users into thinking these are "characters" (they very much aren't). Operations that are actually fine with using bytes, and don't want to work with characters are better done via .as_bytes().

Basically str[n] is not a correct way to read a "character", and Rust is intentionally designed to break code making such bad assumption.
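A short sketch of the two idioms this post recommends, using only the standard library:

```rust
fn main() {
    let s = "héllo"; // 'é' is 2 bytes in UTF-8

    // Working with bytes explicitly: use as_bytes(), not str indexing.
    assert_eq!(s.as_bytes()[0], b'h');

    // Splitting into substrings: char_indices yields byte offsets that
    // are guaranteed to be valid slice boundaries.
    let (i, _) = s.char_indices().nth(2).unwrap(); // start of "llo"
    assert_eq!(&s[i..], "llo");
}
```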


Forgive my horrible Swift but what about

let emoji = "🥺👉👈"
let ascii = "bottom"
let ix = ascii.index(after: ascii.index(after: ascii.startIndex))
print(emoji[ix])

This is perfectly allowed, doesn't crash, and prints


String.Index is a nice bit of salt to make people think about what they really want (and grapheme clusters by default is nice for a UX focused language). But it's not perfect (and it can't be!) as the above should show.

Swift tries to do the most correct thing it can by default.

Rust refuses to let you do anything unsafe, but it's perfectly happy to require you to know what you actually want and actually do that.


Well, I agree with part of your point:

  • string[n] has no meaning, we don't need it;
  • A string may have substrings, so we need a function to slice it;

But, does the parameter of slicing really have to be the range of usize?

I insist that string[n] and string[n..m] are "the same thing", they have a strong connection.

  • They both treat the string as a Vec<u8>;
  • They can easily be implemented in terms of each other, for example string[n..].chars().next().unwrap() or string[n..].bytes().next().unwrap();
  • They are very common APIs. If they are not safe, beginners may abuse them;

The Swift example I gave above seems to me to be more consistent.

                                     Rust   Swift   May cut inside a code point / cluster?
string[usize]                        No     No      Yes
string[usize..usize]                 Yes    No      Yes
string[String.Index]                 -      Yes     No
string[String.Index..String.Index]   -      Yes     No

I wish Rust had a mechanism like String.Index.

Just to be fair (not relevant to the discussion), Swift's String.Index has its own problems, it doesn't know exactly which string it is indexing.

let s1 = "abcdef"
let s2 = "🌊🌊🌊"

let index = s1.index(after: s1.startIndex)
print(s2[index]) // => \237 (meaningless output)

Note that "safe" has a very particular meaning in Rust. All of these things would be safe under the Rust definition of the word.

So please elaborate more on what you mean by that column, for a post here.


You're right, I changed it to "May cut inside a code point / cluster?".


But then it's not true for Swift's String.Index either… in the very example you showed, it cuts the "tsunami" code point in half. It just outputs garbage instead of panicking, which is (IMO) strictly worse than panicking.


string[usize..usize] in Rust can't chop up characters (it panics instead). To do otherwise would be unsafe, because strs must be valid UTF-8.

This is usually not really a concern because byte indices are not usually arbitrary numbers, they come from earlier processing of the string and are thus "known good". But if you make a mistake, you get a panic instead of a corrupted string.
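A sketch of what "known good" means in practice: the usual sources of byte indices, such as find, already return char boundaries.

```rust
fn main() {
    let s = "name=🌊🌊🌊";

    // `find` returns a byte offset that is guaranteed to lie on a
    // char boundary, so slicing with it cannot panic.
    let eq = s.find('=').unwrap();
    assert_eq!(&s[..eq], "name");

    // '=' is 1 byte, so eq + 1 is also a valid boundary.
    assert_eq!(&s[eq + 1..], "🌊🌊🌊");
}
```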


When I saw this topic was still active, I thought "the fact I still don't even know exactly what they wish existed is a good argument for string[n] being non-intuitive and thus not in std." Then I thought "no, it's probably you who weren't paying enough attention," so I went back and read everything and... I'm afraid I really don't know what you wish existed! These aren't the same for example:
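The example that followed here did not survive in this transcript; a hypothetical illustration of two expressions that look related but are not the same, based on the chars-vs-bytes contrast quoted earlier in the thread:

```rust
fn main() {
    let s = "🌊🌊🌊";

    // First *char* at a boundary: a whole Unicode scalar value.
    assert_eq!(s[4..].chars().next().unwrap(), '🌊');

    // First *byte* at the same position: a single UTF-8 code unit,
    // the lead byte 0xF0 of the 4-byte encoding of '🌊'.
    assert_eq!(s[4..].bytes().next().unwrap(), 0xF0);
}
```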

(Also, the following isn't true, or you wouldn't have to care about panics.)

  • [string[n..m] in Rust] treat[s] the string as a Vec<u8>;

If you can be precise about what you want, it may be possible via a new type. Or perhaps you're looking for the unicode-segmentation crate, from which you can create a Vec of grapheme strs. Or a combination thereof.
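A std-only approximation of the "Vec of pieces" idea, splitting by scalar value rather than by grapheme cluster (for the latter you would use the unicode-segmentation crate, which is not shown here):

```rust
fn main() {
    let s = "🌊é🌊";

    // Each piece is the substring covering one Unicode scalar value.
    let pieces: Vec<&str> = s
        .char_indices()
        .map(|(i, c)| &s[i..i + c.len_utf8()])
        .collect();

    assert_eq!(pieces, ["🌊", "é", "🌊"]);
}
```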


Hmm, now I'm puzzled, though, since those are different things. For example, Range<usize> indexing cannot cut inside a code point (it panics instead) but is happy to cut inside a grapheme so long as it's a codepoint boundary. And doesn't your \237 (meaningless output) example mean that Swift does allow cutting inside codepoints? (And thus inside grapheme clusters too.)


I don't think the OP is asking for anything in particular. Rather, they're pointing out what is, in a narrow sense, a contradiction.

On the one hand, the language allows you to index the raw bytes of a str for slices. It's expected that you have already verified the new slice will be meaningful; otherwise you will get a panic. Yet the same is not true for single-byte access. Even if you had already checked that the str contained only ASCII, you're still forbidden from indexing it as such.

I think the OP is suggesting that it would be more consistent to do one of two things. Either allow direct indexing of individual bytes, also with a panic on an invalid index. Or instead have an API that forces more checking at compile time. For example, perhaps you could only index str with some newtype wrapper that can only be derived by finding valid boundaries within the str first.
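A hypothetical sketch of that newtype idea (all names here are made up, nothing like this exists in std): an index can only be obtained by checking a boundary first.

```rust
// An index proven to lie on a char boundary of *some* string.
#[derive(Clone, Copy)]
struct StrIndex(usize);

// The only way to obtain a StrIndex: validate the offset.
fn boundary_at(s: &str, n: usize) -> Option<StrIndex> {
    s.is_char_boundary(n).then(|| StrIndex(n))
}

fn slice_from(s: &str, i: StrIndex) -> &str {
    // Still a runtime invariant: `i` must come from *this* string,
    // the same caveat the thread raises about Swift's String.Index.
    &s[i.0..]
}

fn main() {
    let s = "🌊🌊🌊";
    assert!(boundary_at(s, 1).is_none()); // inside the first 🌊
    let i = boundary_at(s, 4).unwrap();
    assert_eq!(slice_from(s, i), "🌊🌊");
}
```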

Others have already summarized arguments for the current behavior. I'm just trying to clarify what I think is OP's intent.


I don't personally see inconsistency in the design for Rust here. Rust's standard library simply doesn't impose any default way of segmenting strings. That's why only ranges work for slicing. The restriction for ranges to panic when UTF-8 encodings of individual unicode scalar values would be split is merely for soundness (because a str must not begin / end in such a place).

One secondary reason why string[n] doesn't work is: suppose we did want to, by default, include the whole scalar value after the given (byte) index. But then people might expect to get a char from such an operation (as you'd get from string[n..].chars().next().unwrap()). Since indexing in Rust must refer to an existing place, all that could be returned would be a str containing a single char.

There could be an argument for disallowing string[n..m], too: you might not want to impose a default way of indexing (i.e. by bytes vs. by scalar value vs. by grapheme cluster, etc.). However, choosing a default here (indexing by bytes) makes sense because it's the only thing that can be done efficiently.

I don't have a good picture of what this Swift API that's discussed here does, but it feels like there's no problem in implementing such a string-index type in a crate for Rust, too, with similar API. As far as I understood, it involves splitting by grapheme clusters, which is something that Rust standard library doesn't support, so an external crate seems appropriate.


This would be nice. But it seems hard to make work with borrow checker rules.

I've run into the same issue in other contexts: if the validated index stores a reference to the String, you can't pass it to a mutating function because that requires you to take a mutable reference to the String before calling the function that consumes the index.

It seems like there could be a way to loosen borrow checker rules somehow to allow this sort of thing.

This is actually quite terrifying. Instant UB right there - example:

let s1 = "abcdef"
let s2 = "🌊🌊🌊"

let index = s1.index(s1.startIndex,offsetBy: 4)
let s3 = String(s2[index])

results in: