fn main() {
    let x = "Testing";
    println!("{}", x[1]); // Error here
}
Why can't I index into a string literal and print the result? This is possible in Python, so why not in Rust? How can I print part of a string literal?
Strings can't be indexed with an integer because Rust stores them as UTF-8, in which a single character occupies between one and four bytes. Finding the nth character therefore means scanning from the start of the string rather than jumping to a fixed offset.
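To make the variable widths concrete, here is a minimal sketch that reports how many bytes each character of a string needs in UTF-8. The helper name `utf8_widths` and the sample string are made up for illustration; `char::len_utf8` is the standard library method doing the work.

```rust
/// Pairs each character of `s` with the number of bytes it occupies in UTF-8.
/// (`utf8_widths` is a hypothetical helper, not a standard library function.)
fn utf8_widths(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}

fn main() {
    // 'a', U+00E9 (e with acute), U+65E5 (a CJK ideograph), U+1F980 (crab emoji)
    let sample = "a\u{e9}\u{65e5}\u{1f980}";

    for (ch, width) in utf8_widths(sample) {
        println!("{:?} takes {} byte(s)", ch, width);
    }

    // `len()` counts bytes, not characters.
    println!("{} chars, {} bytes", sample.chars().count(), sample.len());
}
```

Because the widths differ, byte offset 1 in one string may be the start of the second character, while in another it falls in the middle of the first.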
Instead, you can iterate over a string's chars like so:
let x = "Testing";
for ch in x.chars() {
    // use `ch` (one Unicode scalar value) here
}
And to do what you originally wanted, you can do this:
let x = "Testing";
for (index, ch) in x.chars().enumerate() {
    if index == 1 {
        println!("{}", ch);
        break;
    }
}
Additionally, you can collect them into a Vec<char> like so:
let x = "Testing";
let v = x.chars().collect::<Vec<char>>();
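Once the characters are in a `Vec<char>`, ordinary integer indexing works. The sketch below assumes a hypothetical helper named `to_chars`. The trade-off is memory: a `char` is always four bytes, so the vector can be up to four times larger than the original UTF-8 string.

```rust
/// Collects the characters of `s` into a vector so they can be indexed.
/// (`to_chars` is a hypothetical helper name used for this example.)
fn to_chars(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    let v = to_chars("Testing");

    // Direct indexing panics when the index is out of range...
    println!("{}", v[1]); // prints "e"

    // ...while `get` returns an Option instead.
    println!("{:?}", v.get(10)); // prints "None"

    // Every element is a fixed-size `char`.
    println!("{} bytes per char", std::mem::size_of::<char>());
}
```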
So which characters take up more memory than others? And why can characters be indexed in Python? Is it because Python isn't using UTF-8?
Thanks for providing the solution
Well, as explained here, a regular string in Python is ASCII, where every character is exactly one byte long, while UTF-8 (an encoding of Unicode) can contain things like emojis or non-Latin characters, which take more than one byte each.
Oh I see, thanks for that.
That is only true for Python 2's str type (known as bytes in Python 3). Python 3's str type (known as unicode in Python 2) also allows indexing of individual characters.
Resources on Python claim that Python stores "each code point" separately. I am, however, having a difficult time telling whether that means it is encoded in UTF-32 (or similar), or if, more likely, those resources are incorrect and it is encoded in WTF-16.
In either case, I imagine that Python simply returns a substring from the ith element to the (i+1)th element, whether that substring is well-formed or not.
This is covered clearly and directly in the book: Storing UTF-8 Encoded Text with Strings - The Rust Programming Language
If I recall correctly, Python changes the layout of strings on the fly depending on how many bytes each code point will fit into. So if a string only contains Latin-1, it uses one byte units. If a string contains Japanese text, it's probably two byte units. If it contains more exotic text, probably four byte units.
But keep in mind that codepoints are not characters. Characters can be comprised of an arbitrary number of codepoints. In addition, whether a sequence of codepoints counts as a single symbol can depend on your operating system and what font is being used.
and on the geopolitical state of the world:
Edit: Sadly it seems Chrome on Windows always treats pairs of these codepoints as one character even if invalid (e.g. 🇪🇧 <-- try selecting one of the two characters in that). Which is less exciting.
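A small sketch of the codepoint-versus-character distinction discussed above: what a reader sees as one symbol can be one, two, or more `char`s in Rust. The helper name `count_scalars` is hypothetical. The standard library only counts code points; splitting text into user-perceived characters needs a third-party crate such as unicode-segmentation, which is not used here.

```rust
/// Counts the Unicode scalar values (Rust `char`s) in `s`.
/// (`count_scalars` is a hypothetical helper name used for this example.)
fn count_scalars(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let precomposed = "\u{e9}"; // e with acute as a single code point
    let decomposed = "e\u{301}"; // 'e' followed by a combining acute accent
    let flag = "\u{1F1EF}\u{1F1F5}"; // regional indicators J and P

    println!("{}", count_scalars(precomposed)); // prints 1
    println!("{}", count_scalars(decomposed)); // prints 2
    println!("{}", count_scalars(flag)); // prints 2

    // The two accented forms usually render the same but do not compare equal.
    println!("{}", precomposed == decomposed); // prints false
}
```

So even `x.chars().nth(i)` gives the ith code point, which is not necessarily the ith symbol a reader would count.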
Instead of this, you can do
let nth_char: Option<char> = x.chars().nth(index);
Iterators are amazing!
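For completeness, here is the `nth` one-liner as a runnable sketch applied to the original question. The wrapper function `nth_char` is a hypothetical name. Note that `nth` still walks the string from the start, so it is a linear-time lookup rather than a constant-time index.

```rust
/// Returns the `index`-th `char` of `s`, or `None` if the string is too short.
/// (`nth_char` is a hypothetical wrapper around `chars().nth`.)
fn nth_char(s: &str, index: usize) -> Option<char> {
    s.chars().nth(index)
}

fn main() {
    let x = "Testing";

    match nth_char(x, 1) {
        Some(ch) => println!("{}", ch), // prints "e"
        None => println!("index out of range"),
    }

    // An out-of-range index yields None instead of panicking.
    println!("{:?}", nth_char(x, 100)); // prints "None"
}
```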
Firefox treats it as two. So it depends on your OS, font and browser.
Also, don't forget that things like the non-standard cat ninja emoji on Windows 10 are things, so you can't even rely on the official, canonical definition of what symbols exist.
Because text wasn't hard enough already...
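You can instead request a substring that is one byte (one UTF-8 code unit) long. Note the &, since a bare str slice is unsized and can't be passed to println! directly:
println!("{}", &x[1..2]);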
Note, however, that this will helpfully panic at runtime if either end of the range falls inside a multi-byte Unicode scalar value, so if you want to do text "properly", you want one of the other things people have mentioned in this thread.
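If a panic is not acceptable, one option is `str::get`, which returns `None` for a byte range that is out of bounds or that cuts through a character. A minimal sketch, with `byte_slice` as a hypothetical helper name:

```rust
/// Returns the byte range `start..end` of `s`, or `None` when the range is
/// out of bounds or does not lie on character boundaries.
/// (`byte_slice` is a hypothetical helper around `str::get`.)
fn byte_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let ascii = "Testing";
    println!("{:?}", byte_slice(ascii, 1, 2)); // prints Some("e")

    // In this string each accented 'e' (U+00E9) occupies two bytes.
    let accented = "\u{e9}t\u{e9}";
    println!("{:?}", byte_slice(accented, 1, 2)); // prints None
    println!("{}", accented.is_char_boundary(1)); // prints false
    println!("{:?}", byte_slice(accented, 2, 3)); // prints Some("t")
}
```

With `&accented[1..2]` the same out-of-boundary range would panic, whereas `get` lets the caller decide how to handle it.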
As you may understand by now, direct string indexing is complex, even more complex than most people can imagine. This is why Rust decided not to support it.