How to print out part of a string literal

fn main()
{
    let x = "Testing";

    println!("{}", x[1]); // Error here
}

Why can't I index into a string literal and print the result? This is possible in Python, so why not in Rust? How can I print part of a string literal?

Strings can't be indexed with an integer because they are stored as UTF-8, where individual characters take different numbers of bytes in memory, so the position of the nth character has to be found by walking the string from the start.
Instead you can iterate over a string's chars like so:

let x = "Testing";
for ch in x.chars() {
    // `ch` is each `char` of the string in turn
}

and to do what you wanted to do originally, you can do this:

let x = "Testing";
for (index, ch) in x.chars().enumerate() {
    if index == 1 {
        println!("{}", ch);
        break;
    }
}

Additionally you can collect them into a Vec<char> like so:

let x = "Testing";
let v = x.chars().collect::<Vec<char>>();
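
Once the chars are collected, the Vec can be indexed like an array. A minimal sketch of that idea follows; `to_chars` is a hypothetical helper name, and note that collecting walks the whole string once and stores four bytes per char, so it mostly pays off when you need many lookups:

```rust
/// Hypothetical helper: collect a string's Unicode scalar values into a Vec.
fn to_chars(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    let v = to_chars("Testing");

    // Direct indexing works on a Vec<char>, but panics if the index is out of range.
    println!("{}", v[1]);

    // `get` returns an Option instead of panicking.
    match v.get(10) {
        Some(ch) => println!("{}", ch),
        None => println!("no character at index 10"),
    }
}
```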

So what characters can take up more memory compared to other characters? So in Python why can the characters be indexed? Is it cause it is not using UTF-8?

Thanks for providing the solution :slight_smile:

Well, as explained here, a regular string in Python is ASCII, where every character is exactly one byte long, while UTF-8 (an encoding of Unicode) can also contain things like emoji or non-Latin characters, which take more than one byte each.
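
To see which characters take more bytes, you can ask each char for its UTF-8 length. This is only an illustrative sketch; `utf8_widths` is a hypothetical helper name and the sample characters are my own picks:

```rust
/// Hypothetical helper: pair each char of `s` with the number of bytes
/// it occupies when encoded as UTF-8.
fn utf8_widths(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}

fn main() {
    // 'a', U+00E9 (e with acute accent), U+65E5 (a CJK ideograph), U+1F600 (an emoji)
    let sample = "a\u{e9}\u{65e5}\u{1f600}";

    for (c, n) in utf8_widths(sample) {
        println!("{:?} takes {} byte(s)", c, n);
    }

    // The byte length of the whole string is the sum of the widths: 1 + 2 + 3 + 4.
    println!("total bytes: {}", sample.len());
}
```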

Oh I see, thanks for that.

That is only true for Python 2's str type (known as bytes in Python 3). Python 3's str type (known as unicode in Python 2) also allows indexing of individual characters.

Resources on Python claim that Python stores "each code point" separately. I am, however, having a difficult time telling whether that means it is encoded in UTF-32 (or similar), or if, more likely, those resources are incorrect and it is encoded in WTF-16.

In either case, I imagine that python simply returns a substring from the ith element to the i+1th element, whether that substring is well-formed or not.

This is covered clearly and directly in the book: Storing UTF-8 Encoded Text with Strings - The Rust Programming Language

If I recall correctly, Python changes the layout of strings on the fly depending on how many bytes each code point will fit into. So if a string only contains Latin-1, it uses one-byte units. If a string contains Japanese text, it's probably two-byte units. If it contains more exotic text, probably four-byte units.

But keep in mind that codepoints are not characters. Characters can be comprised of an arbitrary number of codepoints. In addition, whether a sequence of codepoints counts as a single symbol can depend on your operating system and what font is being used.
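
A small sketch of the code point versus character distinction, using only the standard library. `code_points` is a hypothetical helper name. Rust's `chars()` yields Unicode scalar values, not user-perceived characters; as far as I know, grouping into the latter needs an external crate such as unicode-segmentation, which is not shown here:

```rust
/// Hypothetical helper: count the Unicode scalar values in `s`.
fn code_points(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // An accented "e" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    let decomposed = "e\u{301}";
    // The same visible symbol as a single precomposed code point, U+00E9.
    let precomposed = "\u{e9}";

    println!("decomposed:  {} code points", code_points(decomposed));
    println!("precomposed: {} code points", code_points(precomposed));

    // They render the same, but compare as different strings
    // because no normalization is applied.
    println!("equal: {}", decomposed == precomposed);
}
```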

and on the geopolitical state of the world:

:us: :greece: :fr: :jp:

Edit: Sadly, it seems Chrome on Windows always treats pairs of these codepoints as one character even if invalid (e.g. 🇪🇧 <-- try selecting one of the two characters in that), which is less exciting.

Instead of the enumerate loop above, you can do

let nth_char: Option<char> = x.chars().nth(index);

Iterators are amazing!
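
For completeness, a runnable sketch of the `nth` approach with the Option handled. `nth_char` is a hypothetical wrapper name; `nth` still walks the string from the start, and it counts chars rather than bytes:

```rust
/// Hypothetical wrapper: the char at position `index`, counting chars, not bytes.
fn nth_char(s: &str, index: usize) -> Option<char> {
    s.chars().nth(index)
}

fn main() {
    let x = "Testing";

    match nth_char(x, 1) {
        Some(ch) => println!("{}", ch),
        None => println!("no character at that index"),
    }

    // An index past the end gives None rather than a panic.
    assert_eq!(nth_char(x, 7), None);

    // With a multi-byte char in the way, the index still counts chars:
    // 'h', U+00E9, 'l', 'l', 'o'.
    assert_eq!(nth_char("h\u{e9}llo", 2), Some('l'));
}
```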

Firefox treats it as two. So it depends on your OS, font and browser.

Also, don't forget that things like the non-standard cat ninja emoji on Windows 10 are things, so you can't even rely on the official, canonical definition of what symbols exist.

Because text wasn't hard enough already...

You can instead request a substring that is one byte (one UTF-8 code unit) long. The slice has to be borrowed, since a bare str is unsized:

println!("{}", &x[1..2]);

Note, however, that this will helpfully panic at runtime if either end of the range falls inside a multi-byte Unicode scalar value, so if you want to do text "properly", you want one of the other things people have mentioned in this thread.
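
If you would rather get an Option than a panic, `str::get` accepts the same byte ranges, and `is_char_boundary` lets you check an offset up front. A minimal sketch; `byte_slice` is a hypothetical helper name and the sample strings are mine:

```rust
/// Hypothetical helper: the substring covering bytes `start..end`, or None if
/// the range is out of bounds or cuts through a multi-byte char.
fn byte_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let ascii = "Testing";
    // Every char is one byte here, so bytes 1..2 are exactly the second char.
    println!("{:?}", byte_slice(ascii, 1, 2));

    // In this string U+00E9 occupies bytes 1 and 2.
    let accented = "h\u{e9}llo";
    // 1..2 would split that char in half, so we get None instead of a panic.
    println!("{:?}", byte_slice(accented, 1, 2));
    // 1..3 covers the whole char.
    println!("{:?}", byte_slice(accented, 1, 3));

    // Byte offset 2 is in the middle of the char, so it is not a boundary.
    println!("{}", accented.is_char_boundary(2));
}
```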

As you may understand now, direct string indexing is complex, even more complex than most people can imagine. This is why Rust decided not to support it.
