Slices: why can't I use just one number?

fn main()
{
    let word = String::from("Hello World");
    println!("{}", &word[0..2]);
}

So this code works fine; it gets the bytes at indices 0 and 1.

But if I want to get only one character, println!("{}", &word[0]); doesn't work. I understand I can type println!("{}", &word[0..=0]); instead. But why doesn't &word[0] work?

For all the reasons given to you in your other thread where you asked about indexing strings:

I thought this was indexing, since it points from index 0 to 1?

This is a different case from the other post: I am confused about the difference between [0..=0] and [0].

It’s byte indexing, and you have to include the entirety of any multibyte sequence to avoid a panic. This panics, for example:

fn main()
{
    let word = String::from("👋 Hello, World!");
    println!("{}", &word[0..2]);
}

As to why you can’t use a single number index, it simply isn’t defined in the standard library. You might be able to make a case for adding it, since I can only think of one reasonable interpretation: the multibyte sequence that starts at the given byte position. (i.e. word[x..].chars().next().unwrap())
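The interpretation described above can be sketched as a small helper. The name char_at_byte is hypothetical (it is not a standard library API); it uses str::get so that a bad offset gives None instead of a panic:

```rust
/// Returns the char that starts at byte offset `idx`.
/// Gives `None` if `idx` is past the end, exactly at the end,
/// or in the middle of a multibyte sequence.
/// (`char_at_byte` is a hypothetical name used for illustration.)
fn char_at_byte(s: &str, idx: usize) -> Option<char> {
    // `str::get` returns None rather than panicking on an invalid range.
    s.get(idx..)?.chars().next()
}
```

With "👋 Hello, World!", offset 0 yields the waving-hand char, offset 2 yields None because it falls inside the 4-byte emoji, and offset 4 yields the space that follows it.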

[0 ..= 0]

[start ..{=} end] is a subslicing operation, meaning that it yields a &str, that is, a &[u8] (sub)slice of bytes that is valid UTF-8. It panics if the indices are not at "char" (Unicode code point) boundaries (byte indexing!).

[0]

[idx] would be an ambiguous operation.

  • I think you would intend it to be the .chars().nth(idx) operation, which does not operate in constant time, thus contradicting Rust's design choice of making (potentially) expensive operations visible.

  • Given that, I personally would expect [0]-like indexing to be .as_bytes()[0], since that is a constant-time operation.
    But then if you did println!("{}", "*"[0]); you would be getting a perhaps surprising result of it displaying 42.

  • EDIT: there is also the option of it yielding [idx..].chars().next().unwrap(), that is, the char that starts at byte-index idx, but mixing chars and byte-indexing like this is surely a code smell (google for char vs. glyph vs. grapheme cluster).

That's why the choice was made to offer neither, and force people to be explicit about whether they want the .chars().nth(idx) operation or the .as_bytes()[idx] one.
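A minimal sketch of the two explicit spellings, wrapped in helpers whose names (nth_char, nth_byte) are made up for this example:

```rust
/// The `idx`-th Unicode scalar value.
/// Walks the string from the start, so it is O(n), not constant time.
fn nth_char(s: &str, idx: usize) -> Option<char> {
    s.chars().nth(idx)
}

/// The `idx`-th raw byte of the UTF-8 encoding. Constant time.
fn nth_byte(s: &str, idx: usize) -> Option<u8> {
    s.as_bytes().get(idx).copied()
}
```

For "*", nth_char gives the char '*' while nth_byte gives the number 42, which is the surprising result mentioned above.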

Ok, let's start with some code.

fn main()
{
    let word = "안녕하세요";
    println!("{}", &word[1]);
}

What can be printed here?

  • "녕": &str: This is the second character of the string. But since UTF-8 is a variable-width encoding, to get it we need to decode the string from the start up to that character. So it's an O(n) operation. I think it's too surprising for string indexing to be an O(n) operation.

  • '녕': &char: A reference to the second character as a char? It's not possible, since the str doesn't store actual char values in memory. A char in Rust is a 4-byte-wide type which represents a Unicode scalar value, which spans from 0 to 0x10FFFF excluding the surrogate range.

  • 149: &u8: It's the second byte of the underlying UTF-8 encoded text. It's possible, but do we really want to get a u8 on string indexing?
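The first option in the list above can be written out by hand, which makes the O(n) walk visible. This is only a sketch; nth_char_str is a hypothetical helper name:

```rust
/// Returns the `n`-th character of `s` as a one-character `&str`.
/// `char_indices` decodes from the start of the string, so this is O(n).
/// (`nth_char_str` is a made-up name for illustration.)
fn nth_char_str(s: &str, n: usize) -> Option<&str> {
    // `start` is the byte offset where the n-th char begins.
    let (start, ch) = s.char_indices().nth(n)?;
    // `len_utf8` is how many bytes that char occupies in the encoding.
    s.get(start..start + ch.len_utf8())
}
```

For "안녕하세요", nth_char_str(word, 1) is Some("녕"), while word.as_bytes()[1] is the 149 from the third option.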

You didn’t include the interpretation that I would expect, which is an &str that represents the codepoint that starts at byte 1. So str[x] would be exactly equivalent to str[x..x+str.len_of_char_at(x)] for a hypothetical len_of_char_at method: it’s constant time and returns the smallest sliceable prefix of a subslice that starts at the same index.
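One way the hypothetical len_of_char_at could look, as a sketch only. It relies on the fact that a str is always valid UTF-8, so the leading byte of a sequence determines its length. Both function names here are assumptions for illustration, and this version returns Option rather than panicking:

```rust
/// Byte length of the UTF-8 sequence starting at byte offset `idx`,
/// or `None` if `idx` is out of range or not on a char boundary.
/// (Hypothetical helper; not part of the standard library.)
fn len_of_char_at(s: &str, idx: usize) -> Option<usize> {
    if idx >= s.len() || !s.is_char_boundary(idx) {
        return None;
    }
    // In valid UTF-8 the leading byte encodes the sequence length.
    Some(match s.as_bytes()[idx] {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        _ => 4,
    })
}

/// The one-codepoint `&str` that starts at byte offset `idx`.
fn char_slice_at(s: &str, idx: usize) -> Option<&str> {
    let len = len_of_char_at(s, idx)?;
    s.get(idx..idx + len)
}
```

On "안녕하세요", char_slice_at(word, 0) is Some("안"), offset 1 gives None because it is inside the first character, and offset 3 gives Some("녕").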

To be fair, I can’t think of many situations I’d want to use this operation; it just seems to be the most natural extension of the existing subslice operation to a single index.

I get it now, mate, thanks :)

I think str[x] would first and foremost lead to lots of people

  1. questioning why it doesn’t give a char even though it’s only one code-point
  2. writing lots of 𝒪(𝑛²) loops over strings that panic as soon as they encounter anything non-ASCII

I do think there ought to be a simpler way than str[x..x+str[x..].chars().next().unwrap().len_utf8()] though.

Edit: On second thought, the only advantage that a hypothetical str[x] behaving like the long expression above would have (over str[x..].chars().next().unwrap() producing a copy of the char) is that it allows access through a &mut str. However, &mut str is a really inflexible beast anyway. There’s not much you can do apart from make_ascii_uppercase and make_ascii_lowercase (or further splitting / re-slicing and various unsafe operations). Even with unsafe, you can only replace the codepoint with one or multiple codepoints with the same total number of UTF-8 code units (i.e. bytes).

So what is the actual deal here?

As far as I can make out:

&word[0..=0]

is a slice operation. The range had better cover whole UTF-8 sequences, or else it panics.

&word[0]

is an indexing operation and is not defined for String, for all the reasons discussed here and in the OP's other thread on the topic.

Before, I just wanted to know the difference between [0..=0] and [0], that is all (I get it now).

Yeah, I actually wanted to be sure myself :)

This is an intentionally created artificial limitation in Rust's interface, because Rust really wants to remind you that these are Unicode strings, and Unicode does not have constant-time "character" indexing. The n-th byte or n-th codepoint is not the n-th "character".

Lots of programming languages ignore that problem and give you the false impression that such indexing is possible, by giving you wrong results that look correct in simple cases like "hello world".

The closest thing to human-recognizable characters in Unicode is the grapheme cluster, and these are expressed as ranges of code points (or ranges of code units, or ranges of bytes), so that's why range indexing is supported. The grapheme cluster algorithm is big and complex (because the sum of the complexity of all human languages is complex), so it's not in the standard library, but you can use a 3rd party library to give you ranges of grapheme clusters.
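A small standard-library-only illustration of why neither bytes nor codepoints line up with "characters". The helper name byte_and_scalar_counts is made up for this sketch; it does not do grapheme segmentation, it only counts bytes and Unicode scalar values:

```rust
/// Returns (number of UTF-8 bytes, number of Unicode scalar values).
/// (Hypothetical helper for illustration only.)
fn byte_and_scalar_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

The string "e\u{301}" (the letter e followed by a combining acute accent) is displayed as a single accented letter, yet it is 3 bytes and 2 scalar values. The precomposed form "\u{e9}" looks the same but is 2 bytes and 1 scalar value, and plain "hello" is 5 and 5.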

If string[0] were implemented, it would end up reading 1/n of a character, where for ASCII n=1 so it seems to work, but n can be arbitrarily large.

If you're not interested in Unicode, then you can use string.as_bytes()[0] and that will give you the first byte.

I'm a bit surprised that

    println!("{}", &word[0..2]);

will panic if the slice does not line up with valid UTF-8. Yet there seems to be no way to catch that error if you wanted to.

word.get(0..2).ok_or("you can catch this error")?

This is the same as vec[x] panicking, but vec.get(x) returning Option.
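A slightly fuller sketch of the same idea, with the ? operator replaced by an explicit Result so it stands alone. The function name prefix and the error message are placeholders:

```rust
/// Returns the first `end` bytes of `word` as a `&str`, or an error
/// if the range is out of bounds or would split a multibyte character.
/// (`prefix` is a hypothetical name used for illustration.)
fn prefix(word: &str, end: usize) -> Result<&str, &'static str> {
    // `str::get` returns None where `&word[0..end]` would panic.
    word.get(0..end)
        .ok_or("range is out of bounds or splits a multibyte character")
}
```

prefix("Hello World", 2) is Ok("He"), while prefix("👋 Hello, World!", 2) is an Err instead of a panic.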

OK, sounds reasonable.

What kornel said is the best option, but of course you can also check str::is_char_boundary yourself before indexing the same way you can check the length yourself before indexing.
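Checking the boundaries by hand might look roughly like this. The name slice_checked is an assumption for this sketch; in practice str::get already performs the same checks:

```rust
/// Slices `s[start..end]` only after verifying that the range is ordered,
/// in bounds, and that both ends fall on char boundaries.
/// (`slice_checked` is a hypothetical name used for illustration.)
fn slice_checked(s: &str, start: usize, end: usize) -> Option<&str> {
    // `is_char_boundary` is false for indices past the end of the string
    // and true for index == s.len().
    if start <= end && s.is_char_boundary(start) && s.is_char_boundary(end) {
        Some(&s[start..end])
    } else {
        None
    }
}
```

slice_checked("👋 Hello", 0, 4) is Some("👋"), and slice_checked("👋 Hello", 0, 2) is None because byte 2 is inside the emoji.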
