Len and is_char_boundary mismatch

OK, so if I understand correctly, in a str, len() returns the number of bytes in the string, and is_char_boundary(index) checks whether the byte at index is the first byte of a code point, or whether index is the end of the string.

pub fn _str_test1() {
    let s = "ツ";
    let index = 3;

    println!("bytes: {}, index: {}, is limit?: {}", s.len(), index, s.is_char_boundary(index));
}

Here len is 3. So shouldn't this return true only when index is 0 or 2, since 0 is the first byte (hence a code point boundary) and 2 is the end of the string (bytes 0 through 2 make up all 3 bytes)? But I get false when index = 2 and true when index = 3, which should point to nothing.
So what is going on here? Obviously I am missing something.

Thanks!

The character boundaries are the indices at which you can slice the string into characters.

3 is a character boundary because s[..3] contains the first character, and s[3..] contains the rest of the string (which in this case is nothing):

fn main() {
    let s = "ツ";
    println!("{:?}, {:?}", &s[0..3], &s[3..]); // prints "ツ", ""
}
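
To see the other half of the question (why index 2 is rejected), here is a small sketch using str::get, which returns None instead of panicking when a range does not fall on character boundaries. The helper name try_prefix is made up for this illustration.

```rust
/// Hypothetical helper: returns the prefix `s[..end]`, or `None` if `end`
/// is out of range or lands in the middle of a character.
fn try_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..end)
}

fn main() {
    let s = "ツ";

    // Index 2 is the last byte of the three-byte encoding, not a boundary.
    assert!(!s.is_char_boundary(2));
    assert_eq!(try_prefix(s, 2), None);

    // Index 3 == s.len() is the end of the string, which is a boundary.
    assert!(s.is_char_boundary(3));
    assert_eq!(try_prefix(s, 3), Some("ツ"));

    // Anything past the end is neither a boundary nor sliceable.
    assert!(!s.is_char_boundary(4));
    assert_eq!(try_prefix(s, 4), None);
}
```

Writing &s[..2] directly would panic at runtime for the same reason that get(..2) returns None here.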


Similarly, the string "ツA" has character boundaries at 0, 3, and 4, because its characters are at s[0..3] and s[3..4].
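
If you want to check this yourself, a minimal sketch is to test every index from 0 through len() inclusive. The function name char_boundaries is just something I picked for the example.

```rust
/// Hypothetical helper: collects every byte index in `0..=s.len()`
/// for which `is_char_boundary` returns true.
fn char_boundaries(s: &str) -> Vec<usize> {
    (0..=s.len()).filter(|&i| s.is_char_boundary(i)).collect()
}

fn main() {
    println!("{:?}", char_boundaries("ツA")); // prints [0, 3, 4]
    println!("{:?}", char_boundaries("ツ")); // prints [0, 3]
}
```

Note the inclusive range: s.len() itself is always a boundary, so a string with n characters has n + 1 boundaries.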

It's often useful to think of zero-based indices as pointing between the elements of an array, so each element has an index that points to its start and one that points to its end. The length of the array corresponds to an index that points at the end of its last element. So the byte indices of "ツA" could be drawn like this:

+---+---+---+---+
|    TSU    | A |
+---+---+---+---+
0   1   2   3   4

…where the pipe characters (|) represent the character boundaries.
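
The same picture can be produced in code. As a rough sketch, char_indices gives the start index of each character, and adding len_utf8 gives the index pointing at its end. The name char_spans is only for illustration.

```rust
/// Hypothetical helper: returns (start, end, char) for each character,
/// where `start..end` is the byte range that character occupies.
fn char_spans(s: &str) -> Vec<(usize, usize, char)> {
    s.char_indices()
        .map(|(start, c)| (start, start + c.len_utf8(), c))
        .collect()
}

fn main() {
    let s = "ツA";
    for (start, end, c) in char_spans(s) {
        // Both ends of each span are character boundaries, so slicing is safe.
        println!("{} occupies bytes {}..{}", &s[start..end], start, end);
        assert_eq!(&s[start..end], c.to_string());
    }
}
```

The end of one character's span is the start of the next, and the end of the last span is s.len().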
