Len and is_char_boundary mismatch

juanito · August 16, 2021, 12:23am

OK, so IIUC, for a str, len() returns the number of bytes in the string, and is_char_boundary(index) checks whether the byte at index is the first byte of a code point or the end of the string.

pub fn _str_test1() {
    let s = "ツ";
    let index = 3;

    println!("bytes: {}, index: {}, is limit?: {}", s.len(), index, s.is_char_boundary(index));
}

Here len is 3. So shouldn't this return true only when index is 0 or 2? Index 0 is the first byte, hence a code point boundary, and 2 should be the end of the string, since bytes 0 through 2 make up the 3 bytes. BUT I get false when index=2 and true when index=3, which should point to nothing.
So what is going on here? Obviously I am missing something.

Thanks!

mbrubeck · August 16, 2021, 12:36am

The character boundaries are the indices at which you can slice the string into characters.
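To make "can slice" concrete, here is a minimal sketch. The helper name try_split is hypothetical (it is not part of the standard library); it only combines is_char_boundary with split_at, both of which are standard str methods:

```rust
/// Hypothetical helper: split `s` at byte `index`, but only if `index`
/// is a character boundary. Otherwise return None instead of panicking.
fn try_split(s: &str, index: usize) -> Option<(&str, &str)> {
    if s.is_char_boundary(index) {
        // Safe to call: split_at only panics on non-boundary indices.
        Some(s.split_at(index))
    } else {
        None
    }
}

fn main() {
    let s = "ツ"; // one character, three bytes in UTF-8

    // Index 3 is a boundary: everything before it, nothing after it.
    println!("{:?}", try_split(s, 3));
    // Index 2 falls inside the character, so there is no valid split.
    println!("{:?}", try_split(s, 2));
}
```

Slicing directly with &s[..2] would panic at run time for the same reason that is_char_boundary(2) returns false.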

3 is a character boundary because s[..3] contains the first character, and s[3..] contains the rest of the string (which in this case is nothing):

fn main() {
    let s = "ツ";
    println!("{:?}, {:?}", &s[0..3], &s[3..]); // prints "ツ", ""
}


Similarly, the string "ツA" has character boundaries at 0, 3, and 4, because its characters are at s[0..3] and s[3..4].
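If you want to check this yourself, a small sketch like the following lists every boundary of a string by testing each index from 0 through len() inclusive. The function name boundaries is just an illustrative choice:

```rust
/// Collect every byte index in 0..=s.len() that is a character boundary.
/// Note the inclusive range: s.len() itself is always a boundary.
fn boundaries(s: &str) -> Vec<usize> {
    (0..=s.len()).filter(|&i| s.is_char_boundary(i)).collect()
}

fn main() {
    println!("{:?}", boundaries("ツ"));  // prints [0, 3]
    println!("{:?}", boundaries("ツA")); // prints [0, 3, 4]
}
```

Indices 1 and 2 are missing from both results because they fall in the middle of the three-byte encoding of ツ.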

It's often useful to think of zero-based indices as pointing between the elements of an array, so each element has an index that points to its start and one that points to its end. The length of the array corresponds to an index that points at the end of its last element. So the byte indices of "ツA" could be drawn like this:

+---+---+---+---+
|    TSU    | A |
+---+---+---+---+
0   1   2   3   4

…where the pipe characters (|) represent the character boundaries.
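The same picture can be computed rather than drawn. This is only a sketch, and char_spans is a made-up name; it uses char_indices for each character's start index and len_utf8 for its width in bytes, so every character gets a start and an end:

```rust
/// For each character, return (character, start byte index, end byte index).
/// The end of one character is the start of the next.
fn char_spans(s: &str) -> Vec<(char, usize, usize)> {
    s.char_indices()
        .map(|(start, c)| (c, start, start + c.len_utf8()))
        .collect()
}

fn main() {
    let s = "ツA";
    for (c, start, end) in char_spans(s) {
        println!("{:?} spans bytes {}..{}", c, start, end);
    }

    // The end of the last character is the same index as s.len().
    assert_eq!(char_spans(s).last().map(|&(_, _, end)| end), Some(s.len()));
}
```

Every start and end value produced here is an index for which is_char_boundary returns true, which is why the string's length always counts as a boundary.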

