Accessing the char at a byte index


#1

I get why Index<usize> wasn’t implemented (because the Index trait has to return a reference, and there is no stored char to point at), but what’s the workaround?

let c = string[i..].chars().next().unwrap();

Mine is hideous, so hopefully there’s a better way. :frowning:


#2

From the docs:

Indexing is intended to be a constant-time operation, but UTF-8 encoding does not allow us to do this. Furthermore, it’s not clear what sort of thing the index should return: a byte, a codepoint, or a grapheme cluster. The bytes and chars methods return iterators over the first two, respectively.

So you need to iterate through the string to find the character you want. Something like this:

my_string.chars().nth(i).unwrap()

my_string.bytes().nth(i).unwrap()
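Note that the two calls above count different things: chars().nth(i) counts chars, while bytes().nth(i) counts bytes, so they only agree on pure-ASCII strings. A small sketch of the difference (the helper names nth_char and nth_byte are made up for illustration):

```rust
/// Returns the n-th char, counting chars rather than bytes.
/// This walks the string from the start, so it is O(n).
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Returns the n-th byte of the UTF-8 encoding.
fn nth_byte(s: &str, n: usize) -> Option<u8> {
    s.bytes().nth(n)
}
```

For "h\u{e9}llo" (that is, "héllo", where é is encoded as the two bytes 0xC3 0xA9), nth_char(s, 2) gives Some('l'), but nth_byte(s, 2) gives Some(0xA9), the second byte of é.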

#3

Char-based indexing can’t be constant-time, but getting the char at a byte index could be. It just needs to panic or return Option<char> in case your index points at a UTF-8 continuation byte.
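A minimal sketch of the Option<char> variant, built on str::get, which returns None instead of panicking when the range is out of bounds or does not fall on a char boundary. The name char_at is hypothetical, not a std method:

```rust
/// Returns the char that starts at byte offset `i`, or None if `i` is
/// past the end of the string or lands inside a multi-byte sequence.
fn char_at(s: &str, i: usize) -> Option<char> {
    // `get(i..)` checks bounds and the char boundary; decoding the
    // first char of the remaining slice reads at most four bytes.
    s.get(i..)?.chars().next()
}
```

This is the same idea as string[i..].chars().next().unwrap() from the first post, only without the panics: an offset equal to the string length also yields None, because the remaining slice is empty.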


#4

If you want the byte at a given offset, that is indeed constant time:

let byte: u8 = my_string.as_bytes()[i];

#5

I want the char that starts at a given byte offset.


#6

@quadrupleslap Your original solution looks good to me.


#7


My reply may be off base, but your statement above implies to me that you know in advance that the specific byte offset(s) of interest are not in the middle of multi-byte UTF-8 codepoints. If that is truly the case, then you can unsafely coerce what you are asking for. Otherwise you need to specify, for indices that actually refer to the middle of a multi-byte codepoint, what ASCII replacement character you want for that non-ASCII character, and probably use a Cow<str> so that a cleaned copy of the string is only allocated when the original is unusable.
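One way to read the copy-on-write suggestion above, as a rough sketch: borrow the string when it is already pure ASCII, and only build a cleaned copy otherwise. The function name to_ascii_lossy and the replacement parameter are assumptions for illustration, and the caller is expected to pass an ASCII replacement such as '?':

```rust
use std::borrow::Cow;

/// Returns the input unchanged (borrowed) if it is pure ASCII; otherwise
/// returns an owned copy with every non-ASCII char swapped for `replacement`.
fn to_ascii_lossy(s: &str, replacement: char) -> Cow<'_, str> {
    if s.is_ascii() {
        Cow::Borrowed(s)
    } else {
        Cow::Owned(
            s.chars()
                .map(|c| if c.is_ascii() { c } else { replacement })
                .collect(),
        )
    }
}
```

Once the result is all ASCII, every byte is a whole char, so as_bytes()[i] as char is a constant-time lookup. Keep in mind that offsets into the cleaned string no longer match byte offsets into the original whenever a multi-byte char was replaced.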

Another alternative is that your string is really constrained to 7-bit ASCII, in which case you can use the code on pp 528-530 of the new book Programming Rust: Fast, Safe Systems Development to implement an efficient ASCII string type. Quoting from that book:

Here’s the definition of Ascii, a string type that ensures its contents are always valid ASCII. This type uses an unsafe feature to provide zero-cost conversion into String:

The signature of the checked constructor for that type is
from_bytes(bytes: Vec<u8>) -> Result<Ascii, NotAsciiError>
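The book's listing isn't reproduced here, so the following is only a rough sketch of what a type with that signature could look like, not the book's code. The names Ascii, NotAsciiError and from_bytes come from the signature above; the method bodies and the char_at helper are my own assumptions:

```rust
/// A byte string that is guaranteed to hold only ASCII (bytes below 0x80).
#[derive(Debug, PartialEq, Eq)]
pub struct Ascii(Vec<u8>);

/// Returned when the input contains a non-ASCII byte; hands the bytes back.
#[derive(Debug, PartialEq, Eq)]
pub struct NotAsciiError(pub Vec<u8>);

impl Ascii {
    /// Checks every byte once, up front.
    pub fn from_bytes(bytes: Vec<u8>) -> Result<Ascii, NotAsciiError> {
        if bytes.is_ascii() {
            Ok(Ascii(bytes))
        } else {
            Err(NotAsciiError(bytes))
        }
    }

    /// Hypothetical helper: constant-time lookup, since in ASCII each
    /// byte is exactly one char.
    pub fn char_at(&self, i: usize) -> Option<char> {
        self.0.get(i).map(|&b| b as char)
    }
}

impl From<Ascii> for String {
    fn from(ascii: Ascii) -> String {
        // SAFETY: from_bytes only admits ASCII bytes, and any sequence
        // of ASCII bytes is valid UTF-8.
        unsafe { String::from_utf8_unchecked(ascii.0) }
    }
}
```

The point of the design is that validation happens once in from_bytes, so the later conversion into String can skip the UTF-8 check.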