Turkish letter problem

I worked around this error, but it still seems odd to me. The problem is with insert and remove (I didn't try other functions). If you insert 'ş' at the start, there is no error. You can also insert 'ş' at index 1 and then insert anything at index 0 without an error. But if you then try to insert a char at index 2, you get an error about 'ş'.

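fn main() {
    let mut a = String::from("Hello World!");
    println!("{}", a);
    // 'ş' is encoded as two bytes, so it occupies byte indices 1 and 2.
    a.insert(1, 'ş');
    println!("{}", a);

    // Panics: byte index 2 is not a char boundary (it is inside 'ş').
    a.insert(2, 'b');
    println!("{}", a);
}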


String::insert takes a byte index, not a character index, and byte index 2 falls in the middle of the ş character.


Strings and strs in Rust are UTF-8 encoded, but indexing is based on byte position. Otherwise indexing would have to scan the entire prefix to count code points, and a count of code points isn't useful as far as the human interpretation of a String goes: multiple code points can combine into what a human would consider a single glyph, like an emoji or some accented letters.
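To make the byte versus code point distinction concrete, here is a minimal sketch. The helper name `byte_and_char_len` is made up for illustration; `str::len` and `str::chars` are the standard library calls doing the work.

```rust
/// Returns (length in bytes, number of code points) for a string slice.
/// Hypothetical helper, not part of the standard library.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    // `len` is O(1) and counts bytes; `chars().count()` has to scan the string.
    (s.len(), s.chars().count())
}

fn main() {
    // 'ş' (U+015F) is one code point encoded as two bytes.
    println!("{:?}", byte_and_char_len("ş")); // (2, 1)

    // 'e' followed by U+0301 COMBINING ACUTE ACCENT is displayed as one
    // accented letter, but it is two code points (1 byte + 2 bytes).
    println!("{:?}", byte_and_char_len("e\u{301}")); // (3, 2)
}
```

Neither number matches what a reader sees for the second string (one letter), which is why counting code points alone doesn't solve the problem.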

If you try to index into the middle of the encoding of a single code point, for example if your insert would break up the encoding of a single code point, you'll get a panic. The letter you inserted at index 1 takes two bytes to encode. After that the String contains:

 One letter, multiple bytes
     /       \
| H |    ş    |  e |  l |  l |  o |   | W |  o |  r |  l |  d | ! |
[ 72, 197, 159, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]
   0    1    2    3  ...

So now attempting to insert at index 2 would panic because that breaks up the encoding of ş (which would result in invalid UTF-8).
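You can check this yourself with `as_bytes` and `str::is_char_boundary`, both from the standard library. The `boundaries` helper below is a hypothetical name used only for this sketch.

```rust
/// Lists every byte index of `s` where an insert would not panic.
/// Hypothetical helper for illustration.
fn boundaries(s: &str) -> Vec<usize> {
    // 0 and s.len() are always boundaries; indices inside a multi-byte
    // code point are not.
    (0..=s.len()).filter(|&i| s.is_char_boundary(i)).collect()
}

fn main() {
    let mut a = String::from("Hello World!");
    a.insert(1, 'ş');

    // The first four bytes match the diagram above.
    println!("{:?}", &a.as_bytes()[..4]); // [72, 197, 159, 101]

    // Index 2 is inside 'ş', so it is not a valid insertion point.
    println!("{}", a.is_char_boundary(2)); // false
    println!("{}", a.is_char_boundary(3)); // true

    // Checking first avoids the panic.
    if a.is_char_boundary(2) {
        a.insert(2, 'b');
    } else {
        println!("index 2 is not a char boundary, skipping insert");
    }

    println!("{:?}", boundaries("Hşe")); // [0, 1, 3, 4]
}
```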


Trying to manipulate Unicode text[1] to do things like slicing, dicing, inserting, and so on is complicated. To do it properly you need support beyond that which is built into the language and standard library, e.g. this crate.

Respecting code point boundaries can be done without such a dependency, and will let you avoid panics. But you'll still produce garbage in many cases (e.g. inserting into the middle of a grapheme cluster).
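As a rough sketch of the dependency-free approach, the function below inserts by code point position instead of byte position. `insert_at_char` is a made-up name, not a standard library method; it only relies on `char_indices` and `String::insert`.

```rust
/// Inserts `ch` before the `char_idx`-th code point of `s`.
/// Returns false (and leaves `s` unchanged) if `char_idx` is past the end.
/// Hypothetical helper for illustration.
fn insert_at_char(s: &mut String, char_idx: usize, ch: char) -> bool {
    // Byte offsets of every code point, followed by the end of the string
    // so that inserting at the very end also works.
    let byte_idx = s
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()))
        .nth(char_idx);

    match byte_idx {
        Some(i) => {
            s.insert(i, ch);
            true
        }
        None => false,
    }
}

fn main() {
    let mut a = String::from("Hello World!");
    insert_at_char(&mut a, 1, 'ş');
    insert_at_char(&mut a, 2, 'b'); // no panic this time
    println!("{}", a); // Hşbello World!

    // Still garbage at the grapheme level: this puts 'x' between 'e' and
    // its combining accent, so the accent now attaches to 'x'.
    let mut b = String::from("e\u{301}");
    insert_at_char(&mut b, 1, 'x');
    println!("{:?}", b.chars().collect::<Vec<_>>()); // ['e', 'x', '\u{301}']
}
```

Note that this is O(n) per insert because it has to walk the string to find the byte offset, which is the cost that byte-based indexing avoids.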

Unicode is a large topic but hopefully this gives you a place to start.


  1. That is, text that isn't severely restricted, e.g. to ASCII.
