Do there exist Unicode strings where len() in Python and len() in Rust are different?

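To answer the question in the title: absolutely yes. Python 3 (from what I recall) indexes and measures strings by code point, while Rust indexes and measures them by byte of the UTF-8 encoding. So len() gives different answers for any string that contains a non-ASCII code point. The two use incompatible representations of strings, so you can't directly use indices from one in the other.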
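For a rough illustration, here is a minimal sketch of the two notions of length on the Rust side. The helper names `byte_len` and `code_point_len` are made up for this example; the second one should agree with Python 3's len() for ordinary text (assuming the Python string has no lone surrogates, which Rust strings can't hold).

```rust
/// Length in bytes of the UTF-8 encoding: this is what `str::len` returns.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Number of Unicode scalar values (code points) in the string.
fn code_point_len(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "héllo", written with an escape so the precomposed U+00E9 is explicit.
    let s = "h\u{e9}llo";
    println!("bytes: {}, code points: {}", byte_len(s), code_point_len(s));
}
```

U+00E9 takes two bytes in UTF-8, so this string is 6 bytes long but only 5 code points long.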

To translate a Python (code point) index into a Rust (byte) index, you'd need to walk over the text in Rust using str::char_indices to determine where in the string each code point begins.
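Something along these lines should work as a sketch of that translation. The function name `code_point_to_byte_index` is hypothetical, and I'm assuming you also want the one-past-the-end index to be valid (as it is for slicing in both languages).

```rust
/// Translate a code point index (as Python would count it) into a byte
/// index usable for slicing a Rust `&str`. Returns `None` if the index is
/// past the end of the string.
fn code_point_to_byte_index(s: &str, cp_index: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        // Also accept the index one past the last code point.
        .chain(std::iter::once(s.len()))
        .nth(cp_index)
}

fn main() {
    let s = "h\u{e9}llo"; // "héllo"
    if let Some(i) = code_point_to_byte_index(s, 2) {
        // Slicing at `i` is safe because it lies on a code point boundary.
        println!("code point 2 starts at byte {}: {:?}", i, &s[i..]);
    }
}
```

Note that this is a linear scan from the start of the string every time, so if you need to translate many indices it's probably better to walk char_indices once and translate them all in one pass.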

Also, "character" is a largely meaningless word. There are too many different, incompatible things it can mean to different people, or in different environments. It helps to be more specific (if only for your own sake). Depending on context, "char" can refer to bytes, code units, code points, grapheme clusters, glyphs, or possibly something else.

That's not even getting into what counts as "visually looks like one char". In theory, something that looks like a single char can take any number of bytes (one, two, twenty, more), and the same text might look like one char in one program but like two chars in a different program on the same machine.
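To make the different units concrete, here is a small sketch that counts three of them for the same text using only the standard library. `unit_counts` is a made-up helper. Counting grapheme clusters isn't something std does; as far as I know you'd need an external crate such as unicode-segmentation for that, so it's left out here.

```rust
/// Returns (UTF-8 bytes, UTF-16 code units, code points) for `s`.
fn unit_counts(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: usually drawn as one "é".
    let accented = "e\u{301}";
    // Man, ZERO WIDTH JOINER, woman, ZERO WIDTH JOINER, girl: drawn as one
    // family emoji where the font supports it, as three emoji where it doesn't.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

    for s in [accented, family] {
        let (bytes, utf16, code_points) = unit_counts(s);
        println!("{s}: {bytes} bytes, {utf16} UTF-16 units, {code_points} code points");
    }
}
```

The family emoji is the kind of case I mean: 18 bytes, 8 UTF-16 code units, 5 code points, and either one or three visible things depending on what is rendering it.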

Any time you think something involving text is simple, it probably isn't. It's probably a screaming nightmare of complexity and edge cases that only gets worse over time. :stuck_out_tongue:
