I understand that char != byte, and that a char can be 1-4 bytes in UTF-8 depending on what it is encoding.
I am a bit confused on what counts as a char. Sometimes, what “visually looks like one char” ends up being two chars (base + accent).
I am in a situation where I am dealing with Unicode strings (and indexes generated in Python, read back in Rust). Most of the time my code finds the right word boundaries (so it is probably not off by one); however, sometimes I get garbage.
My question: how well defined is the concept of a Unicode char/string? Is it possible to have a Unicode string/symbol for which Python and Rust decide it takes different numbers of chars?
To answer the question in the title: absolutely yes. Python 3 (from what I recall) indexes based on code points, Rust indexes on bytes. The two use incompatible representations of strings, so you can’t directly use indices from one in the other.
To translate a Python (code point) index into a Rust (byte) index, you'd need to walk over the text in Rust using str::char_indices to determine where in the string each code point begins.
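As a minimal sketch of that walk (the function name `codepoint_to_byte_index` is hypothetical, not a std API), `char_indices` yields each code point together with the byte offset at which it starts:

```rust
/// Map a code point (char) index to the corresponding UTF-8 byte index.
/// Returns the string's byte length when `cp_index` points one past the
/// last char, mirroring how Python slice endpoints work.
fn codepoint_to_byte_index(s: &str, cp_index: usize) -> Option<usize> {
    if cp_index == s.chars().count() {
        return Some(s.len());
    }
    s.char_indices().nth(cp_index).map(|(byte_idx, _)| byte_idx)
}

fn main() {
    let s = "a\u{0300}bc"; // 'a' + combining grave accent + "bc"
    assert_eq!(codepoint_to_byte_index(s, 0), Some(0)); // 'a' is 1 byte
    assert_eq!(codepoint_to_byte_index(s, 1), Some(1)); // U+0300 starts at byte 1
    assert_eq!(codepoint_to_byte_index(s, 2), Some(3)); // U+0300 takes 2 bytes in UTF-8
    assert_eq!(codepoint_to_byte_index(s, 9), None);    // out of range
}
```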
Also, “character” is a largely meaningless word. There are too many different, incompatible things it can mean to different people, or in different environments. It helps to be more specific (if only for your own sake). Depending on context, “char” can refer to bytes, code units, code points, grapheme clusters, glyphs, or possibly something else.
That's not even getting into what counts as "one visual character". In theory, it can be any number of bytes (one, two, twenty, more), and it might look like one character in one program but like two characters in a different program on the same machine.
Any time you think something involving text is simple, it probably isn’t. It’s probably a screaming nightmare of complexity and edge cases that only gets worse over time.
That deserves to be the TWiR quote of the week.
On this forum, char is well defined as documented: Rust's char type. Much like bare numbers are by default read as decimal, any other meaning should be written with context that (tries to) avoid ambiguity.
When in doubt, look at examples:
```
>>> len(b'☃')
3
>>> len(u'☃')
1
>>> u'☃'.encode('utf-8')
'\xe2\x98\x83'
```
In this example, the length of a byte string is the number of bytes in it. The length of a Unicode string, however, appears to be the number of codepoints. One other possible alternative is that the length of a Unicode string is the number of visual characters, approximated by Unicode's grapheme clusters. We can test that with an example too:
```
>>> len(u'a\u0300')
2
>>> print(u'a\u0300')
à
```
In this case, à is a single visual character that Unicode specifies as a grapheme cluster made up of two codepoints. Despite it being one visual character, len(u"à") still reports the length of the string as 2, so we can reasonably conclude that len(unicode-string) in Python returns the number of codepoints, which is neither the string's size in memory nor its number of visual characters. Effectively, for text processing, asking for the length of a string doesn't have a lot of value without context. There are valid reasons for wanting all three of the interpretations outlined here!
In Rust, calling x.len() always yields the number of bytes in the string, whether it's a String or a &str. In order to get the number of codepoints, you need to explicitly count them, e.g., x.chars().count(). In order to get the number of visual characters (approximated by grapheme clusters), you also need to explicitly count them, e.g., x.graphemes(true).count() (using the unicode-segmentation crate).
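A small sketch of the first two counts on the same à example (the grapheme count is shown only as a comment, since it needs the external unicode-segmentation crate):

```rust
fn main() {
    let s = "a\u{0300}"; // 'a' followed by a combining grave accent; renders as à
    assert_eq!(s.len(), 3);           // bytes: 'a' is 1 byte, U+0300 is 2 bytes in UTF-8
    assert_eq!(s.chars().count(), 2); // code points (Unicode scalar values)
    // Grapheme clusters require the external `unicode-segmentation` crate:
    // use unicode_segmentation::UnicodeSegmentation;
    // s.graphemes(true).count() would report 1: one visual character.
}
```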
Not all string types are created equal. I’m not intimately familiar with Python’s internal representation of its string type, but it’s entirely possible that for Python, counting the number of codepoints is a constant time operation if its in-memory representation is itself a sequence of codepoints. (In that case, counting the number of bytes used up by the string when encoded in UTF-8 is no longer a constant time operation.) In Rust, strings are always represented as UTF-8 encoded bytes in memory, so counting the bytes is always cheap. Counting the bytes is important for things like, “how much space do I need to represent this string in memory,” which can be important for performance reasons (e.g., preallocating space).
The datatype called char in Rust is always a 32-bit integer, and its only legal inhabitants are the set of Unicode scalar values, that is, the inclusive range 0-0x10FFFF excluding the surrogate codepoint range. A single visual character may be made up of more than one char value. There is no reliable correspondence you can assume here other than by following Unicode's segmentation algorithms.
If you have indexes generated in Python, then you probably need to make sure that those indexes get mapped to byte offsets. If you have byte offsets, then they can be efficiently used on the Rust side. If you’re working with Unicode strings in Python, then you probably have codepoint offsets. You have two choices:
- Map the codepoint offsets to byte offsets before handing them off to Rust. You can do this by taking a pass through your string and converting codepoint offsets to byte offsets as you go.
- Hand codepoint offsets to Rust and convert them there, e.g., by walking the string with str::char_indices.
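If you have many offsets, the conversion can be done in a single pass over the string rather than one walk per offset. A sketch, assuming the offsets are sorted ascending (the helper name `codepoints_to_bytes` is hypothetical):

```rust
/// Convert sorted code-point offsets to UTF-8 byte offsets in one pass.
/// Offsets past the end of the string come back as `None`; an offset equal
/// to the char count maps to the end of the string, as in Python slicing.
fn codepoints_to_bytes(s: &str, offsets: &[usize]) -> Vec<Option<usize>> {
    let mut result = vec![None; offsets.len()];
    let mut next = 0; // index of the next offset to resolve
    for (cp_idx, (byte_idx, _)) in s.char_indices().enumerate() {
        while next < offsets.len() && offsets[next] == cp_idx {
            result[next] = Some(byte_idx);
            next += 1;
        }
    }
    let cp_len = s.chars().count();
    while next < offsets.len() && offsets[next] == cp_len {
        result[next] = Some(s.len());
        next += 1;
    }
    result
}

fn main() {
    let s = "a\u{0300}bc"; // byte layout: a=0, U+0300=1..3, b=3, c=4
    assert_eq!(
        codepoints_to_bytes(s, &[0, 2, 4]),
        vec![Some(0), Some(3), Some(5)]
    );
}
```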
The choice you make is likely influenced by what you want your API to look like and what your performance concerns are. Not all of the choices above are equivalent from a performance perspective.
If you're using semantically identical operations with correct implementations, then they should be equivalent from a Unicode perspective. But as shown above, len on Python Unicode strings is not the same as len on a Rust &str. They are semantically distinct operations.
Thanks for the detailed responses. I have decided not to use the Python Unicode indices.
Do there exist Unicode strings where len() in Python and len() in Rust are different?
It's worse than that… There are Unicode strings for which len() in Python and len() in Python are different. (Yes, I wrote Python twice.) And I am not talking about different versions!
Internally, Python uses UCS-2 or UCS-4 to represent strings, and there is a compilation flag to decide which one you prefer. Ubuntu ships with UCS-4, but macOS ships with UCS-2.
So if I type len(u"🤣"), then depending on the laptop I use, I get 1 or 2.
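The two answers correspond to counting scalar values versus counting UTF-16 code units: U+1F923 lies outside the Basic Multilingual Plane, so in UTF-16 it becomes a surrogate pair. Rust can show all three counts for the same string:

```rust
fn main() {
    let s = "🤣"; // U+1F923, outside the Basic Multilingual Plane
    assert_eq!(s.chars().count(), 1);        // one Unicode scalar value ("wide" len)
    assert_eq!(s.encode_utf16().count(), 2); // two UTF-16 code units ("narrow" len)
    assert_eq!(s.len(), 4);                  // four UTF-8 bytes (Rust's str::len)
}
```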
@fulmicoton: Funnily enough, I think the issue you brought up explains the Python/Rust index mismatch I was running into.
I believe https://pypi.org/project/codepoints/ is the same issue you are describing. Calculating the “off by” error, it looks like the Python indices are in UTF-16 mode, while Rust is doing “wide unicode.”
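If the incoming indices really are UTF-16 code-unit offsets, they can be mapped to Rust byte offsets the same way as codepoint offsets, just advancing by each char's UTF-16 length instead of by one. A sketch (the helper name `utf16_to_byte_index` is hypothetical):

```rust
/// Convert a UTF-16 code-unit offset (as produced by a "narrow" Python
/// build, or by JavaScript) into a UTF-8 byte offset. An offset that lands
/// in the middle of a surrogate pair returns `None`.
fn utf16_to_byte_index(s: &str, utf16_index: usize) -> Option<usize> {
    let mut units = 0;
    for (byte_idx, ch) in s.char_indices() {
        if units == utf16_index {
            return Some(byte_idx);
        }
        units += ch.len_utf16();
    }
    if units == utf16_index { Some(s.len()) } else { None }
}

fn main() {
    let s = "x🤣y";
    assert_eq!(utf16_to_byte_index(s, 0), Some(0)); // 'x'
    assert_eq!(utf16_to_byte_index(s, 1), Some(1)); // '🤣' starts at byte 1
    assert_eq!(utf16_to_byte_index(s, 3), Some(5)); // '🤣' is 2 units / 4 bytes
    assert_eq!(utf16_to_byte_index(s, 2), None);    // middle of the surrogate pair
}
```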
I think you’ve already gotten lots of good advice, but let me just touch on this point:
The encoding of Unicode code points into, say, UTF-8 is well-defined and will match exactly between the two languages.
If you use s.encode('utf8') in Python, then the byte string you get back will behave very similarly to a Rust string: its length will be in bytes, and you can slice it to get constant-time access to parts of it.
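The Rust side of that parallel, as a small sketch; one caveat is that Rust checks slice boundaries at runtime while slicing a Python bytes object does not:

```rust
fn main() {
    let s = "né"; // 'n' + U+00E9, which is 2 bytes in UTF-8
    assert_eq!(s.len(), 3);            // byte length, like len(s.encode('utf8')) in Python
    assert_eq!(&s[0..1], "n");         // constant-time byte-range slicing
    assert_eq!(s.as_bytes()[1], 0xC3); // first byte of the UTF-8 encoding of é
    // Unlike Python bytes slicing, &s[0..2] would panic here: byte index 2
    // falls in the middle of é, which is not a char boundary.
}
```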