Do there exist Unicode strings where len() in Python and len() in Rust are different?


#1
  1. I understand that char != byte, that a char can be 1-4 bytes depending on what it is trying to encode.

  2. I am a bit confused on what counts as a char. Sometimes, what “visually looks like one char” ends up being two chars (base + accent).

  3. I am in a situation where I am dealing with Unicode strings (with indexes generated by Python and read back in Rust). Most of the time my code gets the right word boundaries (so it is probably not an off-by-one error); however, sometimes I get garbage.

  4. My question: how well defined is the concept of a Unicode char/string? Is it possible to have a Unicode string/symbol where Python and Rust disagree on how many chars it takes?


#2

To answer the question in the title: absolutely yes. Python 3 (from what I recall) indexes based on code points, while Rust indexes on bytes. The two use incompatible representations of strings, so you can’t directly use indices from one in the other.

To translate a Python (code point) index into a Rust (byte) index, you’d need to walk over the text in Rust using str::char_indices to determine where in the string each code point begins.
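A minimal sketch of that walk, assuming the Python side really does hand over code point indices. The function name `byte_offset` is made up for illustration; only `str::char_indices` and `str::len` come from the standard library.

```rust
/// Convert a code point index (as produced by indexing a Python 3 `str`)
/// into a byte offset that can be used to slice a Rust `&str`.
/// Returns `None` if the index is past the end of the string.
fn byte_offset(s: &str, cp_index: usize) -> Option<usize> {
    // `char_indices` yields (byte_offset, char) for every code point.
    // Chaining `s.len()` lets an index equal to the number of code points
    // map to the end of the string, which is a valid slice bound.
    s.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(s.len()))
        .nth(cp_index)
}
```

For the string `"a☃b"`, code point index 2 (the `b`) maps to byte offset 4, because the snowman takes three bytes in UTF-8. Note that each call walks the string from the start, so this is linear in the index.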

Also, “character” is a largely meaningless word. There are too many different, incompatible things it can mean to different people, or in different environments. It helps to be more specific (if only for your own sake). Depending on context, “char” can refer to bytes, code units, code points, grapheme clusters, glyphs, or possibly something else.

That’s not even getting into what counts as something that “visually looks like one char”. In theory, it can be any number of bytes (one, two, twenty, more), and might “visually look like one char” in one program, but “visually look like two chars” in a different program on the same machine.

Any time you think something involving text is simple, it probably isn’t. It’s probably a screaming nightmare of complexity and edge cases that only gets worse over time. :stuck_out_tongue:


#3

IMO char on this forum is well defined as documented, much like written numbers are decimal by default. For any other meaning, it should be written with context that (tries to) avoid ambiguity.


#4

When in doubt, look at examples:

>>> len(b'☃')
3
>>> len(u'☃')
1
>>> u'☃'.encode('utf-8')
'\xe2\x98\x83'

In this example, the length of a byte string is the number of bytes in it. The length of a Unicode string, however, appears to be the number of codepoints. One possible alternative is that the length of a Unicode string is the number of visual characters, approximated by Unicode’s grapheme clusters. We can test that with an example too:

>>> len(u'a\u0300')
2
>>> print(u'a\u0300')
à

In this case, à is a single visual character that Unicode specifies as a grapheme cluster made up of two codepoints. Despite it being one visual character, len(u"à") still reports the length of the string as 2, so we can reasonably conclude that len(unicode-string) in Python returns the number of codepoints, which is neither the size of the string in memory nor the number of visual characters. Effectively, for text processing, asking for the length of a string doesn’t have a lot of value without any context. There are valid reasons for wanting all 3 of the interpretations outlined here!

In Rust, calling x.len() always yields the number of bytes in the string, whether it’s a &[u8] or a &str. In order to get the number of codepoints, you need to explicitly count them, e.g., x.chars().count(). In order to get the number of visual characters (approximated by grapheme clusters), you need to explicitly count them too, e.g., x.graphemes(true).count() (using the unicode-segmentation crate).
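As a small illustration of the first two counts, here is a sketch using only the standard library. The helper name `lengths` is hypothetical, and the grapheme count is left out because it needs the external unicode-segmentation crate.

```rust
/// Return (number of UTF-8 bytes, number of code points) for a string.
fn lengths(s: &str) -> (usize, usize) {
    // `len()` is the size in bytes and is O(1).
    let bytes = s.len();
    // Counting code points has to decode the whole string, so it is O(n).
    let code_points = s.chars().count();
    (bytes, code_points)
}
```

With this, `lengths("☃")` gives `(3, 1)` and `lengths("a\u{300}")` gives `(3, 2)`: the second value matches what Python's len() reports for the same Unicode strings, the first matches Rust's len().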

Not all string types are created equal. I’m not intimately familiar with Python’s internal representation of its string type, but it’s entirely possible that for Python, counting the number of codepoints is a constant time operation if its in-memory representation is itself a sequence of codepoints. (In that case, counting the number of bytes used up by the string when encoded in UTF-8 is no longer a constant time operation.) In Rust, strings are always represented as UTF-8 encoded bytes in memory, so counting the bytes is always cheap. Counting the bytes is important for things like, “how much space do I need to represent this string in memory,” which can be important for performance reasons (e.g., preallocating space).

The datatype called char in Rust is always a 32-bit integer, and its only legal inhabitants are the set of Unicode scalar values. That is, the inclusive range 0 to 0x10FFFF, excluding the surrogate codepoint range. A single visual character may be made up of more than one char value. There is no reliable correspondence you can assume here other than by following Unicode’s segmentation algorithms.
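These properties can be checked directly with the standard library. A short sketch follows; the two helper names are made up for illustration, while `char::from_u32` and `std::mem::size_of` are real std items.

```rust
/// Size of a `char` in bytes.
fn char_size_in_bytes() -> usize {
    std::mem::size_of::<char>()
}

/// True if `n` is a Unicode scalar value, i.e. a legal `char`.
fn is_scalar_value(n: u32) -> bool {
    // `char::from_u32` returns `None` for surrogates (0xD800..=0xDFFF)
    // and for anything above 0x10FFFF.
    char::from_u32(n).is_some()
}
```

So `char_size_in_bytes()` is 4, `is_scalar_value(0x10FFFF)` is true, and both `is_scalar_value(0xD800)` and `is_scalar_value(0x110000)` are false.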

If you have indexes generated in Python, then you probably need to make sure that those indexes get mapped to byte offsets. If you have byte offsets, then they can be efficiently used on the Rust side. If you’re working with Unicode strings in Python, then you probably have codepoint offsets. You have two choices:

  1. Map the codepoint offsets to byte offsets before handing them off to Rust. You can do this by taking a pass through your string and converting codepoint offsets to byte offsets as you go.
  2. Hand codepoint offsets to Rust and convert them there, using either char_indices or chars().nth(n).

The choice you make is likely influenced by how you want your API to look and what your performance concerns are. Not all choices above are equivalent from a performance perspective.
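To make the performance point concrete: calling chars().nth(n) once per offset rescans the string from the start every time, whereas many offsets can be converted in a single pass. The sketch below shows one way to do that on the Rust side. The name `byte_offsets` is hypothetical, and it assumes the codepoint offsets arrive sorted in ascending order; offsets past the end of the string are silently dropped, which is an arbitrary choice made for this example.

```rust
/// Convert ascending code point offsets into byte offsets in one pass.
/// Assumption: `cp_offsets` is sorted ascending (repeats are allowed).
/// Offsets beyond the end of the string are dropped.
fn byte_offsets(s: &str, cp_offsets: &[usize]) -> Vec<usize> {
    let mut out = Vec::with_capacity(cp_offsets.len());
    let mut wanted = cp_offsets.iter().copied().peekable();
    // Every code point start, plus the end of the string as a final bound.
    let bounds = s
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(s.len()));
    for (cp_idx, byte_idx) in bounds.enumerate() {
        // A loop rather than an `if`, so repeated offsets are all emitted.
        while wanted.peek() == Some(&cp_idx) {
            out.push(byte_idx);
            wanted.next();
        }
    }
    out
}
```

For `"a☃b🤣"`, the codepoint offsets `[0, 1, 2, 3, 4]` become the byte offsets `[0, 1, 4, 5, 9]`.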

If you’re using semantically identical operations with correct implementations, then they should be equivalent from a Unicode perspective. But as shown above, len on Python Unicode strings is not the same as len on a Rust &str. They are semantically distinct operations.


#5

@DanielKeep , @jonh , @BurntSushi :

Thanks for the detailed responses. I have decided to not use the python unicode indices. :slight_smile:


#6

Do there exist Unicode strings where len() in Python and len() in Rust are different?

It’s worse than this… There are Unicode strings for which len() in Python and len() in Python are different. (Yes, I wrote Python twice.)
And I am not talking about a different version!

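Internally, Python (Python 2, and Python 3 before 3.3) uses UCS-2 or UCS-4 to represent strings. There is a compilation flag to decide which one you prefer. Ubuntu ships with UCS-4 but macOS ships with UCS-2. Since Python 3.3, len() on a str counts code points regardless of the build.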

So if I type len(u"🤣") , depending on the laptop I use, I get 1 or 2.


#7

@fulmicoton : Funnily, I think the issue you brought up explains the Python/Rust index-mismatch I was running into.

I believe https://pypi.org/project/codepoints/ is the same issue you are describing. Calculating the “off by” error, it looks like the Python indices are in UTF-16 mode, while Rust is doing “wide unicode.” :slight_smile:
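If the indices really are counted in UTF-16 code units (an assumption based on the off-by amounts described here), they can be mapped to Rust byte offsets with `char::len_utf16`. A rough sketch, with a made-up function name:

```rust
/// Convert an offset counted in UTF-16 code units into a byte offset
/// into a UTF-8 `&str`. Returns `None` if the offset falls inside a
/// surrogate pair or past the end of the string.
fn byte_offset_from_utf16(s: &str, utf16_index: usize) -> Option<usize> {
    let mut units = 0; // UTF-16 code units consumed so far
    for (byte_idx, ch) in s.char_indices() {
        if units == utf16_index {
            return Some(byte_idx);
        }
        if units > utf16_index {
            // The requested offset pointed into the middle of a pair.
            return None;
        }
        // 1 for code points in the BMP, 2 for everything above U+FFFF.
        units += ch.len_utf16();
    }
    // An offset equal to the total number of units is the end of the string.
    if units == utf16_index {
        Some(s.len())
    } else {
        None
    }
}
```

For `"🤣a"`, UTF-16 offset 2 (the `a`) maps to byte offset 4, while offset 1 yields `None` because it lands between the two surrogates of the emoji.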


#8

I think you’ve already gotten lots of good advice, but let me just touch on this point:

The encoding of Unicode code points into, say, UTF-8 is well-defined and will match exactly between the two languages.

If you use s.encode('utf8') in Python, then the byte string you get back will behave very similarly to a Rust string: its length will be in bytes and you can slice it to get constant time access to parts of it.
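On the Rust side, byte offsets computed that way can be applied with `str::get`, which returns `None` instead of panicking when an offset does not fall on a code point boundary. A small sketch (the wrapper name `slice_bytes` is hypothetical):

```rust
/// Slice a string by byte offsets, returning `None` if either offset is
/// out of range or does not sit on a UTF-8 code point boundary.
fn slice_bytes(s: &str, start: usize, end: usize) -> Option<&str> {
    // Unlike `&s[start..end]`, `get` never panics on bad offsets.
    s.get(start..end)
}
```

For `"a☃b"`, `slice_bytes(s, 1, 4)` returns `Some("☃")`, while `slice_bytes(s, 1, 2)` returns `None` because byte 2 is in the middle of the snowman. Getting `None` here is a handy signal that the offsets were not byte offsets after all.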