Do there exist Unicode strings where Python's len() and Rust's len() give different results?

When in doubt, look at examples:

>>> len(b'☃')
3
>>> len(u'☃')
1
>>> u'☃'.encode('utf-8')
'\xe2\x98\x83'

In this example, the length of a byte string is the number of bytes in it. The length of a Unicode string, however, appears to be the number of codepoints. Another possible interpretation is that the length of a Unicode string is the number of visual characters, approximated by Unicode's grapheme clusters. We can test that with an example too:

>>> len(u'a\u0300')
2
>>> print(u'a\u0300')
à

In this case, à is a single visual character that Unicode specifies as a grapheme cluster made up of two codepoints. Despite it being one visual character, len(u'a\u0300') still reports the length of the string as 2, so we can reasonably conclude that len(unicode-string) in Python returns the number of codepoints, which is neither its size in memory nor its number of visual characters. Effectively, for text processing, asking for the length of a string doesn't have a lot of value without any context. There are valid reasons for wanting all three of the interpretations outlined here!

In Rust, calling x.len() always yields the number of bytes in the string, whether it's a &[u8] or a &str. To get the number of codepoints, you need to count them explicitly, e.g., x.chars().count(). To get the number of visual characters (approximated by grapheme clusters), you likewise need to count them explicitly, e.g., x.graphemes(true).count() (using the unicode-segmentation crate).
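For example, here's a quick sketch of all three counts in Rust, assuming the unicode-segmentation crate has been added as a dependency:

// Assumes a Cargo dependency on the unicode-segmentation crate.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let snowman = "☃";
    assert_eq!(snowman.len(), 3);           // bytes in the UTF-8 encoding
    assert_eq!(snowman.chars().count(), 1); // codepoints

    let accented = "a\u{0300}";             // 'a' followed by a combining grave accent
    assert_eq!(accented.len(), 3);                   // 1 byte for 'a' + 2 bytes for U+0300
    assert_eq!(accented.chars().count(), 2);         // two codepoints
    assert_eq!(accented.graphemes(true).count(), 1); // one grapheme cluster ("à")
}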

Not all string types are created equal. I'm not intimately familiar with Python's internal representation of its string type, but it's entirely possible that for Python, counting the number of codepoints is a constant-time operation if its in-memory representation is itself a sequence of codepoints. (In that case, counting the number of bytes used up by the string when encoded in UTF-8 is no longer a constant-time operation.) In Rust, strings are always represented as UTF-8 encoded bytes in memory, so counting the bytes is always cheap. Counting the bytes answers questions like "how much space do I need to represent this string in memory?", which can matter for performance reasons (e.g., preallocating space).
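As a small sketch of that last point, the cheap byte length makes exact preallocation easy (concat_all is a made-up helper, not a standard API):

fn concat_all(parts: &[&str]) -> String {
    // str::len() is the byte length of the UTF-8 encoding and is O(1),
    // so we can reserve exactly the space the result needs up front.
    let total_bytes: usize = parts.iter().map(|s| s.len()).sum();
    let mut out = String::with_capacity(total_bytes);
    for part in parts {
        out.push_str(part);
    }
    out
}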

The datatype called char in Rust is always a 32-bit integer, and its only legal inhabitants are the Unicode scalar values, i.e., the inclusive range 0-0x10FFFF excluding the surrogate codepoint range. A single visual character may be made up of more than one char value. There is no reliable correspondence you can assume here other than by following Unicode's segmentation algorithms.
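A short sketch of what that means in practice (the specific codepoints are just illustrative):

fn main() {
    // A char is 4 bytes and holds exactly one Unicode scalar value.
    assert_eq!(std::mem::size_of::<char>(), 4);

    // Surrogates and values past U+10FFFF are not valid chars.
    assert!(char::from_u32(0xD800).is_none());   // surrogate codepoint
    assert!(char::from_u32(0x110000).is_none()); // beyond U+10FFFF
    assert_eq!(char::from_u32(0x2603), Some('☃'));

    // One visual character may span multiple char values.
    let accented = "a\u{0300}";
    assert_eq!(accented.chars().collect::<Vec<char>>(), vec!['a', '\u{0300}']);
}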

If you have indexes generated in Python, then you probably need to make sure that those indexes get mapped to byte offsets. If you have byte offsets, then they can be efficiently used on the Rust side. If you're working with Unicode strings in Python, then you probably have codepoint offsets. You have two choices:

  1. Map the codepoint offsets to byte offsets before handing them off to Rust. You can do this by taking a pass through your string and converting codepoint offsets to byte offsets as you go.
  2. Hand codepoint offsets to Rust and convert them there, e.g., with char_indices or chars().nth(n) (see the sketch after this list).
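
Here's a sketch of option 2 on the Rust side, using char_indices (codepoint_to_byte_offset is a hypothetical helper name, not a standard API):

fn codepoint_to_byte_offset(s: &str, codepoint_offset: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte_offset, _)| byte_offset)
        .chain(std::iter::once(s.len())) // allow an offset one past the last codepoint
        .nth(codepoint_offset)
}

fn main() {
    let s = "a\u{0300}☃";
    assert_eq!(codepoint_to_byte_offset(s, 0), Some(0)); // 'a' starts at byte 0
    assert_eq!(codepoint_to_byte_offset(s, 1), Some(1)); // U+0300 starts at byte 1
    assert_eq!(codepoint_to_byte_offset(s, 2), Some(3)); // '☃' starts at byte 3
    assert_eq!(codepoint_to_byte_offset(s, 3), Some(6)); // one past the end
    assert_eq!(codepoint_to_byte_offset(s, 4), None);    // out of range
}

A single lookup like this scans the string from the start, so if you have many offsets into the same string, it's cheaper to convert them all in one pass, as in option 1.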

The choice you make is likely influenced by what you want your API to look like and what your performance concerns are. Not all of the choices above are equivalent from a performance perspective.

If you're using semantically identical operations with correct implementations, then they should be equivalent from a Unicode perspective. But as shown above, len on Python Unicode strings is not the same as len on a Rust &str. They are semantically distinct operations.
