Support beyond UTF-8?

Is there support for character sets beyond UTF-8? Flexibility is important, yet everything I've read to date seems to refer only to UTF-8, which is naturally a Western-centric default but can cause problems.

I don't know whether such issues have moved on since 2001, when this article was published: "The Secret Life of Unicode: A Peek at Unicode's Soft Underbelly".
Are there routes via C libs to somehow navigate any differences, or is this something planned for the future in Rust?

Rust's String type is defined to be UTF-8 and its char type is a Unicode scalar value. I don't foresee that changing.

There are external crates that support other encodings: https://crates.io/crates/encoding. I'm doubtful that std will support other encodings any time soon; it definitely seems like something that should bake outside of std.


It would be abject insanity to support multiple internal encodings directly in Rust.

Text is hard enough as it is without introducing multiple forms of text within Rust. Unicode isn't perfect, but it's the very best we've got.

As for using UTF-8 specifically, I've never seen a convincing argument for picking anything else. You can make arguments for other encodings for more specific purposes (Rust already has WTF-8 for Windows), but not for the core string type.

Incidentally, D supports three different string encodings at the language level. What happened in practice was that most people just ignored the two that weren't UTF-8, because they were too much trouble to bother with.


There's a slight misuse of terminology here that's important to clarify or else people will talk past each other: UTF-8 is just a way to encode the Unicode character set.

It is reasonably non-Western-centric, although some languages get compression benefits from, say, UTF-16, since their characters are encoded as 3 bytes in UTF-8 but 2 in UTF-16. In fact, the ideas behind UTF-8 can be used to encode any sequence whose elements are representable as 31-bit integers (the actual UTF-8 is restricted to 21 bits, and removes the surrogate subrange, but this isn't necessary). Given that Rust uses Unicode for the elements of strings, there's pretty much no better encoding for arbitrary ones than UTF-8.
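For example, those per-character sizes are easy to check with Rust's char methods (the specific characters are just illustrative):

```rust
fn main() {
    // '語' (U+8A9E) takes 3 bytes in UTF-8 but one 16-bit unit in UTF-16:
    assert_eq!('語'.len_utf8(), 3);
    assert_eq!('語'.len_utf16(), 1); // one u16 unit, i.e. 2 bytes

    // ASCII goes the other way: 1 byte in UTF-8, 2 bytes in UTF-16.
    assert_eq!('a'.len_utf8(), 1);
    assert_eq!('a'.len_utf16(), 1); // still one u16 unit, i.e. 2 bytes
}
```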

However, as the article @davidpbrown linked states, the Unicode standard itself has some problems, particularly around the handling of CJKV. I don't know of any current work to tackle that in Rust, say an implementation of TRON, although there are skilled Rust users from countries most affected by Unicode's inadequacies (particularly Korea, including the author of encoding).


Right, but the problem is that any new scheme that is not functionally equivalent to Unicode must necessarily include things that can't be expressed in Unicode... which means it will be fundamentally incompatible with all extant code that deals with strings. My experience with D is that trying to be clever about encoding text just doesn't work; even if there are real benefits to a particular encoding (like using UTF-16 on Windows), the ecosystem as a whole will just gravitate to a single representation for practical reasons.

Just to try and be clear: the vibe I'm getting from the original poster's question is that they want to be able to alter the basic string type to use something other than UTF-8. From my perspective, it's not worth talking about mere support for other encodings; obviously, yes, that can be done without issue.

As an aside: my understanding was that Han unification, at least, had been addressed by the introduction of regional variation markers. Arbitrary new characters haven't been, and I'm frankly terrified by the prospect of having to write software to deal with that. Plz stahp makin up new characters im beggin u gais!

As I understand it, D offers UTF-8, UTF-16 and UTF-32, all of which are just different ways to store the Unicode character set in memory. They don't have any semantic differences, and all represent the exact same thing. While I agree that the Rust ecosystem as a whole will almost certainly stick to using Unicode as the character set (and hence UTF-8 as the encoding), I'm just trying to ensure everyone is on the same page early in this discussion.

(My comment is directed at both you and the original poster.)

Yet that difference is perhaps very important, at least in the display and handling of strings.

Relative to the problem, I consider there's no difference between those; they are all Unicode. I mention UTF-8 because Rust appears to focus on that. Even expecting the user to manage double-byte chars would not be a big deal, but refusing to handle anything non-Unicode seems odd.

Even if the underlying system works with a preferred character set, it perhaps should be able to easily contend with alternates. The idea of HTML not being able to work with anything beyond Unicode would be considered a fundamental weakness.

How do languages from C to C# deal with this issue? I expect Rust could look to do the same. I was just surprised that Rust does not obviously acknowledge this issue up front. Also, I have non-Unicode HTML that I might parse just for the sake of learning Rust, and it seems that might not be trivial in the way I was expecting. At a minimum, I would expect the capability to accept as a string, store, and then display any major character set.

A quick search throws up this, from "Usage of character encodings for websites":

UTF-8 is used by 84.3% of all the websites whose character encoding we know.

Though that might in part reflect Unicode's dominance rather than a preference for it.

Ok, now I'm confused as to what you're talking about. Let's take this back:

  • The fundamental string type in Rust is UTF-8.

  • With the exception of a few methods that are restricted to ASCII, and some methods that directly operate on OsStrs (which are in a platform-specific encoding), all string-handling functions in Rust are defined in terms of UTF-8 and/or Unicode.

  • There is absolutely no impediment to transcoding between UTF-8 and any other character encoding in a library (see the encoding crate, and the sketch just below this list).
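By way of illustration, here's a minimal sketch of that boundary transcoding, assuming the encoding crate's 0.2-style API (real code would pick its own error handling rather than unwrap):

```rust
// A minimal sketch of boundary transcoding, assuming the `encoding`
// crate's 0.2-style API.
extern crate encoding;

use encoding::{Encoding, DecoderTrap, EncoderTrap};
use encoding::all::ISO_8859_1;

fn main() {
    // Latin-1 bytes for "café" arriving from the outside world...
    let input: &[u8] = &[0x63, 0x61, 0x66, 0xE9];

    // ...get decoded into Rust's native UTF-8 String at the boundary:
    let s = ISO_8859_1.decode(input, DecoderTrap::Strict).unwrap();
    assert_eq!(s, "café");

    // And encoded back out again on the way out:
    let bytes = ISO_8859_1.encode(&s, EncoderTrap::Strict).unwrap();
    assert_eq!(bytes, input);
}
```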

Now, for opinion:

  • Insofar as I know, no other language does anything useful here. C says more or less nothing about character encoding, which is why it's such a pain to do anything with text in C. Every C library that does deal with text that I've ever seen is either dealing with ASCII, Unicode, or a codepage that's compatible with Unicode.

  • C# is Unicode, end of story.

  • Unicode is what every major platform has standardised on, so far as I know. There is vanishingly little value in supporting anything else.

    If you wanted to support one of the Japanese-specific non-Unicode encodings that allows for arbitrary kanji, for example, support in Rust would not be sufficient. You would need a whole new string type, new string methods, new fonts, new layout, new rendering. You'd basically have to either build or get non-standard implementations of everything between the raw text data and the pixels on the screen.

    It should go without saying that such a string type would be incompatible with almost all extant Rust string-handling code in existence.

To put it succinctly: the amount of effort required to make it practically possible to use something other than Unicode so vastly dwarfs any possible benefit that I just don't see the point in even considering it. You'd be better off just... writing a new string type and using that, instead.
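To give a sense of scale, even the skeleton of such a type starts from nothing. This is a purely hypothetical sketch; LegacyString and its methods are illustrative names, not an existing API:

```rust
// Purely hypothetical sketch of a non-Unicode string type; every name
// here is illustrative, nothing comes from std or an existing crate.
pub struct LegacyString {
    bytes: Vec<u8>, // text in some hypothetical non-Unicode encoding
}

impl LegacyString {
    pub fn from_bytes(bytes: Vec<u8>) -> Self {
        LegacyString { bytes }
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }

    // Everything else -- character counting, slicing, comparison, case
    // mapping, display -- would have to be reimplemented against the
    // encoding's own rules; none of std's str machinery applies, and no
    // existing Rust code would accept this type where it expects &str.
}
```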

The only feature I can think of that would be useful is custom literal support, which would already be handy for things like UTF-16 string literals for Win32 code, or bignums.


To be clear, that is exactly what my two comments are referring to, e.g.:

UTF-8 is just the best way to encode Unicode. If you wish to focus on character sets rather than on specific encodings of them, the term "Unicode" is much clearer. (All of my comments are just trying to clarify that difference, not to argue any particular side.)

Indeed, in the past 14 years Unicode has evolved considerably, and at this point many of the issues mentioned in that document are no longer relevant, would be even worse if you had to support other encodings or character sets as well, or are simply essential complexities of the problem of encoding multilingual text.

I'll address the issues in that document in the order presented.

Most of the East Asian issues, such as Han unification and missing characters, have been dealt with by the introduction of more characters (including in the Supplementary Ideographic Plane), the introduction of variation selectors allowing you to disambiguate forms, and better browser and OS font-selection support. In 2001, when the article was written, support beyond the BMP was quite limited; nowadays, most major operating systems, UI frameworks, browsers, and so on support it reasonably well. Support in third-party applications can still be somewhat mixed, but that's the case for any form of internationalization and localization.
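Rust itself treats characters beyond the BMP uniformly; a quick check, using one supplementary-plane character as an example:

```rust
fn main() {
    // '𠮷' (U+20BB7) lives outside the BMP, in a supplementary plane.
    let c = '𠮷';
    assert_eq!(c.len_utf8(), 4);  // four bytes in UTF-8
    assert_eq!(c.len_utf16(), 2); // a surrogate pair in UTF-16

    // To Rust it is still a single char (one Unicode scalar value):
    assert_eq!("𠮷".chars().count(), 1);
}
```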

The ordering issues mentioned are not particularly interesting. Across the entirety of Unicode, there's no way that ordering could work based on code point ordering alone; if you want to do collation correctly, you need to have support for specialized tables. The Unicode Collation Algorithm specifies this, and language-specific tailorings are available in the CLDR.
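A small illustration of why code point order alone can't give correct collation; the negative example needs nothing outside std:

```rust
// Raw code point order is not linguistic order: Rust's built-in string
// comparison is plain lexicographic comparison, not collation.
fn main() {
    // 'é' (U+00E9) sorts after 'z' (U+007A) by code point...
    assert!("zebra" < "éclair");
    // ...but French dictionary order puts "éclair" before "zebra".
    // Getting that right requires collation tables (the UCA plus CLDR
    // tailorings), which live in external crates, not in std.
}
```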

None of the competitor encodings mentioned have gained traction. The industry has moved on, Unicode is here to stay, and extending and enhancing Unicode to support the use-cases it was weak at before is more useful than introducing new, entirely incompatible character sets and encodings.

As mentioned in the document, bidirectional text handling and its interaction with UIs is really an essential complexity, not a problem with Unicode. It's also a UI and rendering issue, not an encoding issue; the encoding side is perfectly well specified in Unicode: text is stored in logical order in memory, and bidirectional handling occurs at the display and UI level.

I have no idea what the "dancing positional characters" complaint is about; if you are writing in a script that uses complex shaping, then yes, things will move around as you type. The only alternative I can think of is for text not to show up at all until you have typed a whole word, which would probably be even more confusing. Again, this is a UI issue, not a Unicode issue.

The zero-width character issue is also a UI issue.

Bidi is mentioned again; as before, it's a rendering issue.

The article complains about compatibility with existing standards making Unicode complex. This is true, it does add to the complexity; however, it's what allows Unicode to be a universal encoding, suitable as the one internal encoding for software. Since there are round-trip mappings between it and all of the important legacy character sets it replaces, Unicode can be used internally, and conversion in and out can happen at the boundaries. This is exactly the approach that Rust takes; it has one internal character set and encoding, and if you want to work with text in other encodings you map as text comes in and out.

Inconsistencies are a consequence of this compatibility. Yes, they make things more complex, but they are necessary for being able to use a single universal encoding and convert to and from legacy encodings losslessly.

Positional forms: I'm not sure of the history of the differences, but I'm pretty willing to bet they're based on this legacy compatibility; that compatibility is more important than the consistency that would be gained by picking one convention for the entirety of Unicode.

Inconsistency in subjoined letters: same answer as the previous point.

Logical vs. visual ordering was already addressed (this document seems to be repeating itself).

ASCII being its own block, mixing control characters, alphabetic characters, and punctuation: this is one of the most important compatibility considerations, and it's what allows UTF-8 to work with existing APIs. A very large number of legacy character sets were likewise ASCII-compatible, so this is no different from the status quo in the vast majority of other character sets that could be chosen.

Equivalency confusion. This is a real problem, but also, I think, a matter of essential complexity. There is no way I know of to solve this that isn't going to cause some other problem; for instance, you could do much more aggressive unification, but then you run into all of the problems that aggressive unification causes, as was complained about in the Han unification sections.

Precomposed vs. decomposed. Also a real problem, but I think necessary for compatibility with legacy character sets.
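To make that concrete, here's a minimal sketch assuming the external unicode-normalization crate (not std); the two forms only compare equal after normalizing:

```rust
// A minimal illustration of precomposed vs. decomposed forms, assuming
// the external `unicode-normalization` crate.
extern crate unicode_normalization;

use unicode_normalization::UnicodeNormalization;

fn main() {
    let precomposed = "\u{00E9}"; // é as a single code point
    let decomposed = "e\u{0301}"; // 'e' plus a combining acute accent

    // As raw strings they are different byte sequences...
    assert!(precomposed != decomposed);

    // ...but normalizing both to NFC makes them compare equal:
    let a: String = precomposed.nfc().collect();
    let b: String = decomposed.nfc().collect();
    assert_eq!(a, b);
}
```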

And finally, Unicode is not internationalization. Very true, but not a problem with Unicode itself. Unicode is just a foundational building block of internationalization, which makes the rest of the problems vastly easier by providing a single, common, unified standard to build upon, rather than additionally piling on the complexity of having to support multiple different, incompatible character sets.

Some of these arguments are legitimate complaints about complexity, but most of that complexity exists for a good reason, and supporting more encodings natively would increase complexity dramatically, not decrease it. Some of the arguments are just UI issues. Some have been dealt with. Some may be legitimate criticism, but I don't know of any other encodings that address those criticisms; and even if there were, Unicode text is the majority of text you will encounter, so you would still have to support it.

C has no standardized support for any particular character set. You have char, which is defined to be at least 8 bits, and wchar_t, which is intended to be large enough to hold any one code point, but in practice is 16 bits on some platforms (and thus not large enough to hold every code point) and 32 bits on others.

In practice, POSIX-like platforms and APIs generally accept char based strings that are expected to be null-terminated and ASCII compatible, but most of them simply pass through anything outside of the ASCII range without caring about it, so are mostly compatible with legacy encodings like the ISO-8859-* standards as well as UTF-8. In practice, UTF-8 has become the de-facto standard encoding for such uses.

Windows, on the other hand, chose to go with wide characters for its Unicode support. It uses a 16-bit wchar_t, and so uses UTF-16. A good number of other platforms, such as Carbon and Cocoa on OS X, Java, and JavaScript, also chose UTF-16 as their default string encoding (in some cases by having chosen UCS-2 before Unicode was extended, with UTF-16 designed for backwards compatibility).
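From Rust, crossing that boundary is just another conversion; a small sketch (to_wide is a hypothetical helper name, not part of std):

```rust
// A small sketch of crossing the UTF-8/UTF-16 boundary from Rust;
// `to_wide` is a hypothetical helper name, not a std function.
fn to_wide(s: &str) -> Vec<u16> {
    // Wide-character Windows APIs generally expect null termination.
    s.encode_utf16().chain(std::iter::once(0)).collect()
}

fn main() {
    let wide = to_wide("héllo");
    assert_eq!(wide.last(), Some(&0)); // trailing null terminator
}
```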

C# uses UTF-16, like the rest of the Windows API. It has APIs for encoding and decoding between its internal format and a variety of external formats. But it only supports one native string type, which is UTF-16 encoded.

In general, you just decode such text into the native string type on input (in Rust's case, UTF-8), and encode it on output. There are no character sets in widespread use that are anything other than a subset of Unicode. The encoding crate mentioned earlier is a good way to do that. Rust has so far erred towards relying on the crate ecosystem for things like this rather than putting them in the standard library; as a separate crate, an encoding library can be updated and maintained independently of the Rust release cycle, which gives you a lot more flexibility than if updates had to be in lockstep with the standard library.

It's also possible to work with text in other encodings as a byte array, but there are no convenient string methods provided by the standard library for handling that; you would need to roll your own. I don't see that changing, as it is generally much simpler to decode and encode at the boundaries than to support multiple incompatible encodings and code unit widths in the standard library.
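For instance, even a simple substring search has to be hand-rolled at the byte level; a sketch (find_bytes is an illustrative helper, and the Shift_JIS bytes are an assumption on my part):

```rust
// A sketch of rolling your own byte-level handling for text in a legacy
// encoding; `find_bytes` is an illustrative helper, not a std method.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    // Shift_JIS bytes for "日本" (0x93 0xFA = 日, 0x96 0x7B = 本):
    let sjis: &[u8] = &[0x93, 0xFA, 0x96, 0x7B];
    assert_eq!(find_bytes(sjis, &[0x96, 0x7B]), Some(2));
    // Caveat: a naive byte search can match inside a multi-byte
    // character, one of the many details you'd have to handle yourself.
}
```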


I think this site answers some of the questions from the thread opener: http://utf8everywhere.org
