Unicode strings


#1

Hello.

I was trying to figure out why so many technologies use UTF-8 strings instead of UTF-16. It makes no sense to me. UTF-16 characters are 2 bytes wide, no matter which character is represented. So by taking the first 2 bytes of a UTF-16 string, you have the first character, no matter what. This eliminates checks and operations which may cause overhead in intensive use of UTF-8 strings(?).

A point would be that UTF-8 strings use less space. That’s partially true. “abc” is 3 bytes in UTF-8 and 6 in UTF-16. But 3 characters from a language like Japanese, Chinese, Russian or Korean, for example (I do not know the exact length of these), can take 9 or even 12 bytes instead of the 6 bytes I expected from UTF-16.
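(A small Rust sketch that measures the actual sizes; the three-character Japanese sample 日本語 is just an arbitrary choice:)

```rust
fn main() {
    // "abc": 3 ASCII characters.
    let ascii = "abc";
    // "日本語": 3 Japanese characters, all inside the BMP.
    let japanese = "日本語";

    // Rust strings are UTF-8, so len() is the UTF-8 byte count.
    println!("UTF-8:  {} / {} bytes", ascii.len(), japanese.len());

    // encode_utf16() yields 16-bit code units; multiply by 2 for bytes.
    println!(
        "UTF-16: {} / {} bytes",
        ascii.encode_utf16().count() * 2,
        japanese.encode_utf16().count() * 2
    );
    // Prints 3 / 9 bytes for UTF-8 and 6 / 6 bytes for UTF-16.
}
```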

UTF-8 is widely used in many programming languages and technologies, like HTML5 and Rust itself. The only language I see that uses UTF-16 is Java, and many people call it “obsolete”. (I don’t see a point in using UTF-32. 4 billion characters? Maybe for encoding some alien languages and such?)

I searched a little bit about the subject (maybe not enough) but I couldn’t find an answer. Why do people often choose UTF-8 instead of UTF-16? Maybe Google “sponsored” it, or people think it’s better because of who created/designed it, who are quite famous?

Example: how do you check whether a UTF-16 string is invalid? Its size (in bytes) is an odd number (among other checks, but that’s a fast check). And for a UTF-8 string? It demands a full check of the entire string.
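For concreteness, here is a minimal Rust sketch of what both validity checks look like in practice (the byte and code-unit values are arbitrary samples):

```rust
fn main() {
    // UTF-8 validation: a full scan of the bytes.
    let good_utf8: &[u8] = &[0x61, 0x62, 0x63]; // "abc"
    let bad_utf8: &[u8] = &[0xFF, 0x61];        // 0xFF can never appear in UTF-8
    assert!(std::str::from_utf8(good_utf8).is_ok());
    assert!(std::str::from_utf8(bad_utf8).is_err());

    // UTF-16 validation: an even byte count alone isn't the whole story;
    // String::from_utf16 checks the 16-bit code units as well.
    let utf16_units: &[u16] = &[0x0061, 0x0062, 0x0063]; // "abc"
    assert!(String::from_utf16(utf16_units).is_ok());
}
```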


#2

That’s not actually true. It’s still a variable-length encoding where some Unicode code points need two UTF-16 code units. See the Wikipedia page for details:

https://en.wikipedia.org/wiki/UTF-16
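A quick Rust illustration (U+1D11E MUSICAL SYMBOL G CLEF is just an arbitrary example of such a code point):

```rust
fn main() {
    let clef = '\u{1D11E}'; // 𝄞, outside the Basic Multilingual Plane

    // One Unicode code point, but two UTF-16 code units (a surrogate pair)...
    assert_eq!(clef.len_utf16(), 2);

    // ...and four UTF-8 bytes.
    assert_eq!(clef.len_utf8(), 4);
}
```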


#3

Example: How to check if a UTF-16 string is invalid? Its size (in bytes) is an odd number.

This is not exactly correct. The string must also not include unpaired surrogates: http://unicode.org/faq/utf_bom.html#utf16-7
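For example, in Rust (a sketch; 0xD800 is an arbitrary lone high surrogate):

```rust
fn main() {
    // An even number of bytes, but still invalid UTF-16:
    // 0xD800 is a high surrogate with no following low surrogate.
    let unpaired: &[u16] = &[0x0061, 0xD800];
    assert!(String::from_utf16(unpaired).is_err());

    // A proper surrogate pair (encoding U+1D11E) decodes fine.
    let paired: &[u16] = &[0xD834, 0xDD1E];
    assert_eq!(String::from_utf16(paired).unwrap(), "\u{1D11E}");
}
```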


#4

No, UTF-16 is variable-length as well. In addition to the Wikipedia page, read this page:

http://utf8everywhere.org/


#5

It’s a common misconception that characters are always two bytes wide in UTF-16. The truth is that by now there are so many Unicode characters that, both in UTF-8 and UTF-16, a character can be up to four bytes wide. From this point of view, UTF-16 has the same disadvantage as UTF-8 (variable character width) while not having its advantage of ASCII compatibility.
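A minimal Rust sketch of both points (😀 is just an arbitrary character outside the BMP):

```rust
fn main() {
    // Outside the BMP: four bytes in UTF-8 *and* four bytes (two code units) in UTF-16.
    let emoji = '😀';
    assert_eq!(emoji.len_utf8(), 4);
    assert_eq!(emoji.len_utf16() * 2, 4);

    // ASCII compatibility: the UTF-8 bytes of an ASCII string are exactly its ASCII bytes.
    let ascii = "hello";
    assert_eq!(ascii.as_bytes(), b"hello");
}
```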

I’m not a pundit on the topic, but I’ve recently read the UTF-8 Everywhere Manifesto. Of course, the title already is heavily biased, so you might want to consult other sources as well. But this website lists some very extensive reasons for the choice of one encoding over the other.

PS: Whoops, was a bit too slow here. :sweat_smile:


#6

As several people have pointed out, UTF-16 is not fixed width; you’re thinking of the older, no longer relevant UCS-2, which was replaced by UTF-16 in Unicode 2.0 back in 1996.

The original design of Unicode was to have a single set of fixed-width 16-bit characters, but it turns out that that isn’t actually sufficient: it requires compromises like CJK unification that were very controversial, requires ignoring obscure or archaic characters that people still need to use (they may be part of people’s names, appear in academic literature, or the like), and provides too little expansion room.

For one example of commonly used characters that are outside of the original 16-bit range (the BMP, Basic Multilingual Plane): the Emoticons (Emoji) block is outside of the BMP.

Even if you did have each codepoint take a fixed amount of space, as in UCS-4 (UTF-32), for most actual text-processing purposes you still have to deal with variable-length sequences. A base character plus a combining character, for instance, is rendered as a single glyph but uses two codepoints.
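For example, in Rust (“é” spelled with a combining acute accent is just one arbitrary illustration):

```rust
fn main() {
    // LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT:
    // rendered as a single glyph "é", but two code points.
    let decomposed = "e\u{0301}";
    assert_eq!(decomposed.chars().count(), 2);

    // The precomposed form U+00E9 is a single code point that looks the same.
    let precomposed = "\u{00E9}";
    assert_eq!(precomposed.chars().count(), 1);

    // Byte-wise they differ, even though they render identically.
    assert_ne!(decomposed, precomposed);
}
```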

It is unfortunate that UTF-16 has become as popular as it has. It is used not just in Java, but also in APIs in Windows, OS X, JavaScript, and so on. It means that you need two different, incompatible string types in many APIs (one for traditional 8-bit C strings, one for “wide” strings), and it means that lots of people assume, like you did, that it’s a fixed-width, 16-bit encoding and make mistakes on that basis. UTF-8 already existed at the time of UTF-16’s introduction, and it has a much lower impact on data types and APIs, allowing strings using 8-bit code units to be used everywhere; but there was enough inertia from the original idea that Unicode would be a fixed-width 16-bit type that a lot of work had already gone into making that happen.

There are lots of good sources on this subject already. The UTF-8 Everywhere Manifesto provides some good arguments in favor of UTF-8, including why it should be used even on Windows, where the Windows API prefers UTF-16. The UTF-8 FAQ provides some good background information, as well as information on UTF-8 on Unix-like systems. “Hello World, or Καλημέρα κόσμε, or こんにちは 世界” by Rob Pike and Ken Thompson describes how UTF-8 made it easy to add Unicode support to Plan 9 incrementally, without having to rewrite the world with different APIs and data structures, and the UTF-8 history from Rob Pike describes the original design process. There are also lots of other answers available on StackOverflow, such as this previous answer I wrote, and on various discussion forums, such as yet another thread on Reddit, about why UTF-8 is preferable to UTF-16.

As an aside, I’m linking to my own previous threads not because I think they’re the best sources available (the original sources I linked are better, and there are lots of other discussions out there), but because they contain several arguments I have already made in the past, so I’m trying not to repeat them here.


#7

Well, this explains everything. I thought there were no more than 65k characters, but considering the emoticons, symbols and so on, UTF-8 does indeed seem better: if a character needs more bytes, one byte is added at a time instead of 2 or 4, increasing only as necessary.
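(A sketch of that in Rust, with arbitrary sample characters from each width class:)

```rust
fn main() {
    // UTF-8 grows one byte at a time; UTF-16 jumps from 2 bytes straight to 4.
    for ch in ['a', 'é', '日', '😀'] {
        println!(
            "U+{:04X}: {} byte(s) in UTF-8, {} byte(s) in UTF-16",
            ch as u32,
            ch.len_utf8(),
            ch.len_utf16() * 2
        );
    }
    // Prints widths 1, 2, 3, 4 for UTF-8 and 2, 2, 2, 4 for UTF-16.
}
```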

I’ve read more about UTF-8 than UTF-16, and I found the encoding very interesting (how multi-byte characters are handled and so on), but quite inefficient (I thought UTF-16 had a fixed length, like many others do, so no unnecessary computations about wide characters). If UTF-16 could handle every character and its content were only the value itself (no marker bits, etc.), it would be more efficient, but that isn’t true.

I’m sorry for the ignorance, and thanks for the replies. I’m somewhat surprised that no one judged my stupid question, and I’m glad for that.


#8

It’s absolutely not a stupid question! Text encoding is really hard and even the best of us goof it up now and then. :slight_smile: