Should UTF-16 be considered harmful?
[+89] [19] Artyom
[2009-06-26 16:04:18]
[ unicode utf-16 considered-harmful ]
[ http://stackoverflow.com/questions/1049947] [DELETED]

I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?"

Why do I ask this question?

How many programmers are aware of the fact that UTF-16 is actually a variable-length encoding? By this I mean that there are code points that, represented as surrogate pairs, take more than one element.
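
To make the point concrete (a minimal sketch added for illustration, not part of the original tests; requires C++11 or later): a single code point outside the BMP occupies two UTF-16 code units, so any code that treats code units as characters miscounts.

#include <iostream>
#include <string>

int main()
{
    // U+1D11E MUSICAL SYMBOL G CLEF: one code point, but outside the BMP,
    // so UTF-16 needs a surrogate pair (two 16-bit code units) to encode it.
    std::u16string s = u"\U0001D11E";
    std::cout << s.size() << "\n";                                            // prints 2, not 1
    std::cout << std::hex << unsigned(s[0]) << " " << unsigned(s[1]) << "\n"; // d834 dd1e
}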

I know: lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, the Win32 APIs, the Qt GUI libraries, the ICU Unicode library, etc. However, with all of that, there are lots of basic bugs in the processing of characters outside the BMP (characters that should be encoded using two UTF-16 elements).

For example, try to edit one of these characters:

You may miss some, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking at them in the Unicode Character reference [5].

For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:

It seems that such bugs are extremely easy to find in many applications that use UTF-16.

So... Do you think that UTF-16 should be considered harmful?

(3) This should be a wiki - rijipooh
I tried copying the characters to a filename and tried to delete them and had no problems. Some Unicode characters read right to left and keyboard input handling sometimes changes to accommodate that (depending on the program used). Can you post the numeric codes for the specific characters you are having trouble with? - CiscoIPPhone
Have you tried to work with them in Notepad to see how this works? For example, edit a file name with one of these characters, put the cursor to the right of the character and press backspace. You'll see that in both Notepad and the file-name editing dialog it takes two presses of "backspace" to remove this character. - Artyom
(1) The double backspace behavior is mostly intentional blogs.msdn.com/michkap/archive/2005/12/21/506248.aspx - CiscoIPPhone
(5) Not really correct. Let me explain: if you write "שָׁ", a compound character that consists of "ש",‎ "ָ" and "ׁ" (vowels), then removing each one of them is logical: you remove one code point when you press "backspace", and remove the whole character including the vowels when you press "del". But you never produce an illegal state of the text -- illegal code points. Thus, the situation where you press backspace and get illegal text is incorrect. - Artyom
Are you referring to how sin and shin are composed of two code points, and by deleting the code-point for the dot you get an "illegal" character? - patjbs
(2) No, you get "vowelless" writing. It is totally legal. More than that, in most cases vowel marks like these (shin/sin dots) are almost never written, unless they are required to clarify something that is not obvious from context: שׁם and שׂם are two different words, but from context you know which one the vowelless שם means. - Artyom
(3) CiscoIPPhone: If a bug is "reported several different times, by many different people", and then a couple years later a developer writes on a dev blog that "Believe it or not, the behavior is mostly intentional!", then (to put it mildly) I tend to think it's probably not the best design decision ever made. :-) Just because it's intentional doesn't mean it's not a bug. - Ken
For the record, I don't have problems with any of these characters in Apple's TextEdit.app (which uses Cocoa and thus UTF-16), but trying to insert them in Emacs (which uses a variant of UTF-8 internally) produces garbage. I do think that such bugs are not the fault of the character encoding, but of the lack of competence of the programmers involved. - Philipp
BTW, I've just checked editing these letters, and they don't give me any problems in either Opera or Windows 7. Opera seems to edit them properly, and so does Notepad. A file with these letters in its name was created successfully. - Malcolm
@Malcolm, first, there is no problem creating such files - the question is about editing them. I tested on XP; maybe in 7 MS fixed this issue. Take a look at how backspace works: do you need to hit it once or twice? - Artyom
Once. I specially checked for this issue, and in Windows 7 the problem with the characters beyond BMP seems to be gone. Maybe this problem had been solved even in Vista. - Malcolm
@Malcolm - even though that does not make UTF-16 less harmful :-) - Artyom
Well, I don't think that the mere existence of crappy implementations indicates harmfulness of the standard at all. :p This is just an update on the current situation: how problematic characters beyond the BMP are now in Windows (and Opera). - Malcolm
(5) Great post. UTF-16 is indeed the "worst of both worlds": UTF-8 is variable-length, covers all of Unicode, requires a transformation algorithm to and from raw code points, reduces to plain ASCII for ASCII text, and has no endianness issues. UTF-32 is fixed-length, requires no transformation, but takes up more space and has endianness issues. So far so good: you can use UTF-32 internally and UTF-8 for serialization. But UTF-16 has no benefits: it's endian-dependent, it's variable-length, it takes lots of space, and it's not ASCII-compatible. The effort needed to deal with UTF-16 properly could be spent better on UTF-8. - Kerrek SB
@Kerrek: Great summary. - tchrist
UTF-8 has the same caveats as UTF-16. Buggy UTF-16 handling code exists; although probably less than buggy UTF-8 handling code (most code handling UTF-8 thinks it's handling ASCII, Windows-1252, or 8859-1) - Ian Boyd
[+57] [2009-12-06 13:20:21] Pavel Radzivilovsky [ACCEPTED]

Opinion: Yes, UTF-16 should be considered harmful. The very reason it exists is that some time ago there was a misguided belief that widechar was going to be what UCS-4 now is.

Despite the "anglo-centrism" of UTF-8, it should be considered the only useful encoding for text. One can argue that source code of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed. But when they do exist, text is not only for human readers.

On the other hand, the UTF-8 overhead is a small price to pay, while it has significant advantages - such as compatibility with unaware code that just passes strings around as char*. This is a great thing. There are few useful characters which are SHORTER in UTF-16 than they are in UTF-8.

I believe that all other encodings will die eventually. This implies that MS-Windows, Java, ICU, and Python will stop using UTF-16 as their favorite. After long research and discussions, the development conventions at my company [1] ban using UTF-16 anywhere except in OS API calls, and this despite the importance of performance in our applications and the fact that we use Windows. Conversion functions were developed to convert always-assumed-UTF-8 std::strings to native UTF-16, which Windows itself does not support properly [2].

To people who say "use what is needed where it is needed", I say: there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise. In particular, I think adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++0x. What must be demanded from STL implementations, though, is that every std::string or char* parameter be considered Unicode-compatible.

I am also against the "use what you want" approach. I see no reason for such liberty. There's enough confusion on the subject of text, resulting in all this broken software. Having said the above, I am convinced that programmers must finally reach consensus on UTF-8 as the one proper way. (I come from a non-ASCII-speaking country and grew up on Windows, so I'd be the last person expected to attack UTF-16 on religious grounds.)

I'd like to share more information on how I do text on Windows, and what I recommend to everyone else for compile-time checked Unicode correctness, ease of use and better multi-platform support of the code. The suggestion substantially differs from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research of these recommendations resulted in the same conclusion. So here goes:

  • Do not use wchar_t or std::wstring in any place other than adjacent point to APIs accepting UTF-16.
  • Don't use _T("") or L"" UTF-16 literals (These should IMO be taken out of the standard, as a part of UTF-16 deprecation).
  • Don't use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
  • Nevertheless, keep _UNICODE always defined, so that passing char* strings to WinAPI does not get silently compiled.
  • std::string and char* anywhere in the program are considered UTF-8 (unless said otherwise)
  • All my strings are std::string, though you can pass char* or string literal to convert(const std::string &).
  • Only use Win32 functions that accept wide chars (LPWSTR), never those which accept LPTSTR or LPSTR. Pass parameters this way:

    ::SetWindowTextW(Utils::convert(someStdString or "string literal").c_str())
    

    (The policy uses conversion functions below.)

  • With MFC strings:

    CString someoneElse; // something that arrived from MFC. Converted as soon as possible, before passing any further away from the API call:
    
    std::string s = str(boost::format("Hello %s\n") % Convert(someoneElse));
    AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK);
    
  • Working with files, filenames and fstream on Windows:

    • Never pass std::string or const char* filename arguments to fstream family. MSVC STL does not support UTF-8 arguments, but has a non-standard extension which should be used as follows:
    • Convert std::string arguments to std::wstring with Utils::Convert:

      std::ifstream ifs(Utils::Convert("hello"),
                        std::ios_base::in |
                        std::ios_base::binary);
      

      We'll have to manually remove the convert, when MSVC's attitude to fstream changes.

    • This code is not multi-platform and may have to be changed manually in the future
    • See fstream unicode research/discussion case 4215 for more info.
    • Never produce text output files with non-UTF8 content
    • Avoid using fopen() for RAII/OOD reasons. If necessary, use _wfopen() and WinAPI conventions above.

// For interface to win32 API functions
std::string convert(const std::wstring& str, unsigned int codePage /*= CP_UTF8*/)
{
    // Ask me for implementation..
    ...
}

std::wstring convert(const std::string& str, unsigned int codePage /*= CP_UTF8*/)
{
    // Ask me for implementation..
    ...
}

// Interface to MFC
std::string convert(const CString &mfcString)
{
#ifdef UNICODE
    return Utils::convert(std::wstring(mfcString.GetString()));
#else
    return mfcString.GetString();   // This branch is deprecated.
#endif
}

CString convert(const std::string &s)
{
#ifdef UNICODE
    return CString(Utils::convert(s).c_str());
#else
    Exceptions::Assert(false, "Unicode policy violation. See W569"); // This branch is deprecated as it does not support unicode
    return s.c_str();   
#endif
}
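
For illustration only, here is one possible sketch of what such conversion helpers could look like, built directly on the Win32 MultiByteToWideChar / WideCharToMultiByte calls. This is a guess at the omitted implementation, not the author's actual code, and error handling is kept minimal:

// A possible implementation sketch (not the author's actual code):
// thin wrappers over Win32 MultiByteToWideChar / WideCharToMultiByte.
#include <windows.h>
#include <string>

std::wstring convert(const std::string& str, unsigned int codePage = CP_UTF8)
{
    if (str.empty())
        return std::wstring();
    // First call computes the required length, second call does the conversion.
    int len = MultiByteToWideChar(codePage, 0, str.c_str(), (int)str.size(), NULL, 0);
    std::wstring result(len, L'\0');
    MultiByteToWideChar(codePage, 0, str.c_str(), (int)str.size(), &result[0], len);
    return result;
}

std::string convert(const std::wstring& str, unsigned int codePage = CP_UTF8)
{
    if (str.empty())
        return std::string();
    int len = WideCharToMultiByte(codePage, 0, str.c_str(), (int)str.size(), NULL, 0, NULL, NULL);
    std::string result(len, '\0');
    WideCharToMultiByte(codePage, 0, str.c_str(), (int)str.size(), &result[0], len, NULL, NULL);
    return result;
}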
[1] http://www.visionmap.com
[2] http://blogs.msdn.com/michkap/archive/2005/12/21/506248.aspx

(2) I would like to add a little comment. Most Win32 "ASCII" functions receive strings in the local encoding. For example, std::ifstream can accept a Hebrew file name if the locale encoding is a Hebrew one like 1255. All that is needed to support these encodings on Windows is for MS to add a UTF-8 code page to the system. This would make life much simpler: all the "ASCII" functions would become fully Unicode-capable. - Artyom
FWIW the AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK) example should probably really have been a call to a wrapper of that function that accepts std::string(s). Also, the Assert(false) in the functions toward the end should be replaced with static assertions. - gigantt.com
(10) I can't agree. The advantages of utf16 over utf8 for many Asian languages completely dominate the points you make. It is naive to hope that the Japanese, Thai, Chinese, etc. are going to give up this encoding. The problematic clashes between charsets are when the charsets mostly seem similar, except with differences. I suggest standardising on: fixed 7bit: iso-irv-170; 8bit variable: utf8; 16bit variable: utf16; 32bit fixed: ucs4. - Charles Stewart
(11) @Charles: thanks for your input. True, some BMP characters are longer in UTF-8 than in UTF-16. But, let's face it: the problem is not in bytes that BMP Chinese characters take, but the software design complexity that arises. If a Chinese programmer has to design for variable-length characters anyway, it seems like UTF-8 is still a small price to pay compared to other variables in the system. He might use UTF-16 as a compression algorithm if space is so important, but even then it will be no match for LZ, and after LZ or other generic compression both take about the same size and entropy. - Pavel Radzivilovsky
(6) What I basically say is that simplification offered by having One encoding that is also compatible with existing char* programs, and is also the most popular today for everything is unimaginable. It is almost like in good old "plaintext" days. Want to open a file with a name? No need to care what kind of unicode you are doing, etc etc. I suggest we, developers, confine UTF-16 to very special cases of severe optimization where a tiny bit of performance is worth man-months of work. - Pavel Radzivilovsky
(2) Well, if I had to choose between UTF-8 and UTF-16, I would definitely stick to UTF-8 as it has no BOM, is ASCII-compatible and has the same encoding scheme for any plane. But I have to admit that UTF-16 is simpler and more efficient for most BMP characters. There's nothing wrong with UTF-16 except the psychological aspects (mostly that fixed-size isn't fixed-size). Sure, one encoding would be better, but since both UTF-8 and UTF-16 are widely used, they have their advantages. - Malcolm
(1) @Malcolm: UTF-8, unfortunately, has a BOM too (0xEFBBBF). As silly as it looks (no byte order problem with single-byte encoding), this is true, and it is there for a different reason: to manifest this is a UTF stream. I have to disagree with you about BMP efficiency and UTF-16 popularity. It seems that majority of UTF-16 software do not support it properly (ex. all win32 API - which I am a fan of) and this is inherent, the easiest way to fix these seems to switch them to other encoding. The efficiency argument is only true for very narrow set of uses (I use hebrew, and even there it is not). - Pavel Radzivilovsky
Well, what I meant is that you don't have to worry about byte order. UTF-8 can have a BOM indeed (it is actually the UTF-16 big-endian BOM encoded in 3 bytes), though it's neither required nor recommended according to the standard. As for the APIs, I think the problem is that they were designed when surrogate pairs were either non-existent yet, or not really adopted. And when something gets patched up, it's never as good as redesigning from scratch. The only (painful) way is to drop any backwards compatibility and redesign the APIs. Should they switch to UTF-8 in the process, I don't know. - Malcolm
@Malcolm, I think the natural way of this redesign is through changing the existing ANSI APIs. This way existing broken programs will unbreak (see my answer). This adds to the argument: UTF-16 must die. - Pavel Radzivilovsky
(1) I'm sorry, I didn't really get the idea of why the transition to UTF-8 should be less painful. I also think that the inconsistency in C++ makes it worse. Say, Java is very specific about characters: char[] is no more than a char array, String is a string and Character is a character. Meanwhile, C++ is a mess with all the new stuff added to an existing language. To my mind, they should have abandoned any backwards compatibility and designed C++ in a way that doesn't allow mixing up structured programming and OOP, or Unicode and other encodings. Not that I want to start a holy war, that's merely my opinion. - Malcolm
(2) UTF-8's disadvantage is NOT a small price to pay at all. Looking for any character is an O(n) operation, and other more complex operations can be far, far worse than with UTF-16. Also UTF-8 is variable-length, just as UTF-16, so what's the point? UTF-8 was designed for storage and interoperability with ASCII. UTF-16 is the preferred way to store strings in memory, as anything outside the BMP is incredibly rare (you're writing in Klingon?). With a little trick, storing characters outside of the BMP in a hash or map, UTF-16 can have constant processing time. - iconiK
(2) @iconiK: non-english BMP is also quite rare. Consider all program sources and markup languages. One should have very good reasons to use UTF-16. See what is going on in Linux world wrt unicode to measure the price of breaking changes. - Pavel Radzivilovsky
(8) Linux has had a specific requirement when choosing to use UTF-8 internally: compatibility with Unix. Windows didn't need that, and thus when the developers implemented Unicode, they added UCS-2 versions of almost all functions handling text and made the multibyte ones simply convert to UCS-2 and call the other ones. They later replaced UCS-2 with UTF-16. Linux on the other hand kept to 8-bit encodings and thus used UTF-8, as it's the proper choice in that case. - iconiK
(1) you may wish to read my answer again. Windows does not support UTF-16 properly to date. Also, the reason for choosing UCS-2 was different. Again, see my answer. For linux, I believe the main reason was compatibility not with unix but with existing code - for instance, if your ANSI app copies files, getting names from command arguments and calling system APIs, it will remain completely intact with UTF-8. Isn't that wonderful? - Pavel Radzivilovsky
(2) @Pavel: The bug you linked to (Michael Kaplan's blog entry) has long been resolved by now. Michael said in the post already that it's fixed in Vista and I can't reproduce it on Windows 7 as well. While this doesn't fix legacy systems running on XP, saying that »there is still no proper support« is plain wrong. - Joey
@Johannes: [1] many thanks for the info. [2] IMO a programmer, today, should be able to write programs that support windows XP. It is still a popular one, and I don't know of a windows update that fixes it. - Pavel Radzivilovsky
Well, the program works just fine; it just has a little trouble dealing with astral planes, but that's an OS issue, not one with your program. It's like asking for current versions of Uniscribe to be backported to old OSes so that people on XP can enjoy a few scripts that would render improperly before. It's not something MS does. Besides, XP is almost a decade old by now and supporting it becomes a major burden in some cases (see for example the reasoning why Paint.NET will require Vista with its 4.0 release). Mainstream support for that OS has already ended, too; only security bugs are fixed now - Joey
(1) Still not convincing to use UTF-16 for in-memory presentation of strings on windows :) I wish Windows7 guys would extend their support of already existing #define of CP_UTF8 instead.. - Pavel Radzivilovsky
(1) @Pavel Radzivilovsky: I fail to see how your code, using UTF-8 everywhere, will protect you from bugs in the Windows API? I mean: You're copying/converting strings for all calls to the WinAPI that use them, and still, if there is a bug in the GUI, or the filesystem, or whatever system handled by the OS, the bug remains. Now, perhaps your code has a specific UTF-8 handling functions (search for substrings, etc.), but then, you could have written them to handle UTF-16 instead, and avoid all this bloated code (unless you're writing cross-platform code... There, UTF-8 could be a sensible choice) - paercebal
(5) @Pavel Radzivilovsky: BTW, your writings about "I believe that all other encodings will die eventually. This implies that MS-Windows, Java, ICU, and Python will stop using UTF-16 as their favorite." and "In particular, I think adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++0x." are either quite naive or very very arrogant. And this is coming from someone coding at home with Linux and who is happy with UTF-8 chars. To put it bluntly: It won't happen. - paercebal
@paercebal: If the majority of the code is API calls, this is very simple code. Typically, the majority of code dealing with strings is libraries that treat them as cookies, and they are optimized for that. Hence, the bloating argument fails. As for the 'favorite UTF-16' for ICU and Python, this is very questionable: these tools use UTF-16 internally, and changing it as part of their evolution is the easiest; it can happen on any major release, because it doesn't break the interfaces. - Pavel Radzivilovsky
(2) In ICU we already see more and more UTF-8 interfaces and optimizations. However, UTF-16 works perfectly well, and makes complicated lookup efficient, more than with UTF-8. We will not see ICU drop UTF-16 internally. UTF-16 in memory, UTF-8 on the wire and on disk. All is good. - Steven R. Loomis
@Steven, it looks like differentiating between wire and RAM is not as small a thing as it may seem. BTW, comparison is cheaper with UTF-8. I agree that ICU is certainly a major player in this market, and there's no need to "drop" support of anything. The simplification of application design and testing with UTF-8 is exactly what will, in my humble opinion, drive UTF-16 to extinction, and the sooner the better. - Pavel Radzivilovsky
(2) @Pavel Radzivilovsky I meant, drop UTF-16 as the internal processing format. Can you expand on 'not a small thing'? And, anyways, UTF-16/UTF-8/UTF-32 have a 1:1:1 mapping. I'm much more interested in seeing non-Unicode encodings die. As far as UTF-8 goes for simplification, you say "they can just pass strings as char*"- right, and then they assume that the char* is some ASCII-based 8-bit encoding. Plenty of errors creep in when toupper(), etc, is used on UTF-8. It's not wonderful, but it is helpful. - Steven R. Loomis
@Steve First and foremost, I agree about non-Unicode. There's no argument about that. Practically, it already happened; they are already dead, in this exact sense: any non-Unicode operation on a string is considered a bug just like any other software bug, or a 'text crime' in my company's slang. It is true that char* misleads many into Unicode bugs as well. Good luck with toupper() on a UTF-8 string, or, say, with assuming that ICU's toupper does not change the number of characters (as in German eszett converting to SS). After the standard has been established, there's no more reason for bugs. - Pavel Radzivilovsky
(4) @Steve, 2; and then we come to a more subtle thing, which is everything around human engineering and safety and designing proper way of work for a developer to do less and for the machine to do more. This is exactly where UTF-16 doesn't fit. Most applications do not reverse or even sort strings. Most often strings are treated as cookies, such as a file name here and there, concatenated here and there, embedded programming languages such as SQL and other really simple transformations. In this world, there's very little reason to have different format in RAM than on the wire. - Pavel Radzivilovsky
@Pavel: "...that widechar is going to be what UCS-4 now is." this is incorrect in general since widechar is not fixed to be 2 bytes in size, unless you restrict yourself to Windows. You should write "UCS-2" instead of "widechar". - ybungalobill
@ybungalobill Right; I should edit this. In fact, I will do this when wchar_t is standardized to hold one Unicode character. - Pavel Radzivilovsky
(2) @Pavel: In fact your sentence is just wrong, because wchar_t is not meant to be UTF-16, it has absolutely no connection to UTF-16, and it is already UCS-4 on some compilers. wchar_t is a C++ type that is actually (quote from the standard) "a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales". So the only problem here is your system (windows) that doesn't have a UCS-4 locale. - ybungalobill
(3) I'm mostly impressed with how this long rant completely fails in arguing its point, at least outside the narrow world of having to deal with UTF-16 in C and pointers. That might be considered dangerous, but that is if anything C's fault, not UTF-16. - Lennart Regebro
(2) Well, as I mentioned earlier, I didn't find this post very convincing either. This post goes into details of handling UTF-16 in certain APIs or languages. If the software doesn't handle the standard properly, that's a problem. But what's wrong with the encoding itself anyway? If some software implements only half of the standard, that's not the standard's problem. - Malcolm
There are so many things that are wrong in those bullets that they can't even be captured in a comment. But probably the most dangerous one is to store UTF-8 in std::string in a Windows environment. The problem is, everything in the Windows world assumes that char* strings are in the current system code page. Use one wrong API on that string, and you are assured of many hours of debugging. The other problem is the religious recommendation of UTF-8 no matter what. "there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise" is pushed, with no advantage given - Mihai Nita
Wow, that's a really insightful comment. I'll start converting all my apps to UTF-8 right now. Thanks! - asdf
@Mihai there is this advantage. you’ll start noticing it when you don’t do it and get cryptic runtime encoding exceptions nobody can possibly understand nor track back to its source. python 3 has made the jump, and guess what: the frequent encoding issues i had in python 2 magically disappeared completely. - flying sheep
[+33] [2010-03-18 01:48:04] Daniel Newby

Unicode codepoints are not characters! Sometimes they are not even glyphs (visual forms).

Some examples:

  • Roman numeral codepoints like "ⅲ". (A single character that looks like "iii".)
  • Accented characters like "á", which can be represented as either a single combined character "\u00e1" or a character and separated diacritic "\u0061\u0301".
  • Characters like Greek lowercase sigma, which have different forms for middle ("σ") and end ("ς") of word positions, but which should be considered synonyms for search.
  • Unicode discretionary hyphen U+00AD, which might or might not be visually displayed, depending on context, and which is ignored for semantic search.

The only ways to get Unicode editing right is to use a library written by an expert, or become an expert and write one yourself. If you are just counting codepoints, you are living in a state of sin.
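
As a small illustration of the combining-character point (my own sketch, not part of the original answer; it assumes C++11/14/17, where u8 literals are plain char arrays), the same visible character can be either one code point or two:

#include <iostream>
#include <string>

int main()
{
    // The same visible character "á" in two canonically equivalent forms:
    std::string precomposed = u8"\u00E1";   // U+00E1, precomposed: 2 UTF-8 bytes
    std::string decomposed  = u8"a\u0301";  // U+0061 + U+0301 combining acute: 3 bytes
    std::cout << precomposed.size() << " " << decomposed.size() << "\n";  // 2 3
    std::cout << (precomposed == decomposed) << "\n";  // 0: equal only after normalization
}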


(3) This. Very much this. UTF-16 can cause problems, but even using UTF-32 throughout can (and will) still give you issues. - bcat
What is a character? You can define a code point as a character and get by pretty much just fine. If you mean a user-visible glyph, that’s something else. - tchrist
[+27] [2009-07-24 08:21:31] Mihai Nita

There is a simple rule of thumb on what Unicode Transformation Form (UTF) to use:

  • UTF-8 for storage and communication
  • UTF-16 for data processing
  • you might go with UTF-32 if most of the platform APIs you use are UTF-32 (common in the UNIX world)

Most systems today use utf-16 (Windows, Mac OS, Java, .NET, ICU, Qt). Also see this document: http://unicode.org/notes/tn12/

Back to "UTF-16 as harmful", I would say: definitely not.

People who are afraid of surrogates (thinking that they transform Unicode into a variable-length encoding) don't understand the other (way bigger) complexities that make mapping between characters and a Unicode code point very complex: combining characters, ligatures, variation selectors, control characters, etc.

Just read this series here http://blogs.msdn.com/michkap/archive/2009/06/29/9800913.aspx and see how UTF-16 becomes an easy problem.


Would upvote twice if I could. - Andrey Tarantsov
(2) Please add some examples where UTF-32 is common in the UNIX world! - maxschlepzig
No, you do not want to use UTF-16 for data processing. It's a pain in the ass. It has all the disadvantages of UTF-8 but none of its advantages. Both UTF-8 and UTF-32 are clearly superior to the vicious hack previously known as Mrs UTF-16, whose maiden name was UCS-2. - tchrist
Just yesterday I found a bug in the Java core String class’s equalsIgnoreCase method (also others in the String class) that would never have been there had Java used either UTF-8 or UTF-32. There are millions of these sleeping bombshells in any code that uses UTF-16, and I am sick and tired of them. UTF-16 is a vicious pox that plagues our software with insidious bugs forever and ever. It is clearly harmful, and should be deprecated and banned. - tchrist
[+24] [2009-06-26 16:14:27] JacquesB

There is nothing wrong with the UTF-16 encoding. But languages that treat the 16-bit units as characters should probably be considered badly designed. Having a type named 'char' which does not always represent a character is pretty confusing. Since most developers will expect a char type to represent a code point or character, much code will probably break when exposed to characters beyond the BMP.

Note however that even using utf-32 does not mean that each 32-bit code point will always represent a character. Due to combining characters, an actual character may consist of several code points. Unicode is never trivial.

BTW, there is probably the same class of bugs on platforms and in applications which expect characters to be 8 bits and which are fed UTF-8.


(6) In Java's case, if you look at their timeline (java.com/en/javahistory/timeline.jsp), you see that the primary development of String happened while Unicode was 16 bits (it changed in 1996). They had to bolt on the ability to handle non-BMP code points, hence the confusion. - Kathy Van Stone
(3) @Kathy: Not really an excuse for C#, though. Generally, I agree, that there should be a CodePoint type, holding a single code point (21 bits), a CodeUnit type, holding a single code unit (16 bits for UTF-16) and a Character type would ideally have to support a complete grapheme. But that makes it functionally equivalent to a String ... - Joey
This answer is almost two years old, but I can't help but comment on it. "Having a type named 'char' which does not always represent a character is pretty confusing." And yet people use it all the time in C and the like to represent integer data that can be stored in a single byte. - JAB
[+17] [2009-06-26 16:09:36] patjbs

I would suggest that thinking UTF-16 might be considered harmful says that you need to gain a greater understanding of unicode [1].

Since I've been downvoted for presenting my opinion on a subjective question, let me elaborate. What exactly is it that bothers you about UTF-16? Would you prefer if everything was encoded in UTF-8? UTF-7? Or how about UCS-4? Of course certain applications are not designed to handle every single character code out there - but they are necessary, especially in today's global information domain, for communication across international boundaries.

But really, if you feel UTF-16 should be considered harmful because it's confusing or can be improperly implemented (unicode certainly can be), then what method of character encoding would be considered non-harmful?

EDIT: To clarify: Why consider improper implementations of a standard a reflection of the quality of the standard itself? As others have subsequently noted, merely because an application uses a tool inappropriately, does not mean that the tool itself is defective. If that were the case, we could probably say things like "var keyword considered harmful", or "threading considered harmful". I think the question confuses the quality and nature of the standard with the difficulties many programmers have in implementing and using it properly, which I feel stem more from their lack of understanding how unicode works, rather than unicode itself.

[1] http://www.joelonsoftware.com/articles/Unicode.html

(6) -1: How about addressing some of Artyom's objections, rather than just patronising him? - RichieHindle
(3) BTW: When I started writing this question I almost wanted to title it "Should the Joel on Software article on Unicode be considered harmful", because there are many mistakes in it. For example: UTF-8 encoding takes up to 4 bytes per character, not 6. Also it does not distinguish between UCS-2 and UTF-16, which are really different -- and actually cause the problems I talk about. - Artyom
My point is that those character points are designed and implemented for specific tasks. The "bugs" you describe are no different than the "bugs" one would encounter if you attempted to give input outside the scope of any application. - patjbs
(1) I agree with the last edit. The simplest example: we still use C and C++ though both languages use pointers and thus are not safe. - Malcolm
(9) Also, it should be noted that when Joel wrote that article, the UTF-8 standard WAS 6 bytes, not 4. RFC 3629 changed the standard to 4 bytes several months AFTER he wrote the article. Like most anything on the internet, it pays to read from more than one source, and to be aware of the age of your sources. The link wasn't intended to be the "end all be all", but rather a starting point. - patjbs
Actually, the problem is not with the standard. It is 100% OK. In fact, there are good implementations that work with UTF-16: ICU, Java Swing etc. But the problem is that there are too many basic bugs in the processing of surrogate pairs when working with UTF-16, such that you should probably never pick UTF-16 as the internal encoding of new applications... Because there are lots of real-life examples where UTF-16's nature causes big troubles: even Stack Overflow can't deal with them - Artyom
Not to try and flog a dead horse here, but if you shouldn't pick utf-16 as the reasonable standard, what should you pick? I'm interested in your perspective on what an acceptable alternative would be. For instance, a lot of my work involves working with ancient languages (greek, aramaic, hebrew, syriac, etc), and work a lot with these oddball unicode characters, so I'm constantly having to transition documents between utf-8, 16 and 32. - patjbs
(4) I would pick UTF-8 or UTF-32: either an encoding that is variable-length in almost all cases (including the BMP), or one that is fixed-length always. - Artyom
(1) Artyom, SO doesn't NEED to use UTF-16, since UTF-8 is the de facto standard for storage and communication of text, while UTF-16 is the de facto standard for processing of text. I don't know of any web page using UTF-16, and it wouldn't be really bold to do so, especially since a really popular language has no Unicode support: PHP (and UTF-16 isn't really easy to deal with; UTF-8 is the standard encoding in most Linux installs, where PHP is commonly run). - iconiK
@iconiK: Don’t be silly. UTF-16 is absolutely not the de facto standard for processing text. Show me a programming language more suited to text processing than Perl, which has always (well, for more than a decade) used abstract characters with an underlying UTF-8 representation internally. Because of this, every Perl program automatically handles all Unicode without the user having to constantly monkey around with idiotic surrogates. The length of a string is its count in code points, not code units. Anything else is sheer stupidity putting the backwards into backwards compatibility. - tchrist
[+10] [2009-06-26 16:16:10] Malcolm

Well, there is an encoding that uses fixed-size symbols. I certainly mean UTF-32. But 4 bytes for each symbol is too much wasted space - why would we use it in everyday situations?

Actually I don't understand why it's such a big deal anyway. Characters outside the BMP are encountered only in very specific cases and areas. According to the official Unicode FAQ [1], "even in East Asian text, the incidence of surrogate pairs should be well less than 1% of all text storage on average". Of course, characters outside the BMP shouldn't be neglected, but most programs which use Unicode are not intended for working with texts containing such characters. That's why, if they don't support them, it is unpleasant, but not a catastrophe.

To my mind, most problems appear from the fact that some software fell behind the Unicode standard, but was not quick to correct the situation. Opera, Windows, Python, Qt - all of them appeared before UTF-16 became widely known or even came into existence. I can confirm, though, that in Opera, Windows Explorer, and Notepad there are no problems with characters outside the BMP anymore (at least on my PC). And if you choose to create your own implementation of a decoder/encoder today, you will almost certainly know about the concept of surrogate pairs in UTF-16.

Also, it is wrong to think that it is only in UTF-16 that it is easy to mess up determining the string length. If you use UTF-8 or UTF-32, you should still be aware that one Unicode code point doesn't necessarily mean one character.

Therefore I don't think it should be considered harmful. UTF-16 is a compromise between simplicity and compactness, and there's no harm in using what is needed where it is needed. In some cases you need to remain compatible with ASCII and you need UTF-8, in some cases you want to work with Han ideographs and conserve space using UTF-16, in some cases you need universal representations of characters using a fixed-length encoding. Use what's more appropriate, just do it properly.

[1] http://www.unicode.org/faq//utf_bom.html#utf16-5

(1) If a program uses UTF-16, shouldn't it be used "correctly"? - Albert
(2) Certainly. But that doesn't mean that if someone can use something incorrectly, we shouldn't use it at all, right? - Malcolm
(15) That's a rather blinkered, Anglo-centric view, Malcolm. Almost on a par with "ASCII is good enough for the USA - the rest of the world should fit in with us". - Jonathan Leffler
(18) Actually I'm from Russia and encounter Cyrillic all the time (including in my own programs), so I don't think that I have an Anglo-centric view. :) Mentioning ASCII is not quite appropriate, because it's not Unicode and doesn't support specific characters. UTF-8, UTF-16, UTF-32 support the very same international character sets, they are just intended for use in their specific areas. And this is exactly my point: if you use mostly English, use UTF-8, if you use mostly Cyrillic, use UTF-16, if you use ancient languages, use UTF-32. Quite simple. - Malcolm
(4) But you might not know in advance whether your application needs to handle characters outside the BMP, if the application accepts data like names. For example, some Asian names might be written with characters outside of the BMP. - JacquesB
(2) Not true, Asian scripts like Japanese, Chinese or Arabic belong to BMP also. BMP itself is actually very large and certainly large enough to include all the scripts used nowadays, it's not like it includes only European scripts or something. No, if you are really going to encounter non-BMP characters, you'll almost definitely know it. - Malcolm
(1) @Malcolm: The issue is more complex than that. See eg. jbrowse.com/text/unij.html - JacquesB
And what did I write wrong? Plane 2 contains only rare or historic symbols, and all other characters fit into the BMP and thus don't need surrogate pairs. - Malcolm
(1) @Malcolm: The issue is that some people apparently have names containing these rare symbols, even though they do not otherwise occur in regular language. - JacquesB
There is, but it's not really a problem specific to Unicode, since standard encodings also don't include these characters. People use homophones and other ways to write such names, and that can be done in any encoding, including Unicode. Probably there are serious difficulties even with inputting rare symbols, so the situation doesn't happen all of a sudden, and users won't be surprised to find out that the program refuses to handle them correctly, if it does. - Malcolm
(6) "Not true, Asian scripts like Japanese, Chinese or Arabic belong to BMP also. BMP itself is actually very large and certainly large enough to include all the scripts used nowadays" This is all so wrong. The BMP contains 65,536 code points. Chinese alone has more than that. The Chinese standard (GB 18030) has more than that. Unicode 5.1 already allocated more than 100,000 characters. - Mihai Nita
It does, but characters outside BMP are not for everyday use, they can be used, for example, for old texts or to write names with rare hieroglyphs in them. And all characters that are commonly used fit into BMP. - Malcolm
(5) @Malcolm: "BMP itself is actually very large and certainly large enough to include all the scripts used nowadays" Not true. At this point Unicode has already allocated about 100K characters, way more than the BMP can accommodate. There are big chunks of Chinese characters outside the BMP. And some of them are required by GB 18030 (a mandatory Chinese standard). Others are required by (non-mandatory) Japanese and Korean standards. So if you try to sell anything in those markets, you need beyond-BMP support. - Mihai Nita
(2) If BMP is that far from having enough capacity to write normally in Chinese, how do they manage to write in such encodings as GBK or GB 2312? It is clear that support of other planes would be useful, but nonetheless. - Malcolm
All the currently used languages in the world fit in the BMP, in 64K code points. Anything outside of the BMP is not for current use of the language; it's for old characters, old languages, exotic characters, or even Klingon. If Chinese and/or Japanese and/or Korean need characters outside the BMP, how did they handle this before Unicode was widely adopted? Nearly all the encodings used in Asia were variable-length, using 8 or 16 bits per character. - iconiK
(1) You DO NOT want Klingon users to be angry at you ;-) - Beni Cherniavsky-Paskin
Why would they be angry? - Malcolm
Anything that uses UTF-16 but can only handle narrow BMP characters is not actually using UTF-16. It is buggy and broken. The premise of the OP is sound: UTF-16 is harmful, because it leads naïve people into writing broken code. Either you can handle Unicode text, or you can’t. If you cannot, then you are picking a subset, which is just as stupid as ASCII-only text processing. - tchrist
@tchrist Actually I think that most problems appear from the dated software which was designed for UCS-2. If you implement the standard today, you will be almost certainly aware that UTF-16 has a concept of surrogate pairs, since it is written about almost everywhere, starting with Wikipedia. And if you are an ignorant developer, you may not implement support for characters outside BMP in UTF-8 as well. Or even treat every text as if each byte represented only one character. - Malcolm
@Malcolm: I have never seen a Go or Perl programmer get BMP screwups, and those languages both use UTF-8 internally. In contrast, in every language that uses UTF-16 I have seen people make major screwups on BMP stuff everywhere I look. Where have you actually seen BMP screwups with UTF-8-based programming languages? That seems utterly bizarre. Yes, you can do UTF-16 right, if you are genius smart and working on ICU. But for regular people, UTF-8 and UTF-32 are way less prone to error. - tchrist
@tchrist This is all too general. First of all, define what a "BMP screwup" is and where you look for them. Also, I've already said that there's a huge difference between something that was designed for UTF-16 and something that was designed for UCS-2 and then switched to UTF-16. And if we put aside API and language deficiencies, I don't really see what's so terribly difficult in handling surrogate pairs, especially in comparison with UTF-8. - Malcolm
@Malcolm: The BMP screwup was iterating through a Java String a char at a time instead of code point at a time in the String class’s equalsIgnoreCase method. The code was never updated for UTF-16 so was stuck in UCS-2 brain damage, so does the wrong thing on anything outside the BMP. Plus it was using casemapping not casefolding, which is bound to get it into trouble. "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", and "𐐔𐐇𐐝𐐀𐐡𐐇𐐓" are all casewise equal to each other, but the dumb UCS-2 Java method was too stupid to know that. ICU gets this right, and they use UTF-16 also, but not stupidly. - tchrist
Yes, equalsIgnoreCase() compares chars, not code points, and it is stated in the docs. I certainly agree that if this method compared strings using code points, it would be much simpler. But this isn't a problem of UTF-16 itself, it is a problem of a platform which was originally designed for UCS-2 - exactly what I'm talking about. - Malcolm
@Malcolm: I don’t believe that merely documenting something not to work on 16/17ths of the Unicode space is acceptable. The correct solution is to make it do so. - tchrist
Consider it non-Unicode, that would clear the confusion. Historically this method is simply not meant to work reliably in conditions where I18N is required, it is not even locale-aware. Java has different facilities to do that. And I have to remind you that we're off the topic, still talking about subtleties of APIs, not about the UTF-16 itself. - Malcolm
[+10] [2009-06-26 17:42:31] JasonTrue

Years of Windows internationalization work especially in East Asian languages might have corrupted me, but I lean toward UTF-16 for internal-to-the-program representations of strings, and UTF-8 for network or file storage of plaintext-like documents. UTF-16 can usually be processed faster on Windows, though, so that's the primary benefit of using UTF-16 in Windows.

Making the leap to UTF-16 dramatically improved the adequacy of average products handling international text. There are only a few narrow cases when the surrogate pairs need to be considered (deletions, insertions, and line breaking, basically) and the average-case is mostly straight pass-through. And unlike earlier encodings like JIS variants, UTF-16 limits surrogate pairs to a very narrow range, so the check is really quick and works forward and backward.

Granted, it's roughly as quick in correctly-encoded UTF-8, too. But there are also many broken UTF-8 applications that incorrectly encode surrogate pairs as two UTF-8 sequences. So UTF-8 doesn't guarantee salvation either.

IE handles surrogate pairs reasonably well since 2000 or so, even though it typically is converting them from UTF-8 pages to an internal UTF-16 representation; I'm fairly sure Firefox has got it right too, so I don't really care what Opera does.

UTF-32 (aka UCS4) is pointless for most applications since it's so space-demanding, so it's pretty much a nonstarter.
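
To illustrate the "narrow range" remark above (a rough sketch of my own, not the answerer's code; names are illustrative and the decoder assumes well-formed input), surrogate classification is a single range check, and the same check works whether you scan forward or backward:

#include <cstddef>

// UTF-16 surrogates occupy one small, dedicated range, so classifying a code
// unit is a single range check.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate (char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Decode the code point starting at index i and advance i by 1 or 2 units.
inline char32_t next_code_point(const char16_t* s, std::size_t& i)
{
    char16_t u = s[i++];
    if (is_high_surrogate(u) && is_low_surrogate(s[i]))   // assumes well-formed input
        return 0x10000 + ((char32_t(u) - 0xD800) << 10) + (char32_t(s[i++]) - 0xDC00);
    return u;   // BMP code point (or an unpaired surrogate passed through)
}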


(1) I didn't quite get your comment on UTF-8 and surrogate pairs. Surrogate pairs is only a concept that is meaningful in the UTF-16 encoding, right? Perhaps code that converts directly from UTF-16 encoding to UTF-8 encoding might get this wrong, and in that case, the problem is incorrectly reading the UTF-16, not writing the UTF-8. Is that right? - Craig McQueen
(6) What Jason's talking about is software that deliberately implements UTF-8 that way: create a surrogate pair, then UTF-8 encode each half separately. The correct name for that encoding is CESU-8, but Oracle (e.g.) misrepresents it as UTF-8. Java employs a similar scheme for object serialization, but it's clearly documented as "Modified UTF-8" and only for internal use. (Now, if we could just get people to READ that documentation and stop using DataInputStream#readUTF() and DataOutputStream#writeUTF() inappropriately...) - Alan Moore
[+9] [2009-06-26 16:49:39] rmeador

My personal choice is to always use UTF-8. It's the standard on Linux for nearly everything. It's backwards compatible with many legacy apps. There is a very minimal overhead in terms of extra space used for non-latin characters vs the other UTF formats, and there is a significant savings in space for latin characters. On the web, latin languages reign supreme, and I think they will for the foreseeable future. And to address one of the main arguments in the original post: nearly every programmer is aware that UTF-8 will sometimes have multi-byte characters in it. Not everyone deals with this correctly, but they are usually aware, which is more than can be said for UTF-16. But, of course, you need to choose the one most appropriate for your application. That's why there's more than one in the first place.
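
As a small aside (my own sketch, not the answerer's code): being "aware of multi-byte characters" in UTF-8 can be as simple as recognizing continuation bytes, which always match the bit pattern 10xxxxxx:

#include <cstddef>
#include <string>

// Count code points in a string assumed to hold valid UTF-8 by skipping
// continuation bytes (those of the form 10xxxxxx).
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)   // not a continuation byte, so it starts a code point
            ++count;
    return count;
}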


(1) UTF-16 is simpler for anything inside BMP, that's why it is used so widely. But I'm a fan of UTF-8 too, it also has no problems with byte order, which works to its advantage. - Malcolm
@Malcolm: UTF-16 also has no problems with byte order as it requires a BOM which specifies the order :-) - Joey
Theoretically, yes. In practice there are such things as, say, UTF-16BE, which means UTF-16 in big endian without BOM. This is not some thing I made up, this is an actual encoding allowed in ID3v2.4 tags (ID3v2 tags suck, but are, unfortunately, widely used). And in such cases you have to define endianness externally, because the text itself doesn't contain BOM. UTF-8 is always written one way and it doesn't have such a problem. - Malcolm
No, UTF-16 is not simpler. It is harder. It misleads and deceives you into thinking it is fixed width. All such code is broken and all the moreso because you don’t notice until it’s too late. CASE IN POINT: I just found yet another stupid UTF-16 bug in the Java core libraries yesterday, this time in String.equalsIgnoreCase, which was left in UCS-2 braindeath buggery, and so fails on 16/17 valid Unicode code points. How long has that code been around? No excuse for it to be buggy. UTF-16 leads to sheer stupidity and an accident waiting to happen. Run screaming from UTF-16. - tchrist
@tchrist One must be a very ignorant developer to not know that UTF-16 is not fixed length. If you start with Wikipedia, you will read the following at the very top: "It produces a variable-length result of either one or two 16-bit code units per code point". Unicode FAQ says the same: unicode.org/faq//utf_bom.html#utf16-1. I don't know, how UTF-16 can deceive anybody if it is written everywhere that it is variable length. As for the method, it was never designed for UTF-16 and shouldn't be considered Unicode, as simple as that. - Malcolm
[+5] [2010-03-18 02:55:43] Tronic

UTF-8 is definitely the way to go, possibly accompanied by UTF-32 for internal use in algorithms that need high performance random access (but that ignores combining chars).

Both UTF-16 and UTF-32 (as well as their LE/BE variants) suffer from endianness issues, so they should never be used externally.


(3) Constant time random access is possible with UTF-8 too, just use code units rather than code points. Maybe you need real random code point access, but I've never seen a use case, and you're just as likely to want random grapheme cluster access instead. - Rhamphoryncus
[+5] [2010-08-01 05:25:20] dan04

I wouldn't necessarily say that UTF-16 is harmful. It's not elegant, but it serves its purpose of backwards compatibility with UCS-2, just like GB18030 does with GB2312, and UTF-8 does with ASCII.

But making a fundamental change to the structure of Unicode in midstream, after Microsoft and Sun had built huge APIs around 16-bit characters, was harmful. The failure to spread awareness of the change was more harmful.


(2) UTF-8 is a superset of ASCII, but UTF-16 is NOT a superset of UCS-2. Although almost a superset, a correct encoding of UCS-2 into UTF-8 results in the abomination known as CESU-8; UCS-2 doesn't have surrogates, just ordinary code points, so they must be translated as such. The real advantage of UTF-16 is that it's easier to upgrade a UCS-2 codebase than a complete rewrite for UTF-8. Funny, huh? - Rhamphoryncus
Sure, technically UTF-16 isn't a superset of UCS-2, but when were U+D800 to U+DFFF ever used for anything except UTF-16 surrogates? - dan04
Doesn't matter. Any processing other than blindly passing through the bytestream requires you to decode the surrogate pairs, which you can't do if you're treating it as UCS-2. - Rhamphoryncus
[+4] [2009-06-26 17:21:53] Nemanja Trifunovic

UTF-16 is the best compromise between handling and space [1] and that's why most major platforms (Win32, Java, .NET) use it for internal representation of strings.

[1] http://publib.boulder.ibm.com/infocenter/iseries/v6r1m0/index.jsp?topic=/nls/rbagsutf16.htm

(2) -1 because UTF-8 is likely to be smaller or not significantly different. For certain Asian scripts UTF-8 is three bytes per glyph while UTF-16 is only two, but this is balanced by UTF-8 being only one byte for ASCII (which does often appear even within asian languages in product names, commands and such things). Further, in the said languages, a glyph conveys more information than a latin character so it is justified for it to take more space. - Tronic
Thanks for the downvote, but I still don't get which part of the "best compromise between handling and space" you consider wrong. Note the word "compromise". Or maybe you don't believe that Win32, Java and .NET (also ICU, btw) use UTF-16 internally? - Nemanja Trifunovic
(5) I would not call combining the worst sides of both options a good compromise. - Tronic
(1) It is the best of both worlds: it is pretty easy to handle, unlike UTF-8, and does not take nearly as much memory as UTF-32. - Nemanja Trifunovic
(4) It's not easier than UTF-8. It's variable-length too. - luiscubal
(1) It is variable-length, but way easier than UTF-8. With UTF-16, the only thing to look out for is surrogate pairs; a UTF-8 code point can be encoded as anywhere between 1 and 4 bytes, plus you need to take care of things such as overlong sequences, etc. Look at this code to see what UTF-8 decoding looks like in C++: utfcpp.svn.sourceforge.net/viewvc/utfcpp/v2_0/source/utf8/… - Nemanja Trifunovic
(8) Leaving debates about the benefits of UTF-16 aside: What you cited is not the reason for Windows, Java or .NET using UTF-16. Windows and Java date back to a time where Unicode was a 16-bit encoding. UCS-2 was a reasonable choice back then. When Unicode became a 21-bit encoding migrating to UTF-16 was the best choice existing platforms had. That had nothing to do with ease of handling or space compromises. It's just a matter of legacy. - Joey
@Johannes: It is a matter of legacy in case of Win32 and Java, but not .NET and especially not Python 3. - Nemanja Trifunovic
(2) .NET inherits the Windows legacy here. - Joey
That's why I said "especially not Python 3", but it would have been perfectly feasible to implement even .NET strings as UTF-8. Of course, interop with Win32 is easier with UTF-16 strings. - Nemanja Trifunovic
Python3 and PHP6 are probably a case of "me too", and we all know how well that went with PHP6. - ninjalj
It does not in practice take a great deal more space in UTF-8 than in UTF-16. See this case study. Python 3 is still unacceptably dodgy because you cannot rely on a wide build, so you never know how to count characters, or whether it takes "." or ".." to match one in a regex. Look at languages that have always used UTF-8, like Go and Perl, and you will see that they have none of the endless insanity of UTF-16. I just found another Java core UTF-16 bug yesterday. - tchrist
[+3] [2010-12-21 00:40:01] Yuhong Bao
Thanks, very good link! I've added it to the issues list in the question. - Artyom
[+2] [2010-10-19 07:06:46] Patrick Horgan

Someone said UCS4 and UTF-32 were the same. Not so, but I know what you mean. One of them is an encoding of the other, though. I wish they'd thought to specify endianness from the start so we wouldn't have the endianness battle fought out here too. Couldn't they have seen that coming? At least UTF-8 is the same everywhere (unless someone is following the original spec with 6 bytes). Sigh. If you use UTF-16 you HAVE to include handling for multibyte chars. You can't go to the Nth character by indexing 2N into a byte array. You have to walk it, or have character indices. Otherwise you've written a bug. The current draft spec of C++ says that UTF-32 and UTF-16 can have little-endian, big-endian, and unspecified variants. Really? If Unicode had specified that everyone had to do little-endian from the beginning then it would have all been simpler. (I would have been fine with big-endian as well.) Instead, some people implemented it one way, some the other, and now we're stuck with silliness for nothing. Sometimes it's embarrassing to be a software engineer.


Unspecified endianness is supposed to include a BOM as the first character, used for determining which way the string should be read. UCS-4 and UTF-32 are indeed the same nowadays, i.e. a numeric UCS value between 0 and 0x10FFFF stored in a 32-bit integer. - Tronic
@Tronic: Technically, this is not true. Although UCS-4 can store any 32-bit integer, UTF-32 is forbidden from storing the non-character code points that are illegal for interchange, such as 0xFFFF and 0xFFFE, and all the surrogates. UTF is a transport encoding, not an internal one. - tchrist
13
[+1] [2010-03-18 01:00:40] David X

UTF-16? Definitely harmful. Just my two cents here, but there are exactly three acceptable encodings for text in a program:

  • ASCII: when dealing with low-level things (e.g. microcontrollers) that can't afford anything better
  • UTF-8: for storage in byte-oriented media such as files
  • Integer codepoints ("CP"?): an array of the largest integers that are convenient for your programming language and platform (decays to ASCII in the limit of low resources). Should be int32 on older computers and int64 on anything with 64-bit addressing. (A decoding sketch follows this list.)

  • Obviously, interfaces to legacy code use whatever encoding is needed to make the old code work right.
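
A minimal sketch of the "array of integer codepoints" option, as referenced in the list above: decode UTF-8 once at the boundary into 32-bit values. Validation (overlong forms, truncated sequences, surrogates) is omitted and the names are illustrative:

    #include <cstddef>
    #include <string>
    #include <vector>

    // Decode well-formed UTF-8 into an array of integer code points.
    std::vector<char32_t> to_code_points(const std::string& utf8) {
        std::vector<char32_t> out;
        for (std::size_t i = 0; i < utf8.size(); ) {
            unsigned char b = static_cast<unsigned char>(utf8[i]);
            std::size_t len;
            char32_t cp;
            if      (b < 0x80) { cp = b;        len = 1; }  // ASCII
            else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 2-byte lead
            else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 3-byte lead
            else               { cp = b & 0x07; len = 4; }  // 4-byte lead
            for (std::size_t k = 1; k < len; ++k)           // continuation bytes
                cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }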


Unicode guarantees there will be no codepoints above U+10FFFF. You are talking about UTF-32/UCS-4 (they are identical). If you are thinking about speed, 32->64 is not 16->32; int64 is not faster for 64-processors. - Simon Buchan
(2) @simon buchan, the U+10FFFF max will go out the window when (not if) they run out of codepoints. That said, using int32 on a p64 system for speed is probably safe, since I doubt they'll exceed U+FFFFFFFF before you're forced to rewrite your code for 128-bit systems around 2050. (That is the point of "use the largest int that is convenient" as opposed to "largest available" (which would probably be int256 or bignums or something).) - David X
@David: Unicode 5.2 encodes 107,361 codepoints. There are 867,169 unused codepoints. "When" is just silly. A Unicode codepoint is defined as a number from 0 to 0x10FFFF, a property which UTF-16 depends upon. (Also, 2050 seems much too low an estimate for 128-bit systems when a 64-bit system can hold the entirety of the Internet in its address space.) - Simon Buchan
@Simon, yes, I was thinking 2050 sounded a bit low for either ETA, my point was that yes, "when" is silly, but it will happen. My point in the original answer, however, was to use an array of ints of whatever size is needed for the largest codepoint you expect to handle. (And yes, I did forget that most p64 systems still use int32 as a primary integer type. I'm not sure why.) - David X
(1) @David: Your "when" was referring to running out of Unicode codepoints, not a 128-bit switch which, yes, will be in the next few centuries. Unlike memory, there is no exponential growth of characters, so the Unicode Consortium has specifically guaranteed they will never allocate a codepoint above U+10FFFF. This really is one of those situations when 21 bits is enough for anybody. - Simon Buchan
(4) @Simon Buchan: At least until first contact. :) - dalle
14
[+1] [2009-08-04 13:15:53] ZZ Coder

This totally depends on your application. For most people, UTF-16BE is a good compromise. Other choices are either too expensive to find characters (UTF-8) or waste too much space (UTF-32 or UCS-4, where each character takes 4 bytes).

With UTF-16BE, you can treat it as UCS-2 (fixed length) in most cases. Characters beyond the BMP are rare in normal applications. You still have the option to handle surrogate pairs if you choose to, say if you are writing an archaeology application.
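
A sketch of the check this implies (illustrative name): the "treat UTF-16 as UCS-2" shortcut is only safe while the text contains no surrogate code units at all.

    #include <cstddef>

    // True if every code unit is a BMP character, i.e. the string can be
    // processed as fixed-length UCS-2 without losing anything.
    bool fits_in_ucs2(const char16_t* s, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i)
            if (s[i] >= 0xD800 && s[i] <= 0xDFFF)  // surrogate => non-BMP text
                return false;
        return true;
    }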


(1) With all widely-used processor architectures being LE (x86, x86-64, IA-64, ARM, etc.), using UTF-16BE would be masochism. - iconiK
(1) Why is it "too expensive" to find characters? - luiscubal
@iconiK ARMs are available in either endianness. The ones with better MMUs allow endianness to be selected on a per-page level, this is similar to PowerPC etc. x86/etc is only the most widely-used in the desktop PC space. - Chris D.
This is all a myth. It is not "too expensive to find characters" in UTF-8. Virtually all string processing is sequential, not random. We lived with O(N) strlen in C forever. This is no hardship at all. - tchrist
15
[+1] [2011-08-01 17:30:55] ninjalj

Yes, absolutely.

Why? It has to do with exercising code.

If you look at these codepoint usage statistics on a large corpus [1] by Tom Christiansen you'll see that trans-8-bit BMP codepoints are used several orders of magnitude more than non-BMP codepoints:

 2663710 U+002013 ‹–›  GC=Pd    EN DASH
 1065594 U+0000A0 ‹ ›  GC=Zs    NO-BREAK SPACE
 1009762 U+0000B1 ‹±›  GC=Sm    PLUS-MINUS SIGN
  784139 U+002212 ‹−›  GC=Sm    MINUS SIGN
  602377 U+002003 ‹ ›  GC=Zs    EM SPACE

 544 U+01D49E ‹𝒞›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL C
 450 U+01D4AF ‹𝒯›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL T
 385 U+01D4AE ‹𝒮›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL S
 292 U+01D49F ‹𝒟›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL D
 285 U+01D4B3 ‹𝒳›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL X

Take the TDD dictum: "Untested code is broken code", and rephrase it as "unexercised code is broken code", and think how often programmers have to deal with non-BMP codepoints.

Bugs related to not dealing with UTF-16 as a variable-width encoding are much more likely to go unnoticed than the equivalent bugs in UTF-8. Some programming languages still don't guarantee to give you UTF-16 rather than UCS-2, and some so-called high-level programming languages offer access to code units instead of code points (even C is supposed to give you access to code points if you use wchar_t, regardless of what some platforms may do).
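
As a small illustration of the wchar_t caveat (not part of the original claim): the C and C++ standards do not fix its width, so whether one wchar_t holds a whole code point is platform-specific.

    // 16-bit wchar_t (Windows) holds UTF-16 code units, not code points;
    // 32-bit wchar_t (most Unix-like systems) can hold any code point.
    constexpr bool wchar_holds_any_code_point() {
        return sizeof(wchar_t) * 8 >= 21;   // U+10FFFF needs 21 bits
    }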

[1] http://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use/5575000#5575000

16
[0] [2011-01-21 12:06:57] Pavel Machyniak

Unicode defines code points up to 0x10FFFF (1,114,112 codes); all applications running in a multilingual environment and dealing with strings, file names, etc. should handle that correctly.

UTF-16: covers only 1,112,064 codes (the 2,048 surrogate values cannot be encoded as code points), although the codes at the end of the Unicode range are in planes 15-16 (Private Use Area). The range cannot grow any further in the future without breaking the UTF-16 design.

UTF-8: in its original form (sequences of up to 6 bytes) could theoretically represent 2^31 = 2,147,483,648 code points; the figure of 2,216,757,376 counts the sequences of each length separately, including overlong forms. The current range of Unicode codes can be represented by sequences of at most 4 bytes. It does not suffer from byte-order problems and is backward compatible with ASCII.

Utf-32: covers theoretically 2^32=4,294,967,296 codes. Currently it is not variable length encoded and probably will not be in the future.

Those facts are self-explanatory. I do not understand advocating general use of UTF-16. It is variable-length encoded (it cannot be accessed by index), it has problems covering the whole Unicode range even at present, byte order must be handled, and so on. I do not see any advantage except that it is natively used in Windows and some other places. When writing multi-platform code it is probably better to use UTF-8 internally and make conversions only at the end points in a platform-dependent way (as already suggested). When direct access by index is necessary and memory is not a problem, UTF-32 should be used.

The main problem is that many programmers dealing with Windows Unicode (= UTF-16) do not even know, or simply ignore, the fact that it is a variable-length encoding.

The usual arrangement on *nix platforms is pretty good: C strings (char *) are interpreted as UTF-8 encoded, and wide C strings (wchar_t *) as UTF-32.
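
A sketch of "UTF-8 internally, convert at the end points" for the Win32 case, using MultiByteToWideChar; error handling is minimal and the helper name is made up:

    #include <string>
    #include <windows.h>

    // Widen a UTF-8 std::string to UTF-16 only at the Win32 boundary.
    std::wstring utf8_to_wide(const std::string& utf8) {
        if (utf8.empty()) return std::wstring();
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                    static_cast<int>(utf8.size()), nullptr, 0);
        if (n <= 0) return std::wstring();            // e.g. invalid UTF-8
        std::wstring wide(n, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                            static_cast<int>(utf8.size()), &wide[0], n);
        return wide;
    }

    // Usage (hypothetical): CreateFileW(utf8_to_wide(path).c_str(), ...);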


(2) Note: UTF-16 does cover all of Unicode, as the Unicode Consortium decided that 0x10FFFF is the top of the Unicode range, defined UTF-8's maximal length as 4 bytes, and explicitly excluded the range 0xD800-0xDFFF from the valid code point range; that range is used for the creation of surrogate pairs. So any valid Unicode text can be represented with any one of these encodings. Also, about growing into the future: it doesn't seem that roughly a million code points will be exhausted in any far future. - Artyom
Exactly, all the encodings cover all the code points; and as for the lack of available codes, I don't see how this can be possible in the foreseeable future. Most supplementary planes are still unused, and even the used ones aren't full yet. So given the total sizes of the known writing systems left, it is very possible that most planes will never be used, unless they start to use code points for something other than writing systems. By the way, UTF-8 originally allowed 6-byte sequences, so it could represent a code space nearly as large as UTF-32's, but what's the point? - Malcolm
Malcolm: Not all encodings cover all code points. UCS-2 is, if you will, the fixed-size subset of UTF-16; it only covers the BMP. - Kerrek SB
@Kerrek: Incorrect: UCS-2 is not a valid Unicode encoding. All UTF-* encodings by definition can represent any Unicode code point that is legal for interchange. UCS-2 can represent far fewer than that, plus a few more. Repeat: UCS-2 is not a valid Unicode encoding, any more than ASCII is. - tchrist
@_tchrist: You're right, UCS-2 isn't an encoding, it's a subset. In that sense, all encodings for Unicode must by definition be able to represent all Unicode codepoints. Fair point. - Kerrek SB
"I do not understand advocating general use of Utf-8. It is variable length encoded (can not be accessed by index)" - Ian Boyd
@Ian Boyd, the need to access a string’s individual character in a random access pattern is incredibly overstated. It is about as common as wanting to compute the diagonal of a matrix of characters, which is super rare. Strings are virtually always processed sequentially, and since accessing UTF-8 char N+1 given that you are at UTF-8 char N is O(1), there is no issue. There is surpassingly little need to make random access of strings. Whether you think it is worth the storage space to go to UTF-32 instead of UTF-8 is your own opinion, but for me, it is altogether a non-issue. - tchrist
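
A sketch of the O(1) step being described, assuming the offset already points at a lead byte of well-formed UTF-8:

    #include <cstddef>

    // Given the byte offset of character N, the offset of character N+1
    // follows from the lead byte alone; no scan from the start is needed.
    std::size_t next_utf8_offset(const unsigned char* s, std::size_t i) {
        unsigned char b = s[i];
        if (b < 0x80) return i + 1;   // ASCII
        if (b < 0xE0) return i + 2;   // 2-byte sequence
        if (b < 0xF0) return i + 3;   // 3-byte sequence
        return i + 4;                 // 4-byte sequence
    }
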
17
[0] [2009-06-26 17:06:58] pjbeardsley

My guesses as to why the Windows API (and presumably the Qt libraries) use UTF-16:

  • UTF-8 wasn't around when these APIs were being developed.
  • The OS needs to do a lookup on the code points to display the glyphs; if the data were passed around internally as UTF-8, it would have to convert from UTF-8 to UTF-16/32 every time it did that for a multibyte character. If the byte stream is stored as "wide" chars in memory, it won't need to do this conversion. So increased memory usage is a tradeoff for decreased conversion work and complexity.

When writing to a stream, however, it's considered best practice to use UTF-8 for the reasons outlined in the Joel article referenced above.


(7) Actually, UTF-8 was developed before UTF-16. In the beginning there was UCS-2, because in those days a Unicode code point was at most 16 bits - Artyom
Actually UTF-8 was around before these APIs were developed too - it was invented in 1992. The very first OS to implement any sort of UCS/Unicode support was Plan9, and it used UTF-8. - R..
18
[0] [2010-08-12 07:15:56] Yuhong Bao

Does anyone consider this déjà vu from when DBCS had the same problems? What about UTF-8 programs that don't really handle 4-byte characters properly? That is why Windows does not support it as the ANSI codepage. One last thing: what version of Windows did you try this on? I just tried this myself on Chinese Windows 2000 (the first version of Windows that claims to support UTF-16) and the standard edit control does handle it correctly.


(3) This happens on Windows XP. Also, you may have accidentally copied a character that is inside the BMP; believe me, it happens a lot. Now, I have never found any UTF-8-enabled software that wasn't able to deal with 4-byte characters, because if you already deal with variable length (and that means you are using anything outside ASCII) then generally you'll do it right, since you respect variable length. That does not happen in the case of UTF-16, as 95% of all programmers are sure that UTF-16 is a fixed-length encoding, and even if they know it isn't, they almost never check the application with text outside the BMP, since it is quite rare. - Artyom
19