I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?"
Why do I ask this question?
How many programmers are aware of the fact that UTF-16 is actually a variable-length encoding? By this I mean that there are code points that, represented as surrogate pairs, take more than one element.
I know; lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, the Win32 APIs, the Qt GUI libraries, the ICU Unicode library, etc. However, with all of that, there are lots of basic bugs in the processing of characters outside the BMP (characters that must be encoded using two UTF-16 elements).
For example, try to edit one of these characters:
You may miss some, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking at them in the Unicode Character reference [5].
For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:
u'X'!=unicode('X','utf-16')
on some platforms when X is a character outside of the BMP. It seems that such bugs are extremely easy to find in many applications that use UTF-16.
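To make the problem concrete, here is a small illustration of my own (not part of the original tests): a single code point outside the BMP, U+1D11E MUSICAL SYMBOL G CLEF, takes two UTF-16 code units, so any code that treats one element as one character is already wrong for it.

#include <cassert>
#include <string>

int main() {
    // U+1D11E lies outside the BMP, so UTF-16 stores it as the surrogate pair 0xD834 0xDD1E.
    std::u16string utf16 = u"\U0001D11E";
    std::u32string utf32 = U"\U0001D11E";
    assert(utf16.size() == 2);                        // two 16-bit code units for one code point
    assert(utf16[0] == 0xD834 && utf16[1] == 0xDD1E); // the surrogate pair itself
    assert(utf32.size() == 1);                        // one 32-bit code unit
    return 0;
}

Anything that indexes, truncates or measures such a string by code units - a backspace handler, a length limit, a substring - will cut the pair in half unless it checks for surrogates.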
So... Do you think that UTF-16 should be considered harmful?
Opinion: Yes, UTF-16 should be considered harmful. The very reason it exists is that, some time ago, there was a misguided belief that widechar was going to be what UCS-4 now is.
Despite the "anglo-centrism" of UTF-8, it should be considered the only useful encoding for text. One can argue that source codes of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed. But when they do, text is not only for human readers.
On the other hand, the UTF-8 overhead is a small price to pay, while it has significant advantages, such as compatibility with unaware code that just passes strings as char*. This is a great thing. There are few useful characters which are SHORTER in UTF-16 than they are in UTF-8.
I believe that all other encodings will die eventually. This entails that MS-Windows, Java, ICU and Python stop using UTF-16 as their favorite. After long research and discussions, the development conventions at my company [1] ban using UTF-16 anywhere except OS API calls, and this despite the importance of performance in our applications and the fact that we use Windows. Conversion functions were developed to convert always-assumed-UTF-8 std::strings to native UTF-16, which Windows itself does not support properly [2].
To people who say "use what needed where it is needed", I say: there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise. In particular, I think adding wchar_t
to C++ was a mistake, and so are the Unicode additions to C++0x. What must be demanded from STL implementations though is that every std::string
or char*
parameter would be considered unicode-compatible.
I am also against the "use what you want" approach. I see no reason for such liberty. There's enough confusion on the subject of text, resulting in all this broken software. Having above said, I am convinced that programmers must finally reach consensus on UTF-8 as one proper way. (I come from a non-ascii-speaking country and grew up on Windows, so I'd be last expected to attack UTF-16 based on religious grounds).
I'd like to share more information on how I do text on Windows, and what I recommend to everyone else for compile-time-checked Unicode correctness, ease of use and better multi-platform portability of the code. The suggestion differs substantially from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research of these recommendations led to the same conclusion. So here goes:
- Do not use wchar_t or std::wstring in any place other than points adjacent to APIs accepting UTF-16.
- Do not use _T("") or L"" UTF-16 literals (these should IMO be taken out of the standard, as part of UTF-16 deprecation).
- Do not use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
- Yet, keep _UNICODE always defined, so that passing char* strings to WinAPI does not get silently compiled.
- std::strings and char* anywhere in the program are considered UTF-8 (if not said otherwise).
- All strings are std::string, though you can pass a char* or a string literal to convert(const std::string &).
- Only use Win32 functions that accept widechars (LPWSTR), never those which accept LPTSTR or LPSTR. Pass parameters this way:
::SetWindowTextW(Utils::convert(someStdString or "string literal").c_str())
(The policy uses the conversion functions below.)
With MFC strings:
CString someoneElse; // something that arrived from MFC. Converted as soon as possible, before passing any further away from the API call:
std::string s = str(boost::format("Hello %s\n") % Convert(someoneElse));
AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK);
Working with files, filenames and fstream on Windows:
- Never pass std::string or const char* filename arguments to the fstream family. MSVC STL does not support UTF-8 arguments, but it has a non-standard extension which should be used as follows:
- Convert std::string arguments to std::wstring with Utils::Convert:
std::ifstream ifs(Utils::Convert("hello"),
                  std::ios_base::in |
                  std::ios_base::binary);
We'll have to manually remove the conversion when MSVC's attitude to fstream changes.
- See the fstream unicode research/discussion case 4215 for more info.
- Avoid fopen() for RAII/OOD reasons. If necessary, use _wfopen() and the WinAPI conventions above.
// For interface to win32 API functions
std::string convert(const std::wstring& str, unsigned int codePage /*= CP_UTF8*/)
{
// Ask me for implementation..
...
}
std::wstring convert(const std::string& str, unsigned int codePage /*= CP_UTF8*/)
{
// Ask me for implementation..
...
}
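The bodies above are elided by the author ("Ask me for implementation"). Purely as an illustration of what they typically look like on Windows - my own sketch, not the author's code, with error handling omitted - they can be built on MultiByteToWideChar and WideCharToMultiByte:

#include <windows.h>
#include <string>

// Sketch only (assumed implementation): UTF-8 (or another code page) -> UTF-16
std::wstring convert(const std::string& str, unsigned int codePage = CP_UTF8)
{
    if (str.empty()) return std::wstring();
    int len = ::MultiByteToWideChar(codePage, 0, str.data(), (int)str.size(), NULL, 0);
    std::wstring result(len, L'\0');
    ::MultiByteToWideChar(codePage, 0, str.data(), (int)str.size(), &result[0], len);
    return result;
}

// Sketch only (assumed implementation): UTF-16 -> UTF-8 (or another code page)
std::string convert(const std::wstring& str, unsigned int codePage = CP_UTF8)
{
    if (str.empty()) return std::string();
    int len = ::WideCharToMultiByte(codePage, 0, str.data(), (int)str.size(), NULL, 0, NULL, NULL);
    std::string result(len, '\0');
    ::WideCharToMultiByte(codePage, 0, str.data(), (int)str.size(), &result[0], len, NULL, NULL);
    return result;
}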
// Interface to MFC
std::string convert(const CString &mfcString)
{
#ifdef UNICODE
return Utils::convert(std::wstring(mfcString.GetString()));
#else
return mfcString.GetString(); // This branch is deprecated.
#endif
}
CString convert(const std::string &s)
{
#ifdef UNICODE
return CString(Utils::convert(s).c_str());
#else
Exceptions::Assert(false, "Unicode policy violation. See W569"); // This branch is deprecated as it does not support unicode
return s.c_str();
#endif
}
[1] http://www.visionmap.com
Unicode codepoints are not characters! Sometimes they are not even glyphs (visual forms).
Some examples:
The only ways to get Unicode editing right are to use a library written by an expert, or to become an expert and write one yourself. If you are just counting codepoints, you are living in a state of sin.
There is a simple rule of thumb on which Unicode Transformation Format (UTF) to use:
- UTF-8 for storage and communication
- UTF-16 for data processing
- you might go with UTF-32 if most of the platform API you use is UTF-32 (common in the UNIX world)
Most systems today use utf-16 (Windows, Mac OS, Java, .NET, ICU, Qt). Also see this document: http://unicode.org/notes/tn12/
Back to "UTF-16 as harmful", I would say: definitely not.
People who are afraid of surrogates (thinking that they transform Unicode into a variable-length encoding) don't understand the other (way bigger) complexities that make mapping between characters and a Unicode code point very complex: combining characters, ligatures, variation selectors, control characters, etc.
Just read this series here http://blogs.msdn.com/michkap/archive/2009/06/29/9800913.aspx and see how UTF-16 becomes an easy problem.
Take the bug in Java's equalsIgnoreCase method (and others in the String class) that would never have been there had Java used either UTF-8 or UTF-32. There are millions of these sleeping bombshells in any code that uses UTF-16, and I am sick and tired of them. UTF-16 is a vicious pox that plagues our software with insidious bugs forever and ever. It is clearly harmful, and should be deprecated and banned. - tchrist
There is nothing wrong with UTF-16 encoding. But languages that treat the 16-bit units as characters should probably be considered badly designed. Having a type named 'char' which does not always represent a character is pretty confusing. Since most developers will expect a char type to represent a code point or character, much code will probably break when exposed to characters beyond the BMP.
Note however that even using utf-32 does not mean that each 32-bit code point will always represent a character. Due to combining characters, an actual character may consist of several code points. Unicode is never trivial.
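As a small illustration of my own (not from the answer above): even in UTF-32, a user-perceived character can span several code points when combining marks are involved.

#include <cassert>
#include <string>

int main() {
    // "é" written as U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT:
    std::u32string e_acute = U"e\u0301";
    assert(e_acute.size() == 2);   // two code points, one grapheme on screen
    return 0;
}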
BTW, there is probably the same class of bugs on platforms and in applications which expect characters to be 8-bit and which are fed UTF-8.
Ideally a language would have a CodePoint type, holding a single code point (21 bits), a CodeUnit type, holding a single code unit (16 bits for UTF-16), and a Character type, which would ideally have to support a complete grapheme. But that makes it functionally equivalent to a String... - Joey
I would suggest that thinking UTF-16 might be considered harmful says that you need to gain a greater understanding of unicode [1].
Since I've been downvoted for presenting my opinion on a subjective question, let me elaborate. What exactly is it that bothers you about UTF-16? Would you prefer it if everything were encoded in UTF-8? UTF-7? Or how about UCS-4? Of course certain applications are not designed to handle every single character code out there - but such encodings are necessary, especially in today's global information domain, for communication across international boundaries.
But really, if you feel UTF-16 should be considered harmful because it's confusing or can be improperly implemented (Unicode certainly can be), then what method of character encoding would be considered non-harmful?
EDIT: To clarify: why consider improper implementations of a standard a reflection of the quality of the standard itself? As others have subsequently noted, merely because an application uses a tool inappropriately does not mean that the tool itself is defective. If that were the case, we could probably say things like "var keyword considered harmful" or "threading considered harmful". I think the question confuses the quality and nature of the standard with the difficulties many programmers have in implementing and using it properly, which I feel stem more from their lack of understanding of how Unicode works than from Unicode itself.
[1] http://www.joelonsoftware.com/articles/Unicode.html
Well, there is an encoding that uses fixed-size symbols. I certainly mean UTF-32. But 4 bytes for each symbol is too much wasted space - why would we use it in everyday situations?
Actually I don't understand why it's such a big deal anyway. Characters outside the BMP are encountered only in very specific cases and areas. According to the official Unicode FAQ [1], "even in East Asian text, the incidence of surrogate pairs should be well less than 1% of all text storage on average". Of course, characters outside the BMP shouldn't be neglected, but most programs which use Unicode are not intended for working with texts containing such characters. That's why, if they don't support them, it is unpleasant, but not a catastrophe.
To my mind, most problems appear from the fact that some software fell behind the Unicode standard but was not quick to correct the situation. Opera, Windows, Python, Qt - all of them appeared before UTF-16 became widely known or even came into existence. I can confirm, though, that in Opera, Windows Explorer and Notepad there are no problems with characters outside the BMP anymore (at least on my PC). And if you choose to create your own implementation of a decoder/encoder today, you will almost certainly know about the concept of surrogate pairs in UTF-16.
Also it is wrong to think that it is easy to get string length wrong only in UTF-16. Even if you use UTF-8 or UTF-32, you still have to be aware that one Unicode code point doesn't necessarily mean one character.
Therefore I don't think it should be considered harmful. UTF-16 is a compromise between simplicity and compactness, and there's no harm in using what is needed where it is needed. In some cases you need to remain compatible with ASCII and you need UTF-8, in some cases you want to work with Han ideographs and conserve space using UTF-16, in some cases you need universal representations of characters using a fixed-length encoding. Use what's more appropriate, just do it properly.
[1] http://www.unicode.org/faq//utf_bom.html#utf16-5
The problem is that Java processes a char at a time instead of a code point at a time in the String class's equalsIgnoreCase method. The code was never updated for UTF-16, so it was stuck in UCS-2 brain damage and does the wrong thing on anything outside the BMP. Plus it was using casemapping, not casefolding, which is bound to get it into trouble. "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", and "𐐔𐐇𐐝𐐀𐐡𐐇𐐓" are all casewise equal to each other, but the dumb UCS-2 Java method was too stupid to know that. ICU gets this right, and they use UTF-16 also, but not stupidly. - tchrist
Years of Windows internationalization work, especially in East Asian languages, might have corrupted me, but I lean toward UTF-16 for internal-to-the-program representations of strings, and UTF-8 for network or file storage of plaintext-like documents. UTF-16 can usually be processed faster on Windows, though, so that's the primary benefit of using UTF-16 in Windows.
Making the leap to UTF-16 dramatically improved the adequacy of average products handling international text. There are only a few narrow cases when the surrogate pairs need to be considered (deletions, insertions, and line breaking, basically) and the average-case is mostly straight pass-through. And unlike earlier encodings like JIS variants, UTF-16 limits surrogate pairs to a very narrow range, so the check is really quick and works forward and backward.
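For what it's worth, here is my own sketch (not from the answer) of the narrow-range check being described: a 16-bit unit can be classified with two comparisons and without any context, which is what makes scanning forward or backward cheap.

#include <cstddef>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate (char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Step back from an arbitrary index to the start of the enclosing code point:
// at most one neighbouring unit has to be inspected.
inline std::size_t code_point_start(const char16_t* buf, std::size_t i) {
    return (i > 0 && is_low_surrogate(buf[i]) && is_high_surrogate(buf[i - 1])) ? i - 1 : i;
}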
Granted, it's roughly as quick in correctly-encoded UTF-8, too. But there are also many broken UTF-8 applications that incorrectly encode surrogate pairs as two separate UTF-8 sequences. So UTF-8 doesn't guarantee salvation either.
IE has handled surrogate pairs reasonably well since 2000 or so, even though it typically converts them from UTF-8 pages to an internal UTF-16 representation; I'm fairly sure Firefox has got it right too, so I don't really care what Opera does.
UTF-32 (aka UCS4) is pointless for most applications since it's so space-demanding, so it's pretty much a nonstarter.
My personal choice is to always use UTF-8. It's the standard on Linux for nearly everything. It's backwards compatible with many legacy apps. There is a very minimal overhead in terms of extra space used for non-Latin characters vs. the other UTF formats, and there is a significant saving in space for Latin characters. On the web, Latin languages reign supreme, and I think they will for the foreseeable future. And to address one of the main arguments in the original post: nearly every programmer is aware that UTF-8 will sometimes have multi-byte characters in it. Not everyone deals with this correctly, but they are usually aware, which is more than can be said for UTF-16. But, of course, you need to choose the one most appropriate for your application. That's why there's more than one in the first place.
UTF-8 is definitely the way to go, possibly accompanied by UTF-32 for internal use in algorithms that need high performance random access (but that ignores combining chars).
Both UTF-16 and UTF-32 (as well as their LE/BE variants) suffer from endianness issues, so they should never be used externally.
I wouldn't necessarily say that UTF-16 is harmful. It's not elegant, but it serves its purpose of backwards compatibility with UCS-2, just like GB18030 does with GB2312, and UTF-8 does with ASCII.
But making a fundamental change to the structure of Unicode in midstream, after Microsoft and Sun had built huge APIs around 16-bit characters, was harmful. The failure to spread awareness of the change was more harmful.
UTF-16 is the best compromise between handling and space [1] and that's why most major platforms (Win32, Java, .NET) use it for internal representation of strings.
[1] http://publib.boulder.ibm.com/infocenter/iseries/v6r1m0/index.jsp?topic=/nls/rbagsutf16.htm
Add this to the list: http://blogs.msdn.com/b/michkap/archive/2010/12/15/10105168.aspx
Someone said UCS-4 and UTF-32 were the same. Not so, but I know what you mean; one of them is an encoding of the other, though. I wish they'd thought to specify endianness from the start, so we wouldn't have the endianness battle fought out here too. Couldn't they have seen that coming? At least UTF-8 is the same everywhere (unless someone is following the original spec with 6 bytes). Sigh. If you use UTF-16 you HAVE to include handling for multi-code-unit characters. You can't go to the Nth character by indexing 2N into a byte array. You have to walk it, or have character indices. Otherwise you've written a bug. The current draft spec of C++ says that UTF-32 and UTF-16 can have little-endian, big-endian, and unspecified variants. Really? If Unicode had specified that everyone had to do little-endian from the beginning, then it would have all been simpler. (I would have been fine with big-endian as well.) Instead, some people implemented it one way, some the other, and now we're stuck with silliness for nothing. Sometimes it's embarrassing to be a software engineer.
UTF-16? Definitely harmful. Just my grain of salt here, but there are exactly three acceptable encodings for text in a program:
integer codepoints ("CP"?): an array of the largest integers that are convenient for your programming language and platform (decays to ASCII in the limit of low resorces). Should be int32 on older computers and int64 on anything with 64-bit addressing.
Obviously interfaces to legacy code use what encoding is needed to make the old code work right.
Codepoints only go up to U+10FFFF. You are talking about UTF-32/UCS-4 (they are identical). If you are thinking about speed: 32->64 is not like 16->32; int64 is not faster on 64-bit processors. - Simon Buchan
The U+10FFFF max will go out the window when (not if) they run out of codepoints. That said, using int32 on a p64 system for speed is probably safe, since I doubt they'll exceed U+FFFFFFFF before you're forced to rewrite your code for 128-bit systems around 2050. (That is the point of "use the largest int that is convenient" as opposed to "largest available" (which would probably be int256 or bignums or something).) - David X
They won't run out: Unicode is limited to U+10FFFF. This really is one of those situations when 21 bits is enough for anybody. - Simon Buchan
This totally depends on your application. For most people, UTF-16BE is a good compromise. Other choices either make it too expensive to find characters (UTF-8) or waste too much space (UTF-32 or UCS-4, where each character takes 4 bytes).
With UTF-16BE, you can treat it as UCS-2 (fixed length) in most cases. Characters beyond the BMP are rare in normal applications. You still have the option to handle surrogate pairs if you choose to - say, if you are writing an archaeology application.
People have coped with a variable-width strlen in C forever. This is no hardship at all. - tchrist
Yes, absolutely.
Why? It has to do with exercising code.
If you look at these codepoint usage statistics on a large corpus [1] by Tom Christiansen, you'll see that trans-8-bit BMP codepoints are used several orders of magnitude more than non-BMP codepoints:
2663710 U+002013 ‹–› GC=Pd EN DASH
1065594 U+0000A0 ‹ › GC=Zs NO-BREAK SPACE
1009762 U+0000B1 ‹±› GC=Sm PLUS-MINUS SIGN
784139 U+002212 ‹−› GC=Sm MINUS SIGN
602377 U+002003 ‹ › GC=Zs EM SPACE
544 U+01D49E ‹𝒞› GC=Lu MATHEMATICAL SCRIPT CAPITAL C
450 U+01D4AF ‹𝒯› GC=Lu MATHEMATICAL SCRIPT CAPITAL T
385 U+01D4AE ‹𝒮› GC=Lu MATHEMATICAL SCRIPT CAPITAL S
292 U+01D49F ‹𝒟› GC=Lu MATHEMATICAL SCRIPT CAPITAL D
285 U+01D4B3 ‹𝒳› GC=Lu MATHEMATICAL SCRIPT CAPITAL X
Take the TDD dictum: "Untested code is broken code", and rephrase it as "unexercised code is broken code", and think how often programmers have to deal with non-BMP codepoints.
Bugs related to not dealing with UTF-16 as a variable-width encoding are much more likely to go unnoticed than the equivalent bugs in UTF-8. Some programming languages still don't guarantee to give you UTF-16 instead of UCS-2, and some so-called high-level programming languages offer access to code units instead of code points (even C is supposed to give you access to codepoints if you use wchar_t, regardless of what some platforms may do).
Unicode defines code points up to 0x10FFFF (1,114,112 codes). All applications running in a multilingual environment and dealing with strings, file names, etc. should handle that correctly.
UTF-16: covers only 1,112,064 codes (the remaining values are reserved for the surrogates themselves), and the codes at the end of that range belong to planes 15-16 (Private Use Area). It cannot grow any further in the future without breaking the UTF-16 concept.
UTF-8: in its original definition (sequences of up to 6 bytes) it could theoretically cover code points up to 2^31 (2,147,483,648 codes); the current range of Unicode codes can be represented by a sequence of at most 4 bytes (see the sketch after this list). It does not suffer from the byte-order problem, and it is "compatible" with ASCII.
UTF-32: covers theoretically 2^32 = 4,294,967,296 codes. Currently it is not variable-length encoded and probably will not be in the future.
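As an aside of mine (a sketch, not part of the answer), the 4-byte claim in the UTF-8 item above is easy to verify: each extra byte adds 5-6 payload bits, and a 4-byte sequence already carries 21 bits, enough for U+10FFFF.

#include <string>

// Sketch encoder for valid code points up to U+10FFFF (no error handling).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                         // 1 byte: 7 payload bits
        out += char(cp);
    } else if (cp < 0x800) {                 // 2 bytes: 11 payload bits
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {               // 3 bytes: 16 payload bits
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {                                 // 4 bytes: 21 payload bits, up to U+10FFFF
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
    return out;
}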
Those facts are self-explanatory. I do not understand advocating general use of UTF-16. It is variable-length encoded (it cannot be accessed by index), it has problems covering the whole Unicode range even at present, byte order must be handled, etc. I do not see any advantage except that it is natively used in Windows and some other places. When writing multi-platform code it is probably better to use UTF-8 natively and make conversions only at the end points in a platform-dependent way (as already suggested). When direct access by index is necessary and memory is not a problem, UTF-32 should be used.
The main problem is that many programmers dealing with Windows Unicode = UTF-16 do not even know, or simply ignore, the fact that it is variable-length encoded.
The way it is usually done on *nix platforms is pretty good: C strings (char *) are interpreted as UTF-8 encoded, and wide C strings (wchar_t *) as UTF-32.
My guesses as to why the Windows API (and presumably the Qt libraries) uses UTF-16:
When writing to a stream, however, it's considered best practice to use UTF-8 for the reasons outlined in the Joel article referenced above.
Does anyone consider this déjà vu from when DBCS had the same problems? What about UTF-8 programs that don't really handle 4-byte chars properly? That is why Windows does not support it as the ANSI codepage. One last thing: what version of Windows did you try this on? I just tried this myself on Chinese Windows 2000 (the first version of Windows that claims to support UTF-16), and the standard edit control does handle it correctly.