I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?"
Why do I ask this question?
How many programmers are aware of the fact that UTF-16 is actually a variable-length encoding? By this I mean that there are code points that, represented as surrogate pairs, take more than one element.
I know; lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, the Win32 APIs, the Qt GUI libraries, the ICU Unicode library, etc. However, with all of that, there are lots of basic bugs in the processing of characters outside the BMP (characters that must be encoded using two UTF-16 elements).
For example, try to edit one of these characters:
You may miss some, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking at them in the Unicode Character reference [5].
For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:
u'X'!=unicode('X','utf-16')
on some platforms when X is a character outside of the BMP. It seems that such bugs are extremely easy to find in many applications that use UTF-16.
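To make the problem concrete, here is a small illustration of my own (not part of the original tests): a single code point outside the BMP, U+1D11E MUSICAL SYMBOL G CLEF, takes two UTF-16 code units, so any code that treats one element as one character is already wrong for it.

#include <cassert>
#include <string>

int main() {
    // U+1D11E lies outside the BMP, so UTF-16 stores it as the surrogate pair 0xD834 0xDD1E.
    std::u16string utf16 = u"\U0001D11E";
    std::u32string utf32 = U"\U0001D11E";
    assert(utf16.size() == 2);                        // two 16-bit code units for one code point
    assert(utf16[0] == 0xD834 && utf16[1] == 0xDD1E); // the surrogate pair itself
    assert(utf32.size() == 1);                        // one 32-bit code unit
    return 0;
}

Anything that indexes, truncates or measures such a string by code units - a backspace handler, a length limit, a substring - will cut the pair in half unless it checks for surrogates.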
So... Do you think that UTF-16 should be considered harmful?
Opinion: Yes, UTF-16 should be considered harmful. The very reason it exists is that, some time ago, there was a misguided belief that widechar was going to be what UCS-4 now is.
Despite the "anglo-centrism" of UTF-8, it should be considered the only useful encoding for text. One can argue that source codes of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed. But when they do, text is not only for human readers.
On the other hand, the UTF-8 overhead is a small price to pay, while it has significant advantages, such as compatibility with unaware code that just passes strings as char*. This is a great thing. There are few useful characters which are SHORTER in UTF-16 than they are in UTF-8.
I believe that all other encodings will die eventually. This entails that MS-Windows, Java, ICU and Python stop using UTF-16 as their favorite. After long research and discussions, the development conventions at my company [1] ban using UTF-16 anywhere except OS API calls, and this despite the importance of performance in our applications and the fact that we use Windows. Conversion functions were developed to convert always-assumed-UTF-8 std::strings to native UTF-16, which Windows itself does not support properly [2].
To people who say "use what needed where it is needed", I say: there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise. In particular, I think adding wchar_t
to C++ was a mistake, and so are the Unicode additions to C++0x. What must be demanded from STL implementations though is that every std::string
or char*
parameter would be considered unicode-compatible.
I am also against the "use what you want" approach. I see no reason for such liberty. There's enough confusion on the subject of text, resulting in all this broken software. Having above said, I am convinced that programmers must finally reach consensus on UTF-8 as one proper way. (I come from a non-ascii-speaking country and grew up on Windows, so I'd be last expected to attack UTF-16 based on religious grounds).
I'd like to share more information on how I do text on Windows, and what I recommend to everyone else for compile-time-checked Unicode correctness, ease of use and better multi-platform portability of the code. The suggestion differs substantially from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research of these recommendations led to the same conclusion. So here goes:
- Do not use wchar_t or std::wstring in any place other than points adjacent to APIs accepting UTF-16.
- Do not use _T("") or L"" UTF-16 literals (these should IMO be taken out of the standard, as part of UTF-16 deprecation).
- Do not use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
- Yet, keep _UNICODE always defined, so that passing char* strings to WinAPI does not get silently compiled.
- std::strings and char* anywhere in the program are considered UTF-8 (if not said otherwise).
- All strings are std::string, though you can pass a char* or a string literal to convert(const std::string &).
- Only use Win32 functions that accept widechars (LPWSTR), never those which accept LPTSTR or LPSTR. Pass parameters this way:
::SetWindowTextW(Utils::convert(someStdString or "string literal").c_str())
(The policy uses the conversion functions below.)
With MFC strings:
CString someoneElse; // something that arrived from MFC. Converted as soon as possible, before passing any further away from the API call:
std::string s = str(boost::format("Hello %s\n") % Convert(someoneElse));
AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK);
Working with files, filenames and fstream on Windows:
- Never pass std::string or const char* filename arguments to the fstream family. MSVC STL does not support UTF-8 arguments, but it has a non-standard extension which should be used as follows:
- Convert std::string arguments to std::wstring with Utils::Convert:
std::ifstream ifs(Utils::Convert("hello"),
                  std::ios_base::in |
                  std::ios_base::binary);
We'll have to manually remove the conversion when MSVC's attitude to fstream changes.
- See the fstream unicode research/discussion case 4215 for more info.
- Avoid fopen() for RAII/OOD reasons. If necessary, use _wfopen() and the WinAPI conventions above.
// For interface to win32 API functions
std::string convert(const std::wstring& str, unsigned int codePage /*= CP_UTF8*/)
{
// Ask me for implementation..
...
}
std::wstring convert(const std::string& str, unsigned int codePage /*= CP_UTF8*/)
{
// Ask me for implementation..
...
}
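The bodies above are elided by the author ("Ask me for implementation"). Purely as an illustration of what they typically look like on Windows - my own sketch, not the author's code, with error handling omitted - they can be built on MultiByteToWideChar and WideCharToMultiByte:

#include <windows.h>
#include <string>

// Sketch only (assumed implementation): UTF-8 (or another code page) -> UTF-16
std::wstring convert(const std::string& str, unsigned int codePage = CP_UTF8)
{
    if (str.empty()) return std::wstring();
    int len = ::MultiByteToWideChar(codePage, 0, str.data(), (int)str.size(), NULL, 0);
    std::wstring result(len, L'\0');
    ::MultiByteToWideChar(codePage, 0, str.data(), (int)str.size(), &result[0], len);
    return result;
}

// Sketch only (assumed implementation): UTF-16 -> UTF-8 (or another code page)
std::string convert(const std::wstring& str, unsigned int codePage = CP_UTF8)
{
    if (str.empty()) return std::string();
    int len = ::WideCharToMultiByte(codePage, 0, str.data(), (int)str.size(), NULL, 0, NULL, NULL);
    std::string result(len, '\0');
    ::WideCharToMultiByte(codePage, 0, str.data(), (int)str.size(), &result[0], len, NULL, NULL);
    return result;
}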
// Interface to MFC
std::string convert(const CString &mfcString)
{
#ifdef UNICODE
return Utils::convert(std::wstring(mfcString.GetString()));
#else
return mfcString.GetString(); // This branch is deprecated.
#endif
}
CString convert(const std::string &s)
{
#ifdef UNICODE
return CString(Utils::convert(s).c_str());
#else
Exceptions::Assert(false, "Unicode policy violation. See W569"); // This branch is deprecated as it does not support unicode
return s.c_str();
#endif
}
[1] http://www.visionmap.com
Unicode codepoints are not characters! Sometimes they are not even glyphs (visual forms).
Some examples:
The only ways to get Unicode editing right are to use a library written by an expert, or to become an expert and write one yourself. If you are just counting codepoints, you are living in a state of sin.
There is a simple rule of thumb on which Unicode Transformation Format (UTF) to use:
- UTF-8 for storage and communication
- UTF-16 for data processing
- you might go with UTF-32 if most of the platform API you use is UTF-32 (common in the UNIX world)
Most systems today use utf-16 (Windows, Mac OS, Java, .NET, ICU, Qt). Also see this document: http://unicode.org/notes/tn12/
Back to "UTF-16 as harmful", I would say: definitely not.
People who are afraid of surrogates (thinking that they transform Unicode into a variable-length encoding) don't understand the other (way bigger) complexities that make mapping between characters and a Unicode code point very complex: combining characters, ligatures, variation selectors, control characters, etc.
Just read this series here http://blogs.msdn.com/michkap/archive/2009/06/29/9800913.aspx and see how UTF-16 becomes an easy problem.
Take the bug in Java's equalsIgnoreCase method (and others in the String class) that would never have been there had Java used either UTF-8 or UTF-32. There are millions of these sleeping bombshells in any code that uses UTF-16, and I am sick and tired of them. UTF-16 is a vicious pox that plagues our software with insidious bugs forever and ever. It is clearly harmful, and should be deprecated and banned. - tchrist
There is nothing wrong with UTF-16 encoding. But languages that treat the 16-bit units as characters should probably be considered badly designed. Having a type named 'char' which does not always represent a character is pretty confusing. Since most developers will expect a char type to represent a code point or character, much code will probably break when exposed to characters beyond the BMP.
Note however that even using utf-32 does not mean that each 32-bit code point will always represent a character. Due to combining characters, an actual character may consist of several code points. Unicode is never trivial.
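As a small illustration of my own (not from the answer above): even in UTF-32, a user-perceived character can span several code points when combining marks are involved.

#include <cassert>
#include <string>

int main() {
    // "é" written as U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT:
    std::u32string e_acute = U"e\u0301";
    assert(e_acute.size() == 2);   // two code points, one grapheme on screen
    return 0;
}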
BTW, there is probably the same class of bugs on platforms and in applications which expect characters to be 8-bit and which are fed UTF-8.
Ideally a language would have a CodePoint type, holding a single code point (21 bits), a CodeUnit type, holding a single code unit (16 bits for UTF-16), and a Character type, which would ideally have to support a complete grapheme. But that makes it functionally equivalent to a String... - Joey
I would suggest that thinking UTF-16 might be considered harmful says that you need to gain a greater understanding of unicode [1].
Since I've been downvoted for presenting my opinion on a subjective question, let me elaborate. What exactly is it that bothers you about UTF-16? Would you prefer it if everything were encoded in UTF-8? UTF-7? Or how about UCS-4? Of course certain applications are not designed to handle every single character code out there - but such encodings are necessary, especially in today's global information domain, for communication across international boundaries.
But really, if you feel UTF-16 should be considered harmful because it's confusing or can be improperly implemented (Unicode certainly can be), then what method of character encoding would be considered non-harmful?
EDIT: To clarify: why consider improper implementations of a standard a reflection of the quality of the standard itself? As others have subsequently noted, merely because an application uses a tool inappropriately does not mean that the tool itself is defective. If that were the case, we could probably say things like "var keyword considered harmful" or "threading considered harmful". I think the question confuses the quality and nature of the standard with the difficulties many programmers have in implementing and using it properly, which I feel stem more from their lack of understanding of how Unicode works than from Unicode itself.
[1] http://www.joelonsoftware.com/articles/Unicode.html
Well, there is an encoding that uses fixed-size symbols. I certainly mean UTF-32. But 4 bytes for each symbol is too much wasted space - why would we use it in everyday situations?
Actually I don't understand why it's such a big deal anyway. Characters outside the BMP are encountered only in very specific cases and areas. According to the official Unicode FAQ [1], "even in East Asian text, the incidence of surrogate pairs should be well less than 1% of all text storage on average". Of course, characters outside the BMP shouldn't be neglected, but most programs which use Unicode are not intended for working with texts containing such characters. That's why, if they don't support them, it is unpleasant, but not a catastrophe.
To my mind, most problems appear from the fact that some software fell behind the Unicode standard but was not quick to correct the situation. Opera, Windows, Python, Qt - all of them appeared before UTF-16 became widely known or even came into existence. I can confirm, though, that in Opera, Windows Explorer and Notepad there are no problems with characters outside the BMP anymore (at least on my PC). And if you choose to create your own implementation of a decoder/encoder today, you will almost certainly know about the concept of surrogate pairs in UTF-16.
Also it is wrong to think that it is easy to get string length wrong only in UTF-16. Even if you use UTF-8 or UTF-32, you still have to be aware that one Unicode code point doesn't necessarily mean one character.
Therefore I don't think it should be considered harmful. UTF-16 is a compromise between simplicity and compactness, and there's no harm in using what is needed where it is needed. In some cases you need to remain compatible with ASCII and you need UTF-8, in some cases you want to work with Han ideographs and conserve space using UTF-16, in some cases you need universal representations of characters using a fixed-length encoding. Use what's more appropriate, just do it properly.
[1] http://www.unicode.org/faq//utf_bom.html#utf16-5
The problem is that Java processes a char at a time instead of a code point at a time in the String class's equalsIgnoreCase method. The code was never updated for UTF-16, so it was stuck in UCS-2 brain damage and does the wrong thing on anything outside the BMP. Plus it was using casemapping, not casefolding, which is bound to get it into trouble. "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", and "𐐔𐐇𐐝𐐀𐐡𐐇𐐓" are all casewise equal to each other, but the dumb UCS-2 Java method was too stupid to know that. ICU gets this right, and they use UTF-16 also, but not stupidly. - tchrist
Years of Windows internationalization work, especially in East Asian languages, might have corrupted me, but I lean toward UTF-16 for internal-to-the-program representations of strings, and UTF-8 for network or file storage of plaintext-like documents. UTF-16 can usually be processed faster on Windows, though, so that's the primary benefit of using UTF-16 in Windows.
Making the leap to UTF-16 dramatically improved the adequacy of average products handling international text. There are only a few narrow cases when the surrogate pairs need to be considered (deletions, insertions, and line breaking, basically) and the average-case is mostly straight pass-through. And unlike earlier encodings like JIS variants, UTF-16 limits surrogate pairs to a very narrow range, so the check is really quick and works forward and backward.
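For what it's worth, here is my own sketch (not from the answer) of the narrow-range check being described: a 16-bit unit can be classified with two comparisons and without any context, which is what makes scanning forward or backward cheap.

#include <cstddef>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate (char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Step back from an arbitrary index to the start of the enclosing code point:
// at most one neighbouring unit has to be inspected.
inline std::size_t code_point_start(const char16_t* buf, std::size_t i) {
    return (i > 0 && is_low_surrogate(buf[i]) && is_high_surrogate(buf[i - 1])) ? i - 1 : i;
}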
Granted, it's roughly as quick in correctly-encoded UTF-8, too. But there are also many broken UTF-8 applications that incorrectly encode surrogate pairs as two separate UTF-8 sequences. So UTF-8 doesn't guarantee salvation either.
IE has handled surrogate pairs reasonably well since 2000 or so, even though it typically converts them from UTF-8 pages to an internal UTF-16 representation; I'm fairly sure Firefox has got it right too, so I don't really care what Opera does.
UTF-32 (aka UCS4) is pointless for most applications since it's so space-demanding, so it's pretty much a nonstarter.
My personal choice is to always use UTF-8. It's the standard on Linux for nearly everything. It's backwards compatible with many legacy apps. There is a very minimal overhead in terms of extra space used for non-Latin characters vs. the other UTF formats, and there is a significant saving in space for Latin characters. On the web, Latin languages reign supreme, and I think they will for the foreseeable future. And to address one of the main arguments in the original post: nearly every programmer is aware that UTF-8 will sometimes have multi-byte characters in it. Not everyone deals with this correctly, but they are usually aware, which is more than can be said for UTF-16. But, of course, you need to choose the one most appropriate for your application. That's why there's more than one in the first place.
UTF-8 is definitely the way to go, possibly accompanied by UTF-32 for internal use in algorithms that need high performance random access (but that ignores combining chars).
Both UTF-16 and UTF-32 (as well as their LE/BE variants) suffer from endianness issues, so they should never be used externally.
I wouldn't necessarily say that UTF-16 is harmful. It's not elegant, but it serves its purpose of backwards compatibility with UCS-2, just like GB18030 does with GB2312, and UTF-8 does with ASCII.
But making a fundamental change to the structure of Unicode in midstream, after Microsoft and Sun had built huge APIs around 16-bit characters, was harmful. The failure to spread awareness of the change was more harmful.
UTF-16 is the best compromise between handling and space [1] and that's why most major platforms (Win32, Java, .NET) use it for internal representation of strings.
[1] http://publib.boulder.ibm.com/infocenter/iseries/v6r1m0/index.jsp?topic=/nls/rbagsutf16.htm
Add this to the list: http://blogs.msdn.com/b/michkap/archive/2010/12/15/10105168.aspx
Someone said UCS-4 and UTF-32 were the same. Not so, but I know what you mean; one of them is an encoding of the other, though. I wish they'd thought to specify endianness from the start, so we wouldn't have the endianness battle fought out here too. Couldn't they have seen that coming? At least UTF-8 is the same everywhere (unless someone is following the original spec with 6 bytes). Sigh. If you use UTF-16 you HAVE to include handling for multi-code-unit characters. You can't go to the Nth character by indexing 2N into a byte array. You have to walk it, or have character indices. Otherwise you've written a bug. The current draft spec of C++ says that UTF-32 and UTF-16 can have little-endian, big-endian, and unspecified variants. Really? If Unicode had specified that everyone had to do little-endian from the beginning, then it would have all been simpler. (I would have been fine with big-endian as well.) Instead, some people implemented it one way, some the other, and now we're stuck with silliness for nothing. Sometimes it's embarrassing to be a software engineer.
UTF-16? Definitely harmful. Just my grain of salt here, but there are exactly three acceptable encodings for text in a program:
integer codepoints ("CP"?): an array of the largest integers that are convenient for your programming language and platform (decays to ASCII in the limit of low resorces). Should be int32 on older computers and int64 on anything with 64-bit addressing.
Obviously interfaces to legacy code use what encoding is needed to make the old code work right.
Codepoints only go up to U+10FFFF. You are talking about UTF-32/UCS-4 (they are identical). If you are thinking about speed: 32->64 is not like 16->32; int64 is not faster on 64-bit processors. - Simon Buchan
The U+10FFFF max will go out the window when (not if) they run out of codepoints. That said, using int32 on a p64 system for speed is probably safe, since I doubt they'll exceed U+FFFFFFFF before you're forced to rewrite your code for 128-bit systems around 2050. (That is the point of "use the largest int that is convenient" as opposed to "largest available" (which would probably be int256 or bignums or something).) - David X
They won't run out: Unicode is limited to U+10FFFF. This really is one of those situations when 21 bits is enough for anybody. - Simon Buchan
This totally depends on your application. For most people, UTF-16BE is a good compromise. Other choices either make it too expensive to find characters (UTF-8) or waste too much space (UTF-32 or UCS-4, where each character takes 4 bytes).
With UTF-16BE, you can treat it as UCS-2 (fixed length) in most cases. Characters beyond the BMP are rare in normal applications. You still have the option to handle surrogate pairs if you choose to - say, if you are writing an archaeology application.
People have coped with a variable-width strlen in C forever. This is no hardship at all. - tchrist
Yes, absolutely.
Why? It has to do with exercising code.
If you look at these codepoint usage statistics on a large corpus [1] by Tom Christiansen, you'll see that trans-8-bit BMP codepoints are used several orders of magnitude more than non-BMP codepoints:
2663710 U+002013 ‹–› GC=Pd EN DASH
1065594 U+0000A0 ‹ › GC=Zs NO-BREAK SPACE
1009762 U+0000B1 ‹±› GC=Sm PLUS-MINUS SIGN
784139 U+002212 ‹−› GC=Sm MINUS SIGN
602377 U+002003 ‹ › GC=Zs EM SPACE
544 U+01D49E ‹𝒞› GC=Lu MATHEMATICAL SCRIPT CAPITAL C
450 U+01D4AF ‹𝒯› GC=Lu MATHEMATICAL SCRIPT CAPITAL T
385 U+01D4AE ‹𝒮› GC=Lu MATHEMATICAL SCRIPT CAPITAL S
292 U+01D49F ‹𝒟› GC=Lu MATHEMATICAL SCRIPT CAPITAL D
285 U+01D4B3 ‹𝒳› GC=Lu MATHEMATICAL SCRIPT CAPITAL X
Take the TDD dictum: "Untested code is broken code", and rephrase it as "unexercised code is broken code", and think how often programmers have to deal with non-BMP codepoints.
Bugs related to not dealing with UTF-16 as a variable-width encoding are much more likely to go unnoticed than the equivalent bugs in UTF-8. Some programming languages still don't guarantee to give you UTF-16 instead of UCS-2, and some so-called high-level programming languages offer access to code units instead of code points (even C is supposed to give you access to codepoints if you use wchar_t, regardless of what some platforms may do).
Unicode defines code points up to 0x10FFFF (1,114,112 codes). All applications running in a multilingual environment and dealing with strings, file names, etc. should handle that correctly.
UTF-16: covers only 1,112,064 codes (the remaining values are reserved for the surrogates themselves), and the codes at the end of that range belong to planes 15-16 (Private Use Area). It cannot grow any further in the future without breaking the UTF-16 concept.
UTF-8: in its original definition (sequences of up to 6 bytes) it could theoretically cover code points up to 2^31 (2,147,483,648 codes); the current range of Unicode codes can be represented by a sequence of at most 4 bytes (see the sketch after this list). It does not suffer from the byte-order problem, and it is "compatible" with ASCII.
UTF-32: covers theoretically 2^32 = 4,294,967,296 codes. Currently it is not variable-length encoded and probably will not be in the future.
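As an aside of mine (a sketch, not part of the answer), the 4-byte claim in the UTF-8 item above is easy to verify: each extra byte adds 5-6 payload bits, and a 4-byte sequence already carries 21 bits, enough for U+10FFFF.

#include <string>

// Sketch encoder for valid code points up to U+10FFFF (no error handling).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                         // 1 byte: 7 payload bits
        out += char(cp);
    } else if (cp < 0x800) {                 // 2 bytes: 11 payload bits
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {               // 3 bytes: 16 payload bits
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {                                 // 4 bytes: 21 payload bits, up to U+10FFFF
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
    return out;
}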
Those facts are self-explanatory. I do not understand advocating general use of UTF-16. It is variable-length encoded (it cannot be accessed by index), it has problems covering the whole Unicode range even at present, byte order must be handled, etc. I do not see any advantage except that it is natively used in Windows and some other places. When writing multi-platform code it is probably better to use UTF-8 natively and make conversions only at the end points in a platform-dependent way (as already suggested). When direct access by index is necessary and memory is not a problem, UTF-32 should be used.
The main problem is that many programmers dealing with Windows Unicode = UTF-16 do not even know, or simply ignore, the fact that it is variable-length encoded.
The way it is usually done on *nix platforms is pretty good: C strings (char *) are interpreted as UTF-8 encoded, and wide C strings (wchar_t *) as UTF-32.
My guesses as to why the Windows API (and presumably the Qt libraries) uses UTF-16:
When writing to a stream, however, it's considered best practice to use UTF-8 for the reasons outlined in the Joel article referenced above.
Does anyone consider this déjà vu from when DBCS had the same problems? What about UTF-8 programs that don't really handle 4-byte chars properly? That is why Windows does not support it as the ANSI codepage. One last thing: what version of Windows did you try this on? I just tried this myself on Chinese Windows 2000 (the first version of Windows that claims to support UTF-16), and the standard edit control does handle it correctly.