Copyright © 2001 John O’Conner
Unicode is a character set supported across many commonly used software applications and operating systems. For example, many popular web browser, e-mail, and word processing applications support Unicode. Supporting operating systems include Solaris, Linux, Windows 2000, and Macintosh’s OS X. Applications that support Unicode are often capable of displaying multiple languages and scripts within the same document. In a multilingual office or business setting, Unicode’s importance as a universal character set cannot be overlooked.
Unicode is the only practical character set option for applications that support multilingual documents. However, applications do have several options for how they encode Unicode. An encoding is the mapping of Unicode code points to a stream of storable code units or octets. The most common encodings are UTF-8, UTF-16, and UTF-32.
Table 1 Common Definitions

| Term | Definition |
|---|---|
| Character Set | A repertoire of characters that have been collected together for some purpose. |
| Coded Character Set | An ordered character set in which each character has an assigned integer value. |
| Code Point | The integer value of a character within a coded character set. |
| Character Encoding | A mapping of code points to a series of bytes. |
| Code Unit | The smallest unit of storage in an encoded character: a single octet (byte) in UTF-8, a 16-bit value in UTF-16. |
| Charset | A set of characters that has been encoded using a character encoding. Often used as a synonym for character encoding. |

Each encoding has advantages and drawbacks, but one in particular, UTF-8, has gained widespread acceptance. This article describes what UTF-8 is and why it is important.
Unicode 3.1 code points lie in the range U+0000→U+10FFFF. Although each code point can be stored and manipulated as a 32-bit integer, a 32-bit wide character encoding is unlikely to be accepted everywhere right away. This is especially true for Western European and, more generally, non-Asian nations, which can encode their legacy character sets in as little as 1 byte per character.
UTF-8 is a multi-byte encoding in which each character is encoded in as few as 1 byte and as many as 4 bytes. Most Western European languages require fewer than 2 bytes per character on average. For example, text in Latin-based scripts requires only 1.1 bytes per character on average. Greek, Arabic, Hebrew, and Russian require an average of 1.7 bytes. Finally, Japanese, Korean, and Chinese typically require 3 bytes per character.[1]
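To get a feel for these averages, the following sketch measures the UTF-8 bytes per character of a few short sample strings using the JDK's built-in encoder. The class name `Utf8Density`, the method name, and the sample strings are illustrative assumptions, not part of the original article, and such short samples will not reproduce the cited averages exactly.

```java
import java.nio.charset.StandardCharsets;

class Utf8Density {
    /** Average number of UTF-8 bytes per code point in a non-empty text. */
    static double bytesPerCodePoint(String text) {
        int bytes = text.getBytes(StandardCharsets.UTF_8).length;
        int codePoints = text.codePointCount(0, text.length());
        return (double) bytes / codePoints;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Unicode",                 // ASCII only
            "na\u00EFve caf\u00E9",    // Latin text with two accented letters
            "\u03B1\u03B2\u03B3",      // Greek: alpha, beta, gamma
            "\u65E5\u672C\u8A9E"       // Japanese: three ideographs
        };
        for (String s : samples) {
            System.out.printf("%s: %.2f bytes per character%n",
                              s, bytesPerCodePoint(s));
        }
    }
}
```

The ASCII sample needs exactly 1 byte per character, the Greek sample 2, and the Japanese sample 3. The accented Latin sample falls in between because only two of its ten characters need a second byte.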
The encoding algorithm is straightforward. The table below shows how bits from a Unicode code point are arranged in the encoding for different character ranges:
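Table 2 UTF-8 bit encoding of a Unicode code point

| Character Range | Bit Encoding |
|---|---|
| U+0000→U+007F | 0xxxxxxx |
| U+0080→U+07FF | 110xxxxx 10xxxxxx |
| U+0800→U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000→U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
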
As the table above shows, characters in the range U+0000→U+007F can be encoded as a single byte. This means that the ASCII charset is represented unchanged, with a single byte of storage per character. The next range, U+0080→U+07FF, contains Latin characters with diacritics and several other alphabetic scripts, including Greek, Cyrillic, Hebrew, and Arabic. This range requires 2 bytes of encoded storage. The notable scripts in the range U+0800→U+FFFF are Chinese, Korean, and Japanese. These scripts require 3 bytes of storage for each character. Finally, the range beyond the Basic Multilingual Plane (BMP), U+10000→U+10FFFF, contains characters that are represented as surrogate pairs in UTF-16. Most of the new characters in this range are Chinese ideographs. The newly defined characters in this range require 4 bytes in the UTF-8 encoding.
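As a worked example of the three-byte layout, the sketch below spreads the 16 bits of the ideograph U+5B57 across three bytes. The class and method names are hypothetical, chosen only for illustration, and the method assumes its argument really lies in the range U+0800→U+FFFF.

```java
class Utf8BitLayout {
    /** Encodes a code point from the three-byte range U+0800..U+FFFF. */
    static int[] encodeThreeBytes(int codePoint) {
        return new int[] {
            0xE0 | (codePoint >> 12),          // 1110xxxx: top 4 bits
            0x80 | ((codePoint >> 6) & 0x3F),  // 10xxxxxx: middle 6 bits
            0x80 | (codePoint & 0x3F)          // 10xxxxxx: low 6 bits
        };
    }

    public static void main(String[] args) {
        int codePoint = 0x5B57; // binary 0101 101101 010111
        for (int b : encodeThreeBytes(codePoint)) {
            // Print each byte in hexadecimal and in binary
            System.out.printf("%02X %s%n", b, Integer.toBinaryString(b));
        }
        // prints:
        // E5 11100101
        // AD 10101101
        // 97 10010111
    }
}
```

Reading the binary column, the payload bits 0101, 101101, and 010111 follow the marker bits 1110, 10, and 10, and together they spell out the original code point.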
Algorithms for producing a UTF-8 encoded character can be very simple. The following Java code shows how you can easily create your own UTF-8 encoder[2]:
    /**
     * Converts an array of Unicode scalar values (code points) into
     * UTF-8. This algorithm works under the assumption that all
     * surrogate pairs have already been converted into scalar code
     * point values within the argument.
     *
     * @param ch an array of Unicode scalar values (code points)
     * @return a byte[] containing the UTF-8 encoded characters
     */
    public static byte[] encode(int[] ch) {
        // determine how many bytes are needed for the complete conversion
        int bytesNeeded = 0;
        for (int i = 0; i < ch.length; i++) {
            if (ch[i] < 0x80) {
                ++bytesNeeded;
            }
            else if (ch[i] < 0x0800) {
                bytesNeeded += 2;
            }
            else if (ch[i] < 0x10000) {
                bytesNeeded += 3;
            }
            else {
                bytesNeeded += 4;
            }
        }
        // allocate a byte[] of the necessary size
        byte[] utf8 = new byte[bytesNeeded];
        // do the conversion from character code points to utf-8
        for (int i = 0, bytes = 0; i < ch.length; i++) {
            if (ch[i] < 0x80) {
                // 0xxxxxxx
                utf8[bytes++] = (byte) ch[i];
            }
            else if (ch[i] < 0x0800) {
                // 110xxxxx 10xxxxxx
                utf8[bytes++] = (byte) ((ch[i] >> 6) | 0xC0);
                utf8[bytes++] = (byte) ((ch[i] & 0x3F) | 0x80);
            }
            else if (ch[i] < 0x10000) {
                // 1110xxxx 10xxxxxx 10xxxxxx
                utf8[bytes++] = (byte) ((ch[i] >> 12) | 0xE0);
                utf8[bytes++] = (byte) (((ch[i] >> 6) & 0x3F) | 0x80);
                utf8[bytes++] = (byte) ((ch[i] & 0x3F) | 0x80);
            }
            else {
                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                utf8[bytes++] = (byte) ((ch[i] >> 18) | 0xF0);
                utf8[bytes++] = (byte) (((ch[i] >> 12) & 0x3F) | 0x80);
                utf8[bytes++] = (byte) (((ch[i] >> 6) & 0x3F) | 0x80);
                utf8[bytes++] = (byte) ((ch[i] & 0x3F) | 0x80);
            }
        }
        return utf8;
    }
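One way to sanity-check a hand-written encoder like the one above is to compare its output with the encoder built into the JDK. The sketch below produces the reference bytes for a few code points. The class name `JdkUtf8Check` and its helper methods are assumptions made for illustration; only `String` and `StandardCharsets` come from the standard library.

```java
import java.nio.charset.StandardCharsets;

class JdkUtf8Check {
    /** Encodes code points with the JDK's built-in UTF-8 encoder. */
    static byte[] encodeWithJdk(int[] codePoints) {
        // String(int[], int, int) builds a string from code points and
        // throws IllegalArgumentException for values above U+10FFFF.
        String s = new String(codePoints, 0, codePoints.length);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated two-digit hexadecimal values. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One code point from each of the four ranges in Table 2
        int[] codePoints = {0x41, 0xE9, 0x5B57, 0x10400};
        System.out.println(toHex(encodeWithJdk(codePoints)));
        // prints: 41 C3 A9 E5 AD 97 F0 90 90 80
    }
}
```

Feeding the same array to a hand-written `encode` method and comparing the two byte arrays with `java.util.Arrays.equals` is a quick regression check for each of the four ranges.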
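UTF-8 is an important encoding for several reasons: it is compatible with ASCII, it works with legacy byte-oriented text algorithms, it is compact, and it is easy to parse.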
At the recent Unicode Conference in Hong Kong, one company said that adopting UTF-8 simplified its move to Unicode. Rather than changing its products’ code to support 16-bit or 32-bit wide Unicode characters, the company chose UTF-8. The reason? Its system had many hard-coded comparisons that looked for specific ASCII characters in text. Instead of modifying that code everywhere, the company simply changed its character encoding to UTF-8, which is compatible with ASCII. In other words, single-byte ASCII characters retain their encoded value in UTF-8. For example, code that checks for a ‘\’ can continue checking for the byte value 0x5C instead of being changed to check for 0x005C. Modifying hundreds of lines of text processing code scattered throughout thousands of lines of miscellaneous code can be time-consuming and error-prone. Sometimes selecting the UTF-8 encoding provides the easiest and most cost-effective way to get a basic level of Unicode support in a legacy application.
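The following sketch illustrates why such byte-level comparisons keep working. It scans UTF-8 encoded text for the single byte 0x5C, exactly as legacy ASCII code might, and still finds only the real backslashes even though the text contains Japanese characters. The class name, method name, and sample path are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class AsciiSearch {
    /** Counts occurrences of one ASCII byte value in raw encoded text. */
    static int countByte(byte[] text, int asciiValue) {
        int count = 0;
        for (byte b : text) {
            if ((b & 0xFF) == asciiValue) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A path made of "C:", a backslash, three katakana characters,
        // another backslash, and "file.txt"
        String path = "C:\\" + "\u30C7\u30FC\u30BF" + "\\" + "file.txt";
        byte[] utf8 = path.getBytes(StandardCharsets.UTF_8);

        // Every byte of a multi-byte UTF-8 sequence is 0x80 or higher,
        // so 0x5C can only ever be a real backslash.
        System.out.println(countByte(utf8, 0x5C)); // prints 2
    }
}
```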
Most applications have basic text handling algorithms, and many of those algorithms make flawed assumptions about a character’s storage requirements. For example, many programmers assume that a character requires only a single byte of storage. Another common assumption, especially among C programmers, is that a text string never contains the value 0x00; if this value does appear, it typically marks the end of the string. Encodings like UTF-16 and UTF-32 store characters as 16-bit or 32-bit values. When a string of 16-bit or 32-bit values is processed as a series of bytes, the value 0x00 appears often, especially in text written in Latin-based scripts. This confuses existing text processing algorithms, leading to miscalculated string lengths, oddly concatenated strings, and search failures. On the other hand, because UTF-8’s basic code unit is a byte, legacy algorithms can typically run with only minor adjustments, if any.
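A small sketch makes the 0x00 problem concrete. It looks for the first zero byte in the same two-letter string encoded as big-endian UTF-16 and as UTF-8, which is roughly what a C-style string routine would do. The class and method names are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

class ZeroByteCheck {
    /** Index of the first 0x00 byte, or -1 if the data contains none. */
    static int firstZeroByte(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "Hi";
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE); // 00 48 00 69
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);     // 48 69

        // A byte-oriented routine would treat the UTF-16 data as an
        // empty string because its very first byte is 0x00.
        System.out.println(firstZeroByte(utf16)); // prints 0
        System.out.println(firstZeroByte(utf8));  // prints -1
    }
}
```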
One complaint often aimed at Unicode is that it requires far more space than legacy encodings for Latin-based scripts. In other words, UTF-16 and UTF-32 require 16 or 32 bits of storage for most characters instead of the single byte required by the ISO-8859 series of encodings. However, UTF-8 stores the ASCII subset of all these charsets in exactly 1 byte per character. The ASCII subset is by far the most used set of characters for Western European and American languages. As mentioned earlier, most Western European languages can be written with 1.1 bytes per character on average. This is almost as efficient as ASCII, yet UTF-8 still allows up to 4 bytes per character for rare characters and less common scripts when necessary.
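The size trade-off can be sketched by computing the storage each encoding form needs for the same text. The class name `EncodingSizes`, its methods, and the sample strings below are assumptions for illustration. The UTF-16 and UTF-32 sizes are computed from code unit and code point counts and exclude any byte order mark.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Size(String s) {
        return s.length() * 2; // 2 bytes per UTF-16 code unit
    }

    static int utf32Size(String s) {
        return s.codePointCount(0, s.length()) * 4; // 4 bytes per code point
    }

    public static void main(String[] args) {
        String latin = "Hello, world";          // 12 ASCII characters
        String japanese = "\u65E5\u672C\u8A9E"; // 3 ideographs

        System.out.println(utf8Size(latin) + " " + utf16Size(latin)
                           + " " + utf32Size(latin));    // prints 12 24 48
        System.out.println(utf8Size(japanese) + " " + utf16Size(japanese)
                           + " " + utf32Size(japanese)); // prints 9 6 12
    }
}
```

For the ASCII sample, UTF-8 needs half the space of UTF-16 and a quarter of UTF-32. For the Japanese sample, UTF-8 is larger than UTF-16 but still smaller than UTF-32.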
Although many new development projects standardize quickly on Unicode, older projects often used legacy character sets that supported a small set of related languages. Longtime internationalization and localization engineers remember updating text processing algorithms to handle both “single-byte” and “multi-byte” character sets. Do you remember updating your code to check “lead” bytes and possibly “trail” bytes during processing? Remember how difficult it was to find the beginning of a character if your index into the text was an arbitrary location? The problem was that, in some encodings, a byte value used as a trail byte could also be a lead byte. The Shift-JIS encoding, for example, was difficult to process backwards for this reason.
When Unicode became available as a fixed-width 16-bit encoding, many of us were excited to toss out multi-byte encodings. Understandably, you may be hesitant to adopt a multi-byte Unicode encoding after all the trouble you may have had with multi-byte Asian character sets. However, UTF-8 is different, and it doesn’t have all of the same problems as those legacy encodings. For example, it is much easier to find the start of a character from any arbitrary point in a text string. So-called “trail” bytes of a UTF-8 character sequence always have the bit pattern 10xxxxxx, and no lead byte has that pattern, so it is easy to find one’s way back to the beginning of a character. A character pointer is at most 3 bytes away from the character’s beginning. Even with most Asian ideographs, character boundaries are at most a couple of bytes away. Figure 1 shows several characters and their encoding in UTF-8. Notice the hexadecimal byte sequence E5, AD, 97. Asked to find the character’s beginning from the location marked 1, we could proceed as follows to find the character boundary at location 2 in the figure:
1. Does the current byte start with the bit pattern 10xxxxxx?
2. If yes, move left one byte and go to step 1.
3. Otherwise, the current byte begins the character; finished.
Figure 1 Finding character boundaries is relatively simple.
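The steps above can be sketched in a few lines of Java. The class and method names are hypothetical, and the sketch assumes the input is well-formed UTF-8. The sample data places the byte sequence E5, AD, 97 between two ASCII letters.

```java
class Utf8Boundary {
    /**
     * Returns the index of the first byte of the character that
     * contains the byte at the given index.
     */
    static int findCharStart(byte[] utf8, int index) {
        // Trail bytes match 10xxxxxx; step left until we leave them.
        while (index > 0 && (utf8[index] & 0xC0) == 0x80) {
            index--;
        }
        return index;
    }

    public static void main(String[] args) {
        // 'A', then the three-byte sequence E5 AD 97, then 'B'
        byte[] text = {0x41, (byte) 0xE5, (byte) 0xAD, (byte) 0x97, 0x42};

        System.out.println(findCharStart(text, 3)); // prints 1
        System.out.println(findCharStart(text, 4)); // prints 4
    }
}
```

Starting at the last trail byte (index 3), the loop steps left twice and stops at the lead byte E5. Starting on an ASCII byte, it does not move at all.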

Unlike some legacy character encodings, UTF-8 is fairly easy to parse and manipulate. The bit patterns of the encoding allow you to quickly determine whether your character index points to a character’s beginning or somewhere else. Moving backward or forward within a string is easy.
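Moving forward is just as mechanical, because the lead byte alone tells you how long the sequence is. The sketch below reads that length from the lead byte's bit pattern and uses it to count the characters in a byte array. The names `Utf8Forward`, `sequenceLength`, and `countCharacters` are illustrative assumptions, and the code assumes well-formed input apart from rejecting a stray trail byte.

```java
class Utf8Forward {
    /** Length in bytes of the sequence that starts with this lead byte. */
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;            // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        throw new IllegalArgumentException(
            "not a lead byte: 0x" + Integer.toHexString(b));
    }

    /** Counts characters by hopping from lead byte to lead byte. */
    static int countCharacters(byte[] utf8) {
        int count = 0;
        for (int i = 0; i < utf8.length; i += sequenceLength(utf8[i])) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A', then the three-byte sequence E5 AD 97, then 'B'
        byte[] text = {0x41, (byte) 0xE5, (byte) 0xAD, (byte) 0x97, 0x42};
        System.out.println(countCharacters(text)); // prints 3
    }
}
```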
UTF-8 is a compact, efficient, multi-byte Unicode encoding. It distributes the bit pattern of a Unicode code point across 1, 2, 3, or 4 bytes.
UTF-8 encodes ASCII in only 1 byte. That means that languages that use Latin-based scripts can be represented with only 1.1 bytes per character on average. Other languages may require more bytes per character. Only the Asian scripts have significant encoding overhead in UTF-8 compared to UTF-16.
UTF-8 is useful for legacy systems that want Unicode support because developers don’t have to drastically modify text processing code. Code that assumes single-byte code units typically doesn’t fail completely when given UTF-8 text instead of ASCII or even Latin-1.
Finally, unlike some legacy encodings, UTF-8 is easy to parse. So-called “lead” and “trail” bytes are easily distinguished. Moving forward or backward in a text string is easier in UTF-8 than in many other multi-byte encodings.
[1] Forms of Unicode, Mark Davis, September 1999, http://www.ibm.com/developerworks/unicode/library/utfencodingforms/index.html.
[2] This code has not been optimized for size or speed.