Unicode support doesn’t mean your application is internationalized

Over the years, I’ve helped many organizations internationalize their software products. One of the most common misunderstandings is how Unicode will help their product. Customers sometimes mistakenly believe that Unicode support will be sufficient to internationalize their products. Sometimes they believe that Unicode “support” is a single, yes-no, on-off ability, when instead Unicode support is typically implemented in various stages and levels.

Unicode is a character encoding standard. It’s a big standard, with lots of nuances. Your products can implement “Unicode support” in many ways. The result is that those products will be able to manipulate, process, store, and perhaps even display the world’s scripts in a variety of ways, but usually not in all ways. Your product’s ability to support Unicode is not binary; instead, products can have “Unicode support” at a variety of levels. In the simplest case, your product might only store and retrieve Unicode characters correctly. At a more sophisticated level, your product may be able to sort, search, or display Unicode characters. Again, Unicode “support” in a product cannot be evaluated by a single check-box or yes-no answer. Typically, products support Unicode in some ways but not in others.
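To make the idea of levels concrete, here is a minimal Java sketch. It assumes a product that already stores Unicode strings correctly and shows that sorting them well is a separate step. The class name SortLevels and its two helper methods are made up for this illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class SortLevels {
    // Level 1: the strings are stored fine, but sorting compares raw
    // UTF-16 code unit values, which is what String.compareTo does.
    static List<String> naiveSort(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Level 2: sorting follows the collation rules of a given locale.
    static List<String> localeSort(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // The second word starts with U+00C9 (E with acute accent); it is
        // written as an escape so the source file encoding does not matter.
        List<String> words = Arrays.asList("zebra", "\u00C9mile", "apple");
        System.out.println(naiveSort(words));                 // accented word lands last
        System.out.println(localeSort(words, Locale.FRENCH)); // accented word lands between the others
    }
}
```

Both methods handle the same Unicode data without losing anything, yet only the second produces an order a French reader would expect.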

Implementing even the most sophisticated levels of Unicode support doesn’t mean your product is internationalized. Internationalization is the process of preparing a software code base to be easily localized. Internationalization creates a product that has no particular bias towards a single culture or language. That product can be localized for a specific culture. Unicode support can be a key component of an internationalization effort, but it is only one component. Like Unicode support, your internationalization support will have different levels of sophistication and ability.

To summarize, products can support Unicode in a variety of ways. Supporting Unicode does not usually mean that your product has the ability to perform every possible function on Unicode characters. Instead, “support” usually means that you can do some things with Unicode but probably not others. Additionally, supporting Unicode isn’t the only step to internationalize your products. Unicode is only one step, an important step. Internationalization is the process of creating a product that is easier to localize, one that has cultural biases removed so that a specific culture or locale can be supported more easily after localization. You might use Unicode as a step in your internationalization efforts, but Unicode itself doesn’t create an internationalized product.

Contact me or leave a comment if you have questions about how Unicode can help your product. If I can help, I will. If I can’t, I probably know someone who can.

What is Unicode?

Unicode is a character set standard. It assigns a unique number to every character used around the globe, regardless of language, computing platform, or application. Unicode includes the characters from other, more limited character sets. Prior to Unicode, those smaller character sets assigned character values differently from each other. Unicode unifies them; every character gets its own unique value.
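A small Java sketch can show those unique numbers (code points) directly. The class name CodePoints and the sample text are just assumptions for this illustration.

```java
class CodePoints {
    // Formats each code point of the text in the conventional U+ hex notation.
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Three characters: U+00E9, U+65E5, and U+1F600. The last one lies
        // outside the Basic Multilingual Plane, so a Java string stores it
        // as a surrogate pair of two UTF-16 code units.
        String text = "\u00E9\u65E5\uD83D\uDE00";
        System.out.println(describe(text));                         // U+00E9 U+65E5 U+1F600
        System.out.println(text.length());                          // 4 UTF-16 code units
        System.out.println(text.codePointCount(0, text.length()));  // 3 code points
    }
}
```

The difference between the last two numbers is one of those nuances: a product that counts code units instead of characters handles most text fine and then miscounts as soon as a supplementary character shows up.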

You can get more information about Unicode from the Unicode Home Page.

JSR 310, is it time for a new Date concept in Java

JSR 310: A New Java Date/Time API by Jesse Farnham — Java SE’s Date and Calendar classes leave much to be desired. Will the third time be the charm? JSR 310, tracking for inclusion in Java SE 7, once again tries to offer a comprehensive date and time API, borrowing much of its design from the popular Joda Time API. In this article, Jesse Farnham takes a look at JSR 310’s concepts and how they may yet bring sense to dates and times in Java.

Understanding locale in the Java platform

Language and geographic environment are two important influences on our culture. They create the system in which we interpret other people and events in our life. They also affect, even define, proper form for presenting ourselves and our thoughts to others. To communicate effectively with another person, we must consider and use that person’s culture, language, and environment.

Read Understanding Locale in the Java Platform for more details about how to use locale in your Java applications.
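As a quick taste, here is a minimal sketch of a locale driving presentation in Java. The class name LocaleDemo and the sample value are made up for this example; Locale and NumberFormat are standard Java SE classes.

```java
import java.text.NumberFormat;
import java.util.Locale;

class LocaleDemo {
    // Formats the same number according to the conventions of a locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.89;
        System.out.println(formatNumber(value, Locale.US));      // 1,234,567.89
        System.out.println(formatNumber(value, Locale.GERMANY)); // 1.234.567,89
    }
}
```

The data is identical in both calls; only the locale changes how it is presented to the reader.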

Japanese input methods on Ubuntu

Adding input methods and font support for Japanese is a trivial process for Windows XP and Vista. After moving my laptop from XP to Ubuntu Linux, I realize that familiarity is…well…comfortable. I’m a little lost.

Really all I want to do is enable the Japanese input methods on this new, shiny Ubuntu 8.04 system. I tried installing SCIM and an input method called “Anthy”. Sigh…I couldn’t get it to work on first try, so I removed it. Of course I’ll try again, but I’ll do some Yahoo/Google search homework first.

Managing resources in the Swing Application Framework (JSR 296)

Instead of loading and working with ResourceBundle files directly, you will use the ResourceManager and ResourceMap framework classes to manage resources. A ResourceMap contains the resources defined in a specific ResourceBundle implementation. A map also contains links to its parent chain of ResourceMap objects. The parent chain for any class includes the ResourceMap for that specific class, the application subclass to which the class belongs, and all superclasses of your application up to the base Application class.
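The parent chain idea can be sketched in plain Java without the framework. This is not the JSR 296 API: ChainedResources is a hypothetical class written only for this illustration, and it assumes that a lookup which misses in one map falls through to that map's parent.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a resource map with a parent link.
class ChainedResources {
    private final Map<String, String> entries = new HashMap<>();
    private final ChainedResources parent;

    ChainedResources(ChainedResources parent) {
        this.parent = parent;
    }

    ChainedResources put(String key, String value) {
        entries.put(key, value);
        return this;
    }

    // Looks in this map first, then walks up the parent chain.
    String getString(String key) {
        for (ChainedResources map = this; map != null; map = map.parent) {
            String value = map.entries.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Application-level resources sit at the top of this small chain.
        ChainedResources appMap = new ChainedResources(null)
                .put("title", "My Application")
                .put("okButton.text", "OK");
        // Class-level resources override or extend the application's.
        ChainedResources classMap = new ChainedResources(appMap)
                .put("title", "Settings");

        System.out.println(classMap.getString("title"));         // Settings
        System.out.println(classMap.getString("okButton.text")); // OK
    }
}
```

The class-level map answers first, and anything it does not define is resolved further up the chain.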


Encoding URIs and their components


As you pass data from the browser to the application server to the database, opportunities for data loss lurk. I highlighted some of those conversion points earlier, but I neglected a browser issue. The JavaScript layer has its own lossy points of interest. One of those points is the escape function.

The escape function “encodes” a string by replacing non-ASCII characters and some punctuation symbols with escape sequences of the form %XX, where X is a hex digit. Unicode characters from \u0080 through \u00FF are converted to the %XX form as well. Unicode characters in higher ranges take the form %uXXXX. So, as an example, the name José will take the form Jos%E9.

The problem with this is that the escape mechanism is broken if you want to use UTF-8 as your document encoding. If you were dynamically composing URL strings with parameters, those parameters will definitely not be escaped correctly. Instead of Jos%E9 that URI component should really be Jos%C3%A9.

Fortunately, JavaScript has resolved the problem, but the solution means you’ll have to use another function. The escape function is deprecated in ECMAScript v 3. Instead, you should use the function encodeURI or encodeURIComponent. These functions convert their argument to the UTF-8 encoding and then %XX encode each byte of the non-ASCII characters. Two forms of the function exist so that you have greater control over whether characters like “?” and “&” are encoded: encodeURI leaves them alone so that a complete URI keeps its structure, while encodeURIComponent encodes them so that a single parameter value can be embedded safely. You’ll need to check your documentation for details.
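The same byte-level difference can be reproduced on the Java side. This is only an analogue of the JavaScript functions, not a replacement for them: java.net.URLEncoder performs HTML form encoding (it turns a space into “+”, for example), and the class name PercentEncoding is made up for this sketch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentEncoding {
    // Percent-encodes a value using the bytes of the given charset.
    static String encode(String value, Charset charset) {
        try {
            return URLEncoder.encode(value, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Should not happen for charsets the JVM already knows about.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The name ends in U+00E9 (e with acute accent).
        String name = "Jos\u00E9";
        // One byte in ISO-8859-1, the result escape() produces.
        System.out.println(encode(name, StandardCharsets.ISO_8859_1)); // Jos%E9
        // Two bytes in UTF-8, the result encodeURIComponent() produces.
        System.out.println(encode(name, StandardCharsets.UTF_8));      // Jos%C3%A9
    }
}
```

A server that decodes parameters as UTF-8 will read the second form correctly and choke on the first.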

What’s this mean for you? Maybe nothing if you’re hopelessly attached to ISO-8859-1. However, if you’re trying to reach a global market with your product, chances are very good that you’ve decided to use UTF-8 for your character set encoding. That’s an excellent choice, but you’ll have to manage the conversion points. In a nutshell, that simply means that you’ll need to use UTF-8 from front to back consistently.

Part of managing those conversion points is consistently providing well-formed URIs to your application server. If you use JavaScript to manipulate data or to create dynamic URIs in your application, make sure you toss aside that deprecated escape function. Take a look at encodeURI and encodeURIComponent instead.

International Domain Names

The Java SE 6 release provides an interesting new class: java.net.IDN. It’s small, simple…very focused on a single task. That task has two parts:

  1. to convert domain names containing practically any Unicode character to an ASCII Compatible Encoding (ACE)
  2. to convert ACE names back into their full Unicode UTF-16 form
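Both directions can be tried with a minimal sketch like this one. IDN.toASCII and IDN.toUnicode are the real java.net.IDN methods; the wrapper class IdnDemo and the sample domain are assumptions made for this illustration.

```java
import java.net.IDN;

class IdnDemo {
    // Unicode domain name to its ASCII Compatible Encoding.
    static String toAce(String unicodeName) {
        return IDN.toASCII(unicodeName);
    }

    // ACE name back to its Unicode form.
    static String fromAce(String aceName) {
        return IDN.toUnicode(aceName);
    }

    public static void main(String[] args) {
        // The first label contains U+00FC (u with diaeresis).
        String unicodeName = "b\u00FCcher.example";
        String ace = toAce(unicodeName);
        System.out.println(ace); // xn--bcher-kva.example
        System.out.println(fromAce(ace).equals(unicodeName)); // true
    }
}
```

Labels that are already plain ASCII, like “example” here, pass through unchanged; only the label with non-ASCII characters gets the xn-- prefix.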


JavaScript file encoding

Although JavaScript itself uses Unicode internally, you can still run into charset conversion problems. Consider the following example of charset conversion issues with a very simple HTML and JS file. Continue reading ‘JavaScript file encoding’ »

Changing project encodings in NetBeans 6.5

I reported that NetBeans 6.1’s project charset encoding feature would allow an unsuspecting user to destroy file data. That’s still true…through no fault of NetBeans really. It’s just a matter of fact — if you start out with UTF-8 and convert your project files to ASCII or ISO-8859-1 or any other subset of Unicode, you will lose any characters that are not also in the target charset.
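The loss is easy to reproduce outside the IDE. Here is a minimal sketch that uses only standard java.nio.charset classes; it is not what NetBeans does internally, and the class name LossyConversion and the sample text are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyConversion {
    // Encodes the text in the target charset and decodes it again.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset target) {
        return new String(text.getBytes(target), target);
    }

    // Reports up front whether every character survives the conversion.
    static boolean isSafe(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // "Jos" plus U+00E9, a space, then two kanji (U+65E5 U+672C).
        String text = "Jos\u00E9 \u65E5\u672C";
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));   // Jos? ??
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1)); // keeps the accent, loses the kanji
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
        System.out.println(isSafe(text, StandardCharsets.ISO_8859_1));    // false
    }
}
```

A check like isSafe before converting is one way to warn about the problem instead of discovering it afterwards.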
