Google Open Source Blog: Unicode

Happy Birthday, ICU!

Posted: Tuesday, May 19, 2009

The ICU project is celebrating 10 years of being open source this month.

"ICU" in this case stands for International Components for Unicode - not to be confused with Intensive Care Unit or International Communist Union... It is the premier software internationalization library, appearing in everything from your Google Android phone or your iPod all the way up to IBM mainframes. It provides the Unicode support that all of these programs need for handling the languages of the world, from Arabic to Chinese to Vietnamese.

ICU originated in an Apple/IBM/HP joint venture. That code evolved into the core of Java internationalization for JDK 1.1.4; a large portion of it still exists in the java.text and java.util packages. At that time, it included little more than sorting, locale and message support, and formatting for dates, numbers, and so on. (If you're interested in the early history, see an older paper by Laura Werner, now at Google.) The libraries were refined over time and ported back to C and C++; now there are also wrappers for other languages, such as PHP.

ICU's data comes from CLDR, the Unicode Consortium's open source project for locale data, and ICU typically releases each new version right after CLDR does. CLDR 1.7 was released on Friday, May 8, with ICU 4.2 following on the very same day.

While ICU was around before Google, Google has more recently played a strong role in the development of ICU and has made major contributions to the Unicode CLDR project. ICU forms the foundation of our 40-language initiative, so we look forward to many successful future birthdays!

By Mark Davis, Internationalization team

Update on emoji4unicode

Posted: Thursday, March 19, 2009

By Kat Momoi, Software Engineer and Emoji Encoding Project member & Takeshi Kishimoto, Gmail Product Manager

Last November, Markus Scherer wrote about Google's efforts to encode Emoji (絵文字), or "picture characters," the graphical versions of :-) and its friends, and about the creation of the open source project emoji4unicode. Based on feedback on the project from a variety of people, we have updated and refined the list of Emoji characters that need to be encoded as new characters in an upcoming version of the Unicode Standard and ISO/IEC 10646. The final list reflects all known Emoji characters used by the three major mobile phone companies in Japan (NTT docomo, KDDI/AU and SoftBank Mobile), excluding those already encoded in the current Unicode Standard.

The Unicode Standard and ISO/IEC 10646 are synchronized in terms of character content, and are jointly controlled by the Unicode Consortium and the ISO/IEC JTC1/SC2/WG2 committee. Google and Apple jointly submitted the Emoji Encoding Proposal to the Unicode Consortium, and it was formally approved by the Unicode Technical Committee on February 6, 2009. The proposal now goes to ISO/IEC JTC1/SC2/WG2 for deliberation and approval as a joint contribution to ISO by the Unicode Consortium and the US national body. The approval by the Unicode Consortium is thus a major milestone toward getting the characters into Unicode. The next SC2/WG2 meeting takes place in late April 2009 in Dublin, Ireland, and the Emoji proposal will be discussed there. We encourage you to read a copy of the proposal submitted to SC2/WG2.

Since we wrote the first blog post on this topic, we have received positive feedback for this proposal. We are both surprised and gratified to learn that there are so many people with a strong interest in Emoji, and we believe that further support from the community will greatly aid in advancing the cause of the Emoji Encoding project.

Emoji for Unicode: Open Source Data for the Encoding Proposal

Posted: Wednesday, November 26, 2008

By Markus Scherer, Google Internationalization Engineering

Emoji (絵文字), or "picture characters", the graphical versions of :-) and its friends, are widely used and especially popular among Japanese cell phone users. Just last month, they became available in Gmail - see the team's announcement: A picture is worth a thousand words.

These symbols are encoded as custom (carrier-specific) symbol characters and sent as part of text messages, emails, and web pages. In theory, they are confined to each cell phone carrier's network unless there is an agreement and a converter in place between two carriers. In practice, however, people expect emoji just to work - what they put into a message will get to all the recipients; what they see on a web page will be seen by others; if they search for a character they'll find it. For that to really work well, these symbol characters need to be part of the Unicode Standard (the universal character set used in modern computing).
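To make the converter idea concrete, here is a minimal sketch in Python of how two carriers' custom symbol codes could be bridged through a shared standard character. The table layout, the function names, and the specific carrier code points are assumptions made for illustration and are not taken from the emoji4unicode data; only U+2600 (BLACK SUN WITH RAYS) is an existing Unicode character.

```python
# Illustrative sketch only. The carrier code points below are assumed values
# for this example and should not be treated as authoritative mapping data.
CARRIER_TO_STANDARD = {
    "docomo":   {0xE63E: 0x2600},  # carrier "sun" symbol -> U+2600
    "kddi":     {0xE488: 0x2600},
    "softbank": {0xE04A: 0x2600},
}

# Reverse tables (standard code point -> carrier code point), built once.
STANDARD_TO_CARRIER = {
    carrier: {std: custom for custom, std in table.items()}
    for carrier, table in CARRIER_TO_STANDARD.items()
}


def to_standard(text, carrier):
    """Replace a carrier's custom symbol characters with standard ones."""
    table = CARRIER_TO_STANDARD[carrier]
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)


def from_standard(text, carrier):
    """Replace standard symbol characters with a carrier's custom ones.

    Characters the carrier has no custom code for are left unchanged.
    """
    table = STANDARD_TO_CARRIER[carrier]
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)


def convert(text, source, target):
    """Convert between two carriers by going through the standard form."""
    return from_standard(to_standard(text, source), target)
```

With a shared standard in the middle, each carrier needs only one table to and from that standard, rather than a separate pairwise agreement and converter for every other carrier.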

There are active, ongoing efforts to standardize a complete set of emoji as regular symbol characters in Unicode. This involves determining which symbols are already covered in Unicode, and which new symbols would be needed. We're trying to help this effort along by sharing all of our mapping data and tools in the form of the "emoji4unicode" open source project. The goal is more effective collaboration with other members of the Unicode Consortium and review by the cell phone carriers and other interested parties. By making these tools and mappings available, we hope to assist and accelerate the encoding process. Take a look at the documentation, browse the data and tools and let us know what you think.
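The sorting step described above (already covered versus needing a new character) can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual tooling: the candidate list, its labels, and the `partition` function are invented for the example, and only Python's standard `unicodedata` module is relied on.

```python
import unicodedata

# Hypothetical candidate list: a working label for each symbol, plus the
# existing Unicode character proposed as its match (None if no match is known).
CANDIDATES = [
    ("sun", "\u2600"),
    ("cloud", "\u2601"),
    ("umbrella with rain drops", "\u2614"),
    ("hypothetical carrier-only symbol", None),
]


def partition(candidates):
    """Split candidates into symbols already in Unicode and symbols still needed."""
    covered, needed = [], []
    for label, ch in candidates:
        # unicodedata.name returns the default (None) for unassigned code points.
        name = unicodedata.name(ch, None) if ch else None
        if name:
            covered.append((label, f"U+{ord(ch):04X}", name))
        else:
            needed.append(label)
    return covered, needed


covered, needed = partition(CANDIDATES)
for label, code_point, name in covered:
    print(f"{label}: already covered by {code_point} {name}")
for label in needed:
    print(f"{label}: would need a new character")
```

In practice the matching itself is a judgment call made by reviewers; a script like this only helps keep the resulting lists consistent.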