{{Short description|Software library}}
{{Infobox software
| name = International Components for Unicode
| title = 
| logo = <!-- [[File: ]] -->
| logo caption = 
| screenshot = <!-- [[File: ]] -->
| caption = 
| collapsible = 
| author = 
| developer = [[Unicode Consortium]]
| released = 1999<!-- {{Start date|YYYY|MM|DD|df=yes/no}} -->
| discontinued = 
| latest release version = {{wikidata|property|preferred|references|edit|P348|P548=Q2804309}}
| latest release date = {{Start date and age|{{wikidata|qualifier|preferred|single|P348|P548=Q2804309|P577}}|df=yes}}
| latest preview version = 
| latest preview date = 
| programming language = [[C (programming language)|C]]/[[C++]] ([[C++11]]<!-- seemingly also needing C++[11] compiler for using from C code: "ICU4C requires C++11 and has been tested with up to C++20.". TODO: In future ICU 75 [[C11 (C standard revision)|C11]]/[[C++17]] for using-->) and [[Java (programming language)|Java]] 8+
| operating system = [[Cross-platform]]
| platform = 
| size = 
| language = 
| language count = <!-- DO NOT include this parameter unless you know what it does -->
| language footnote = 
| genre = [[Library (computer science)|Libraries]] for [[Unicode]] and [[internationalization and localization|internationalization]]
| license = [https://github.com/unicode-org/icu/blob/main/LICENSE Unicode License]
| website = {{URL|https://icu.unicode.org/}}
| standard = 
| AsOf = 
}}

'''International Components for Unicode''' ('''ICU''') is an [[open-source software|open-source]] project of mature [[C (programming language)|C]]/[[C++]] and [[Java (programming language)|Java]] libraries for [[Unicode]] support, software [[internationalization and localization|internationalization]], and software globalization. ICU is portable to many operating systems and environments, and gives applications the same results on all platforms and across C, C++, and Java software. The ICU project is a technical committee of the [[Unicode Consortium]] and is sponsored, supported, and used by [[IBM]] and many other companies.<ref>{{Cite web|url=http://site.icu-project.org/home|title=ICU - International Components for Unicode|website=site.icu-project.org|access-date=2011-11-14|archive-date=2021-08-27|archive-url=https://web.archive.org/web/20210827203044/http://site.icu-project.org/home|url-status=dead}}</ref> ICU has been included as a standard component with [[Microsoft Windows]] since [[Windows 10]] version 1703.<ref>{{cite web |url=https://devblogs.microsoft.com/oldnewthing/20210527-00/?p=105255 |title=How can I convert between IANA time zones and Windows registry-based time zones? |work=The Old New Thing |last=Chen |first=Raymond |date=27 May 2021 |publisher=[[Microsoft]]}}</ref>

ICU provides the following services: [[Unicode]] text handling, full character properties, and [[character set]] conversions; Unicode [[regular expression]]s; full Unicode sets; character, word, and line boundaries; language-sensitive [[collation]] and searching; [[Unicode normalization|normalization]], upper and lowercase conversion, and script [[transliteration]]s; comprehensive [[locale (computer software)|locale]] data and resource bundle architecture via the [[Common Locale Data Repository]] (CLDR); multiple [[calendar]]s and [[time zone]]s; and rule-based formatting and parsing of dates, times, numbers, currencies, and messages. ICU historically provided a [[complex text layout]] service for Arabic, Hebrew, Indic, and Thai, but it was deprecated in version 54 and removed entirely in version 58 in favor of [[HarfBuzz]].<ref>{{Cite web|url=http://userguide.icu-project.org/layoutengine|title=Layout Engine - ICU User Guide|website=userguide.icu-project.org}}</ref>
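
Three of the services listed above (normalization, language-sensitive collation, and word boundaries) can be sketched as follows. This is only an illustration: it uses the JDK's own <code>java.text</code> classes so that it runs without extra dependencies, on the assumption that they are close enough in spirit to show the idea. ICU4J ships counterparts in its <code>com.ibm.icu.text</code> package, whose data and behavior may differ. The class name <code>TextServicesSketch</code> and its method names are hypothetical.

```java
import java.text.BreakIterator;
import java.text.Collator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TextServicesSketch {

    /** Canonical composition (NFC): "e" followed by U+0301 becomes U+00E9. */
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Sorts a copy of the list with the locale's collation rules instead of raw code unit order. */
    static List<String> sortForLocale(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator);
        return copy;
    }

    /** Splits text at word boundaries and keeps only segments that start with a letter or digit. */
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One precomposed character after normalization.
        System.out.println(toNfc("e\u0301").length());
        // Collation puts "Zebra" last; a plain String sort would put it first.
        System.out.println(sortForLocale(Arrays.asList("banana", "Zebra", "apple"), Locale.ENGLISH));
        // Punctuation and spaces are separate segments and are filtered out.
        System.out.println(words("Hello, world!", Locale.ENGLISH));
    }
}
```

The collation step is where locale-sensitive behavior differs most visibly from naive string comparison: uppercase "Z" sorts before lowercase "a" in code unit order, but after it under English collation rules.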

ICU provides more extensive internationalization facilities than the standard libraries for C and C++. ICU 75, planned for April 2024, will require [[C++17]] (up from [[C++11]]) or [[C11 (C standard revision)|C11]] (up from C99), depending on which language is used. ICU has historically used [[UTF-16]], and the Java library still uses only UTF-16, while the C/C++ library also supports [[UTF-8]],<ref name ="UTF-8" /><ref>{{Cite web|url=http://userguide.icu-project.org/strings/utf-8|title=UTF-8 - ICU User Guide|website=userguide.icu-project.org|access-date=2018-04-03}}</ref> including the correct handling of "illegal UTF-8".<ref>{{Cite web|url=http://bugs.icu-project.org/trac/ticket/13311|title=#13311 (change illegal-UTF-8 handling to Unicode "best practice") |website=bugs.icu-project.org|access-date=2018-04-03}}</ref>
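
The practical difference between the two encodings can be illustrated with a short sketch. It uses only the JDK (whose strings are UTF-16), not ICU itself. The class name <code>EncodingSketch</code> and its methods are hypothetical, and the lenient decoding shown is the JDK decoder's behavior, which is assumed here only as an example of substituting U+FFFD for an illegal byte; ICU's exact handling is not reproduced.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSketch {

    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes the same text occupies when encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Decodes bytes as UTF-8; the JDK replaces malformed input with U+FFFD. */
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a", U+00E9 (e with acute accent), and U+1F600 (an emoji outside the BMP).
        String sample = "a\u00e9\uD83D\uDE00";
        System.out.println(utf16Units(sample)); // 4: the emoji needs a surrogate pair
        System.out.println(codePoints(sample)); // 3
        System.out.println(utf8Bytes(sample));  // 7: 1 + 2 + 4 bytes

        // 0xFF can never appear in well-formed UTF-8.
        byte[] illegal = { 'a', (byte) 0xFF, 'b' };
        System.out.println(decodeLeniently(illegal).equals("a\uFFFDb")); // true
    }
}
```

The sample shows why code that indexes text by UTF-16 code unit cannot be reused unchanged on UTF-8 buffers: the same three characters occupy four units in one encoding and seven bytes in the other.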

ICU 73.2 made significant changes for [[GB18030]]-2022 compliance, i.e. for Chinese (the updated GB18030 [[Unicode Transformation Format]] standard is slightly incompatible with its predecessor); it has "a modified character conversion table, mapping some GB18030 characters to Unicode characters that were encoded after GB18030-2005". Other changes include improved Japanese and Korean short-text line breaking, and in "English, the name “Türkiye” is now used for the country instead of “Turkey” (the alternate spelling is also available in the data)."<ref>{{Cite web |title=ICU - International Components for Unicode - ICU 73 |url=https://icu.unicode.org/download/73 |access-date=2023-09-24 |website=icu.unicode.org |language=en-US}}</ref>

ICU 74 "updates to Unicode 15.1, including new characters, emoji, security mechanisms, and corresponding APIs and implementations. <!-- It also updates to CLDR 44 (blog) locale data with new locales and various additions and corrections. --> [..]
ICU 74 and CLDR 44 are major releases, including a new version of Unicode and major locale data improvements."<!-- They subsume the changes for the ICU 73.2 and CLDR 43.1 maintenance releases.--><ref>{{Cite web |title=ICU - International Components for Unicode - ICU 74 |url=https://icu.unicode.org/download/74 |access-date=2023-11-29 |website=icu.unicode.org |language=en-US}}</ref> Among the many changes are updates to person name formatting, improved language support (e.g. for [[Low German]]), and a new spoof checker API that follows the [[Unicode 15|Unicode 15.1.0]] version of UTS #39, ''Unicode Security Mechanisms''.

==Older version details==
* ICU 72 updated to [[Unicode 15]] (ICU 74 later updated to 15.1). "In many formatting patterns, ASCII [[space (punctuation)|spaces]] are replaced with Unicode spaces (e.g., a "[[thin space]]")." ICU4J now requires Java 8, but "Most of the ICU 72 library code should still work with Java 7 / Android API level 21, but we no longer test with Java 7."<ref>{{Cite web |title=ICU - International Components for Unicode - ICU 72 |url=https://icu.unicode.org/download/72 |access-date=2023-01-24 |website=icu.unicode.org |language=en-US}}</ref>
* ICU 71 added, among other things, phrase-based line breaking for Japanese (earlier methods did not work well for short Japanese text, such as titles and headings) and support for Hindi written in Latin letters (hi_Latn), also referred to as "[[Hinglish]]".
* ICU 70 added, among other things, support for [[emoji]] properties of strings, and can be built and used with [[C++20]] compilers ("ICU operator==() and operator!=() functions now return bool instead of UBool, as an adjustment for incompatible changes in C++20").<ref>{{Cite web |title=ICU - International Components for Unicode - ICU 70 |url=https://icu.unicode.org/download/70 |access-date=2023-01-24 |website=icu.unicode.org |language=en-US}}</ref> As of that version, the minimum supported Windows version is [[Windows 7]].
* ICU 67 handles the [[Brexit|withdrawal of the United Kingdom from the EU]].
* ICU 64.2 added support for Unicode 12.1, i.e. the single new symbol for the current Japanese [[Reiwa era]] (support for it has also been backported to older ICU versions, down to ICU 4.8.2).
* ICU 58 (with Unicode 9.0 support) is the last version to support older platforms such as [[Windows XP]] and [[Windows Vista]]. Support for [[IBM AIX|AIX]], [[Solaris (operating system)|Solaris]] and [[z/OS]] may also be limited in later versions (i.e. building depends on compiler support).<ref>{{Cite web|url=http://site.icu-project.org/download/64|title=Download ICU 64 - ICU - International Components for Unicode|website=site.icu-project.org|access-date=2019-10-20}}</ref>
<!--
On ICU 74: "For more details, including migration issues, see below.
[..]
* [for] amendment to China’s GB 18030-2022 standard.
* It also adds five specialty characters for use with ideographs, six new emoji, and a number of emoji variations.
* For several southeast Asian scripts, line breaking is now done around orthographic syllables.
* Unicode has improved its security-related specifications. There is a new Unicode Technical Standard, UTS #55 “Unicode Source Code Handling”, and there are related changes in other Unicode specifications.

* CLDR 44 (blog):
** CLDR has added or improved data for the following languages which are newly included in ICU:
*** Anii (blo), Swampy Cree (csw), Interlingue (ie), Kuvi (kxw), Ligurian (lij), Lombard (lmo), Low German (nds), N’Ko (nqo), Occitan (oc), Prussian (prg), Syriac (syr), Silesian (szl), Toki Pona (tok), Venetian (vec), Makhua (vmw), Kangri (xnr), Zhuang (za)
* [..]
** The person name formatting spec & data has been further developed.
* New TimeZone API for getting the “primary” IANA time zone ID, rather than the CLDR-canonical ID. (ICU-22452)
* New Normalizer2 factory method for Unicode NFKC_Simple_Casefold normalization. (ICU-22404)
Measurement unit data, conversions, and display names (translations) are improved. There are some new units.
Improved data for likely subtags.
* New spoof checker API for taking text direction (left-to-right vs. right-to-left) into account for confusability and generating skeletons. (ICU-22332)
* Time zone data (tzdata) version 2023c (2023-mar). Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.


* API changes since ICU4C 73 (Markdown) / (HTML)
* New C API wrappers for the Locale and LocaleBuilder classes (ULocale [ICU-22435] & ULocaleBuilder [ICU-22365])
* For BreakIterator, there is a technology preview for registering a sub-break engine (e.g., dictionary or machine learning based) via a new ExternalBreakEngine interface. (ICU-22342)
[..]
* API Changes since ICU4J 73
* ICU4J has switched from ant to Maven, and rearranged the source file tree to the Maven default.
[..]
* The draft PersonNameFormatter has been updated to match the improved CLDR person name formatting spec & data."
--><!--


CLDR 44 has a long list of Known Issues (while ICU has "none"), see full list at:
https://cldr.unicode.org/index/downloads/cldr-44#h.vax7o49mgyok

here partial:
* The region-based firstDay value (see weekData) is currently [..]
* Use 44.0.1 for CLDR 44 JSON NPM since 44.0.0 was tagged incorrectly.
* unicodeVersion in ldmlSupplemental.dtd was not updated to 15.1 See CLDR-17225


In CLDR 44, the focus is on:
1. Formatting Person Names. Added further enhancements (data and structure) for formatting people's names. For more information on why this feature is being added and what it does, see Background.
2. Emoji 15.1 Support. [..]
3. Unicode 15.1 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.
4. *Digitally disadvantaged language coverage*. Work began to improve DDL coverage, with the following DDL locales now having higher coverage levels:
* 1. Modern: Cherokee, Lower Sorbian, Upper Sorbian
* 2. Moderate: Anii, Interlingua, Kurdish, Māori, Venetian
* 3. Basic: Esperanto, Interlingue, Kangri, Kuvi, Kuvi (Devanagari), Kuvi (Odia), [..]


BCP47 Changes
* The Islamic calendar is now described as Hijri calendar in English, and may have also changed in other locales.

Supplemental Data Changes
* New locales were added, including en_ID and es_JP, plus many locales at a Basic level.
* Fixes
** There was a fix made for the Zanb script, which was mistakenly categorized as special instead of regular.
** There was a fix made to the BCP47 Latin↔︎ASCII transliterator ID
* Units
* The gasoline-energy-density unit (used in miles per gallon of gasoline equivalent (MPGe) for electric vehicles) and the pint-imperial (used in the UK), plus many Japanese traditional units were added.
* The unit of wind speed, Beaufort, was added for translation in locales where it is used.
Remaining SI units were added. Because these are primarily of use in scientific fields, they are not translated.
* A few traditional English units were added, such as chain and fortnight. These were not translated.
* [..]

JSON Data changes
* Available at: https://github.com/unicode-org/cldr-json/releases/tag/44.0.0 

Keyboard has a new DTD (keyboard3.dtd and the <keyboard3> element). [also notes on and known issues:]

* CLDR-17204 - there was an error in the DTD in the <locale> element. It is corrected in the linked PR. 
* CLDR-17205 - the keyboard charts are known to be broken. A fix is in progress.


Migrations
* Unit systems provide information about general usage of units of measure. For example, "knot" is in the customary US and UK systems, but is also acceptable for use with SI.
* [..]
* Preferred hour formats indicate the preferred form for a locale: 11 PM vs 23:00 vs 11 in the evening.
** Have changed substantially for many Latin American countries
* Keyboard has a new DTD (keyboard3.dtd and the <keyboard3> element). See the “Keyboard Changes” section.
* PersonNames: In the process of moving out of Tech Preview, there were structure additions but also changes:"
-->

==Origin and development==
After [[Taligent]] became part of [[IBM]] in early 1996, [[Sun Microsystems]] decided that the new Java language should have better support for internationalization. Since Taligent had experience with such technologies and were close geographically, their Text and International group were asked to contribute the international classes to the [[Java Development Kit]] as part of the [[JDK]] 1.1 internationalization [[API]]s.<ref>{{cite web |url=http://www.icu-project.org/docs/papers/history_of_java_internationalization.html |title=Getting Java ready for the world: A brief history of IBM and Sun's internationalization efforts |author=Laura Werner |year=1999 |access-date=2007-05-23 |archive-date=2021-11-17 |archive-url=https://web.archive.org/web/20211117041114/https://icu-project.org/docs/papers/history_of_java_internationalization.html |url-status=dead }}</ref> A large portion of this code still exists in the {{Javadoc:SE|package=java.text|java/text}} and {{Javadoc:SE|package=java.util|java/util}} packages. Further internationalization features were added with each later release of Java.

The Java internationalization classes were then ported to C++ and C<ref>{{Cite web|url=http://userguide.icu-project.org/intro|title=ICU User Guide|website=userguide.icu-project.org}}</ref> as part of a library known as ICU4C ("ICU for C"). The ICU project also provides ICU4J ("ICU for Java"), which adds features not present in the standard Java libraries. ICU4C and ICU4J are very similar, though not identical; for example, ICU4C includes a Regular Expression API, while ICU4J does not. Both frameworks have been enhanced over time to support new facilities and new features of Unicode and [[Common Locale Data Repository]] (CLDR).

ICU was released as an open-source project in 1999 under the name IBM Classes for Unicode. It was later renamed to International Components for Unicode.<ref>{{cite web |url=http://site.icu-project.org/projectinfo |title=ICU Project Management Committee |access-date=2012-08-17 |archive-date=2021-08-28 |archive-url=https://web.archive.org/web/20210828201302/http://site.icu-project.org/projectinfo |url-status=dead }}</ref> In May 2016, the ICU project joined the Unicode Consortium as technical committee ''ICU-TC'', and the library sources are now distributed under the Unicode license.<ref>{{cite web |url=http://blog.unicode.org/2016/05/icu-joins-unicode-consortium.html |title=ICU joins the Unicode Consortium |date=2016-05-16 |publisher=[[Unicode|Unicode, Inc.]] |access-date=2016-08-01}}</ref>

==MessageFormat==
A part of ICU is the '''MessageFormat''' class, a formatting system that allows for any number of arguments to control the plural form ({{code|plural}}, {{code|selectordinal}}) or more general [[switch-case]]-style selection ({{code|select}}) for things like [[grammatical gender]]. These statements can be nested.<ref name=icu-mf>{{cite web |title=Formatting Messages |url=http://userguide.icu-project.org/formatparse/messages |website=ICU User Guide}}</ref> ICU MessageFormat was created by adding the plural and selection system to an identically-named system in [[Java SE]].
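
As a rough illustration of the Java SE predecessor mentioned above, the following sketch uses the JDK's <code>java.text.MessageFormat</code> with a <code>choice</code> argument, which selects a sub-message by numeric range. It is not ICU code: ICU's <code>plural</code> and <code>select</code> arguments use keyword-based syntax instead (for example, a pattern along the lines of <code>{0, plural, one {# file} other {# files}}</code>) and require the ICU library. The class name <code>ChoiceMessageDemo</code> and the message text are made up for this example.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class ChoiceMessageDemo {

    // Java SE "choice" syntax: numeric limits pick one of several sub-messages.
    // "1<" means "greater than 1"; the chosen sub-message may itself contain arguments.
    static final String PATTERN =
        "{0,choice,0#No files|1#One file|1<{0,number,integer} files} found.";

    /** Formats the message for the given file count using English number formatting. */
    static String describe(int count) {
        MessageFormat format = new MessageFormat(PATTERN, Locale.ENGLISH);
        return format.format(new Object[] { count });
    }

    public static void main(String[] args) {
        for (int n : new int[] { 0, 1, 5 }) {
            System.out.println(describe(n));
        }
    }
}
```

Numeric ranges work for English but cannot express the plural rules of languages with more categories, which is the gap that keyword-based plural selection backed by locale data is meant to fill.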

==Alternatives==
For [[C++]], an alternative to using ICU directly is Boost.Locale, a C++ wrapper for ICU that also allows other backends.<ref>{{Cite web |title=Boost.Locale: Using Localization Backends |url=https://www.boost.org/doc/libs/1_54_0/libs/locale/doc/html/using_localization_backends.html |access-date=2022-05-24 |website=www.boost.org}}</ref> Its stated rationale for wrapping ICU rather than using it directly is that the ICU API "is absolutely unfriendly to C++ developers. It ignores popular C++ idioms (the STL, RTTI, exceptions, etc), instead mostly mimicking the Java API."<ref>{{Cite web |title=Boost.Locale: Design Rationale |url=https://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/rationale.html#why_icu |access-date=2022-05-24 |website=www.boost.org}}</ref><ref>{{Cite web |title=ICU vs Boost Locale in C++ |url=https://stackoverflow.com/questions/9494396/icu-vs-boost-locale-in-c |access-date=2022-05-24 |website=Stack Overflow |language=en}}</ref> Another claim, that ICU supports only UTF-16 (and should therefore be avoided), is no longer true, as ICU now also supports UTF-8 for C and C++.<ref name ="UTF-8">{{Cite web |title=UTF-8 |url=https://unicode-org.github.io/icu/userguide/strings/utf-8.html |access-date=2022-05-24 |website=ICU Documentation |language=en-US}}</ref>

==See also==
* [[Apple Advanced Typography]]
* [[Apple Type Services for Unicode Imaging]]
* [[gettext]]
* [[Graphite (smart font technology)]]
* [[NetRexx]] (ICU license)
* [[OpenType]]
* [[Pango]]
* [[Uconv]]
* [[Uniscribe]]

==References==
{{Reflist|30em}}

==External links==
* {{Official website}}
* [https://icu4c-demos.unicode.org/icu-bin/translit/ International Components for Unicode transliteration services]
* [https://devpal.co/icu-message-editor/ Online ICU editor]

{{Unicode navigation}}

[[Category:Unicode]]
[[Category:Component-based software engineering]]
[[Category:Digital typography]]
[[Category:Pattern matching]]
[[Category:Internationalization and localization]]
[[Category:Free computer libraries]]