== Issues ==

=== Character unification ===

==== Han unification ====
{{Main|Han unification}}

The [[Ideographic Research Group]] (IRG) is tasked with advising the Consortium and ISO regarding Han unification, or Unihan, especially the further addition of CJK unified and compatibility ideographs to the repertoire. The IRG is composed of experts from each region that has historically used [[Chinese characters]]. However, despite the deliberation within the committee, Han unification has consistently been one of the most contested aspects of ''The Unicode Standard'' since the genesis of the project.<ref>[http://tronweb.super-nova.co.jp/characcodehist.html A Brief History of Character Codes], Steven J. Searle, originally written [https://web.archive.org/web/20001216022100/http://tronweb.super-nova.co.jp/characcodehist.html 1999], last updated 2004</ref>

Existing character set standards such as the Japanese [[JIS X 0208]] (encoded by [[Shift JIS]]) defined unification criteria: rules for determining when a [[variant Chinese character]] is to be considered a handwriting/font difference (and thus unified), versus a spelling difference (to be encoded separately). Unicode's character model for CJK characters was based on the unification criteria used by JIS X 0208, as well as those developed by the Association for a Common Chinese Code in China.<ref name="tus-appe">{{cite web |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/appendix-e/ |title=Appendix E: Han Unification History |work=The Unicode Standard Version 16.0 – Core Specification |publisher=[[Unicode Consortium]] |date=2024}}</ref>

Due to the standard's principle of encoding semantic rather than stylistic variants, Unicode has received criticism for not assigning code points to certain rare and archaic [[kanji]] variants, possibly complicating the processing of ancient and uncommon Japanese names. Because it places particular emphasis on Chinese, Japanese and Korean sharing many characters in common, Han unification is also sometimes perceived as treating the three as the same thing.<ref name="dw2001">{{Cite web |last=Topping |first=Suzanne |date=2013-06-25 |title=The secret life of Unicode |website=[[IBM]] |url=https://www.ibm.com/developerworks/library/u-secret.html |access-date=20 March 2023 |archive-url=https://web.archive.org/web/20130625062705/http://www.ibm.com/developerworks/library/u-secret.html |archive-date=25 June 2013 }}</ref>

Regional differences in the expected forms of characters, in terms of typographical conventions and curricula for handwriting, do not always fall along language boundaries: although [[Hong Kong]] and [[Taiwan]] both write [[Chinese languages]] using [[Traditional Chinese]] characters, the preferred forms of characters differ between Hong Kong and Taiwan in some cases.<ref name="irgn2074">{{cite web |url=https://www.unicode.org/irg/docs/n2074-HKCS.pdf |id=[[ISO/IEC JTC 1|ISO/IEC JTC1]]/[[ISO/IEC JTC 1/SC 2|SC2]]/WG2/[[Ideographic Research Group|IRG]] N2074 |last=Lu |first=Qin |title=The Proposed Hong Kong Character Set |date=2015-06-08}}</ref>

Less-frequently-used alternative encodings exist, often predating Unicode, with character models differing from this paradigm, aimed at preserving the various stylistic differences between regional and/or nonstandard character forms. One example is the [[TRON (encoding)|TRON Code]] favored by some users for handling historical Japanese text, though not widely adopted among the Japanese public. Another is the [[CCCII]] encoding adopted by library systems in [[Hong Kong]], [[Taiwan]] and the [[United States]]. These have their own drawbacks in general use, leading to the [[Big5]] encoding (introduced in 1984, four years after CCCII) having become more common than CCCII outside of library systems.<ref name="hanazono">{{cite web |url=http://kura.hanazono.ac.jp/paper/codes.html |archive-url=https://web.archive.org/web/20041012135645/http://kura.hanazono.ac.jp/paper/codes.html |archive-date=2004-10-12 |url-status=dead |title=Chinese character codes: an update |first=Christian |last=Wittern |date=1995-05-01 |publisher=International Research Institute for Zen Buddhism / [[Hanazono University]]}}</ref> Although work at [[Apple Computer|Apple]] based on [[Research Libraries Group]]'s CJK Thesaurus, which was used to maintain the EACC variant of CCCII, was one of the direct predecessors of Unicode's [[Unihan]] set, Unicode ultimately adopted the JIS-style unification model.<ref name="tus-appe"/>

The earliest version of Unicode had a repertoire of fewer than 21,000 Han characters, largely limited to those in relatively common modern usage. As of version 16.0, the standard encodes more than 97,000 Han characters, and work is continuing to add thousands more, largely historical and dialectal variant characters used throughout the [[Sinosphere]].

Modern typefaces provide a means to address some of the practical issues in depicting unified Han characters with various regional graphical representations. The 'locl' [[OpenType]] table allows a renderer to select a different glyph for each code point based on the text locale.<ref>{{Cite web |date=18 February 2023 |title=Noto CJK fonts |url=https://github.com/notofonts/noto-cjk/blob/main/Serif/README.md |publisher=Noto Fonts |quote=Select this deployment format if your system supports variable fonts and you prefer to use only one language, but also want full character coverage or the ability to language-tag text to use glyphs that are appropriate for the other languages (this requires an app that supports language tagging and the OpenType 'locl' GSUB feature).}}</ref> [[Variation Selectors|Unicode variation sequences]] can also provide in-text annotations for a desired glyph selection; this requires registration of the specific variant in the [[Ideographic Variation Database]].
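The mechanics of a variation sequence can be shown in a few lines of Python. This is a minimal sketch: the pairing shown (U+9089 followed by VS17) is illustrative, and whether a distinct glyph is actually displayed depends on font support and on the sequence's registration in the Ideographic Variation Database.

<syntaxhighlight lang="python">
# An ideographic variation sequence (IVS) is a base ideograph followed by
# one of the variation selectors VS17-VS256 (U+E0100..U+E01EF).
base = "\u9089"         # 邉, a CJK unified ideograph
vs17 = "\U000E0100"     # VARIATION SELECTOR-17
sequence = base + vs17  # requests a registered variant glyph, if the font has one

print(len(sequence))                    # 2 -- the selector is its own code point
print([hex(ord(c)) for c in sequence])  # ['0x9089', '0xe0100']
</syntaxhighlight>

A renderer without support for the sequence simply ignores the selector and displays the default glyph, so the annotation degrades gracefully.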
==== Italic or cursive characters in Cyrillic ====
[[File:Cyrillic cursive.svg|class=skin-invert-image|thumb|right|Various [[Cyrillic]] characters shown with upright, oblique, and italic alternate forms]]

If the appropriate glyphs for characters in the same script differ only in italic style, Unicode has generally unified them, as can be seen in the comparison at right among a set of seven characters' italic glyphs as they typically appear in Russian, traditional Bulgarian, Macedonian, and Serbian texts. This means that the differences must be displayed through smart font technology or by manually changing fonts; the same OpenType 'locl' technique is used.<ref>{{Cite web |last=Preuss |first=Ingo |title=OpenType Feature: locl – Localized Forms |url=https://www.preusstype.com/techdata/otf_locl.php |website=preusstype.com |language=en}}</ref>

==== Localised case pairs ====

For use in the [[Turkish alphabet]] and [[Azeri alphabet]], Unicode includes a separate [[dotless I|dotless lowercase {{serif|I}}]] (ı) and a [[İ|dotted uppercase {{serif|I}}]] ({{serif|İ}}). However, the usual ASCII letters are used for the lowercase dotted {{serif|I}} and the uppercase dotless {{serif|I}}, matching how they are handled in the earlier [[ISO 8859-9]]. As such, case-insensitive comparisons for those languages have to use different rules than case-insensitive comparisons for other languages using the Latin script.<ref>{{cite web |url=https://unicode.org/Public/UNIDATA/CaseFolding.txt |work=Unicode Character Database |title=Case Folding Properties |institution=[[Unicode Consortium]] |date=2023-05-12}}</ref><ref name="microsoft-case-insensitive-locale">{{cite web |url=https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-options#compare-using-the-invariant-culture |title=Regular expression options § Compare using the invariant culture |work=[[.NET]] fundamentals documentation |publisher=[[Microsoft]] |date=2023-05-12}}</ref> This can have security implications if, for example, [[Code injection#Preventing Code Injection|sanitization]] code or [[access control]] relies on case-insensitive comparison.<ref name="microsoft-case-insensitive-locale"/>
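The practical consequence is easy to reproduce. The following minimal sketch uses Python's built-in string methods, which implement only the default, locale-independent Unicode case mappings, so the Turkish-specific rules are not applied:

<syntaxhighlight lang="python">
print("I".lower())   # 'i'  -- Turkish expects dotless 'ı'
print("ı".upper())   # 'I'  -- this direction happens to match Turkish
print("İ".lower())   # 'i̇'  -- U+0069 U+0307: 'i' plus COMBINING DOT ABOVE

# A naive case-insensitive comparison therefore misbehaves for Turkish text:
print("ı".upper().lower() == "ı")   # False -- the round trip is not stable
</syntaxhighlight>

Locale-aware libraries (for example, ICU) provide the Turkish mappings as an explicit option.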
By contrast, the [[ð|Icelandic eth (ð)]], the [[đ|barred D (đ)]] and the [[ɖ|retroflex D (ɖ)]], which usually{{efn|Rarely, the uppercase Icelandic eth may instead be written in an [[insular script|insular]] style (Ꝺ) with the crossbar positioned on the stem, particularly if it needs to be distinguished from the uppercase retroflex D (see [[African Reference Alphabet]]).|group=note}} look the same in uppercase (Đ), are given the opposite treatment and encoded separately in both letter-cases (in contrast to the earlier [[ISO 6937]], which unifies the uppercase forms). Although this approach allows for case-insensitive comparison without needing to know the language of the text, it has its own issues, requiring security measures against [[homoglyph]] attacks.<ref>{{cite web |url=https://unicode.org/Public/security/latest/confusablesSummary.txt |title=confusablesSummary.txt |work=Unicode Security Mechanisms for UTS #39 |date=2023-08-11 |institution=[[Unicode Consortium]]}}</ref>

==== Diacritics on lowercase {{serif|I}} ====
[[File:I acute - soft dotted and Lithuanian dot.svg|class=skin-invert-image|thumb|right|Localised forms of the letter í ({{serif|I}} with [[acute accent]])]]

Whether the lowercase letter {{serif|I}} is expected to retain its [[tittle]] when a diacritic applies also depends on local conventions.

=== Security<span class="anchor" id="Security issues"></span> ===

Unicode has a large number of [[homoglyphs]], many of which look very similar or identical to ASCII letters. Substitution of these can make an identifier or URL that looks correct but directs to a different location than expected.<ref>{{Cite web |title=UTR #36: Unicode Security Considerations |url=https://unicode.org/reports/tr36/ |website=Unicode}}</ref> Homoglyphs can also be used to manipulate the output of [[NLP (computer science)|natural language processing (NLP)]] systems.<ref>{{Cite book |last1=Boucher |first1=Nicholas |last2=Shumailov |first2=Ilia |last3=Anderson |first3=Ross |last4=Papernot |first4=Nicolas |title=2022 IEEE Symposium on Security and Privacy (SP) |chapter=Bad Characters: Imperceptible NLP Attacks |year=2022 |chapter-url=https://ieeexplore.ieee.org/document/9833641 |location=San Francisco, CA, US |publisher=IEEE |pages=1987–2004 |arxiv=2106.09898 |doi=10.1109/SP46214.2022.9833641 |isbn=978-1-66541-316-9 |s2cid=235485405}}</ref> Mitigation requires disallowing these characters, displaying them differently, or requiring that they resolve to the same identifier;<ref>{{Cite web |last=Engineering |first=Spotify |date=2013-06-18 |title=Creative usernames and Spotify account hijacking |url=https://engineering.atspotify.com/2013/06/creative-usernames/ |access-date=2023-04-15 |website=Spotify Engineering |language=en-US}}</ref> all of this is complicated by the huge and constantly changing set of characters.<ref>{{cite tech report |last=Wheeler |first=David A. |title=Initial Analysis of Underhanded Source Code |year=2020 |jstor=resrep25332.7 |url=http://www.jstor.org/stable/resrep25332.7 |page=4–1–4–10}}</ref><ref>{{Cite web |title=UTR #36: Unicode Security Considerations |url=https://unicode.org/reports/tr36/ |access-date=27 June 2022 |website=Unicode}}</ref>
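A common heuristic for the first of these mitigations is to flag identifiers that mix scripts. The following minimal Python sketch approximates each character's script from its Unicode name; a production implementation would instead use the Script property and the confusables data of UTS #39:

<syntaxhighlight lang="python">
import unicodedata

def scripts(s: str) -> set[str]:
    # Rough script label: the first word of each letter's Unicode name.
    return {unicodedata.name(c).split()[0] for c in s if c.isalpha()}

spoof = "\u0440aypal"     # first letter is CYRILLIC SMALL LETTER ER, not 'p'
print(scripts("paypal"))  # {'LATIN'}
print(scripts(spoof))     # {'CYRILLIC', 'LATIN'} -- mixed scripts, suspicious
</syntaxhighlight>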
In 2021, two researchers, one from the [[University of Cambridge]] and the other from the [[University of Edinburgh]], published a security advisory asserting that [[Bidirectional text#Explicit formatting|BiDi marks]] can be used to make large sections of code do something different from what they appear to do. The problem was named "[[Trojan Source]]".<ref>{{Cite web |first1=Nicholas |last1=Boucher |first2=Ross |last2=Anderson |title=Trojan Source: Invisible Vulnerabilities |url=https://www.trojansource.codes/trojan-source.pdf |access-date=2 November 2021}}</ref> In response, code editors started highlighting such marks to indicate forced text-direction changes.<ref>{{Cite web |title=Visual Studio Code October 2021 |url=https://code.visualstudio.com/updates/v1_62#_unicode-directional-formatting-characters |access-date=11 November 2021 |website=code.visualstudio.com |language=en}}</ref>
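The mitigation that editors adopted can be sketched as a scan for the explicit bidirectional formatting characters; the string scanned below is illustrative rather than taken from a real exploit:

<syntaxhighlight lang="python">
import unicodedata

# The nine explicit BiDi formatting characters: LRE, RLE, PDF, LRO, RLO
# (U+202A..U+202E) and LRI, RLI, FSI, PDI (U+2066..U+2069).
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

def find_bidi_controls(source: str):
    return [(i, unicodedata.name(ch))
            for i, ch in enumerate(source) if ch in BIDI_CONTROLS]

line = 'access = "user\u202e \u2066// check if admin\u2069 \u2066"'
print(find_bidi_controls(line))
# [(14, 'RIGHT-TO-LEFT OVERRIDE'), (16, 'LEFT-TO-RIGHT ISOLATE'), ...]
</syntaxhighlight>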
The [[UTF-8]] and [[UTF-16]] encodings do not accept all possible sequences of code units. Implementations vary in what they do when reading an invalid sequence, which has led to security bugs.<ref>{{Cite web |first1=Dominique |last1=Dittert |title=From Unicode to Exploit: The Security Risks of Overlong UTF-8 Encodings |date=6 September 2024 |url=https://herolab.usd.de/en/the-security-risks-of-overlong-utf-8-encodings/ |access-date=26 December 2024}}</ref><ref>{{Cite web |first1=Kevin |last1=Boone |title=UTF-8 and the problem of over-long characters |url=https://kevinboone.me/overlong.html |access-date=26 December 2024}}</ref>
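For example, the two-byte sequence 0xC0 0xAF is an "overlong" encoding of the slash (U+002F), which a conforming UTF-8 decoder must reject; decoders that accepted such forms have historically let attackers smuggle characters past sanitization checks. A minimal Python sketch:

<syntaxhighlight lang="python">
overlong_slash = b"\xc0\xaf"  # overlong two-byte form of '/' (U+002F)

try:
    overlong_slash.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte

# Lenient handling substitutes U+FFFD for each bad byte instead of failing:
print(overlong_slash.decode("utf-8", errors="replace"))  # '��'
</syntaxhighlight>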
=== Mapping to legacy character sets ===

Unicode was designed to provide code-point-by-code-point [[round-trip format conversion]] to and from any preexisting character encodings, so that text files in older character sets can be converted to Unicode and then back without loss, and without employing context-dependent interpretation. This has meant that inconsistent legacy architectures, such as [[combining character|combining diacritics]] and [[precomposed character]]s, both exist in Unicode, giving more than one method of representing some text. This is most pronounced in the three different encoding forms for Korean [[Hangul]]. Since version 3.0, no precomposed character that can be represented by a sequence of already existing characters may be added to the standard, in order to preserve interoperability between software using different versions of Unicode.

[[Injective]] mappings must be provided between characters in existing legacy character sets and characters in Unicode to facilitate conversion to Unicode and allow interoperability with legacy software. Lack of consistency in various mappings between earlier Japanese encodings such as [[Shift-JIS]] or [[EUC-JP]] and Unicode led to [[round-trip format conversion]] mismatches, particularly the mapping of the character JIS X 0208 '~' (1-33, WAVE DASH), heavily used in legacy database data, to either {{unichar|FF5E|FULLWIDTH TILDE}} (in [[Microsoft Windows]]) or {{unichar|301C|WAVE DASH}} (other vendors).<ref>[http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2166.doc AFII contribution about WAVE DASH], {{Cite web |date=22 April 2011 |title=An Unicode vendor-specific character table for japanese |url=http://www.ingrid.org/java/i18n/unicode.html |archive-url=https://web.archive.org/web/20110422181018/http://www.ingrid.org/java/i18n/unicode.html |archive-date=22 April 2011 |access-date=2019-05-20 }}</ref>

Some Japanese computer programmers objected to Unicode because it requires them to separate the use of {{unichar|005C|REVERSE SOLIDUS|note=backslash}} and {{unichar|00A5|YEN SIGN}}, which was mapped to 0x5C in JIS X 0201, and a lot of legacy code exists with this usage.<ref>[https://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html#s-646problem ''ISO 646-* Problem''], Section 4.4.3.5 of ''Introduction to I18n'', Tomohiro Kubota, 2001</ref> (That encoding also replaces tilde '~' 0x7E with macron '¯', now 0xAF.) The separation of these characters has existed in [[ISO 8859-1]] since long before Unicode.
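Python ships codecs for both conventions, so the wave-dash divergence can be observed directly; this is a minimal sketch rather than a full conversion workflow:

<syntaxhighlight lang="python">
raw = b"\x81\x60"  # Shift JIS byte pair for JIS X 0208 1-33 ('~', WAVE DASH)

# The JIS-style codec and Microsoft's cp932 variant disagree:
print(hex(ord(raw.decode("shift_jis"))))  # 0x301c -- WAVE DASH
print(hex(ord(raw.decode("cp932"))))      # 0xff5e -- FULLWIDTH TILDE

# Text decoded under one convention may not round-trip under the other:
print("\u301c".encode("cp932", errors="replace"))  # b'?'
</syntaxhighlight>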
=== Indic scripts ===
{{further|Tamil All Character Encoding}}

[[Indic script]]s such as [[Tamil script|Tamil]] and [[Devanagari]] are each allocated only 128 code points, matching the [[ISCII]] standard. The correct rendering of Unicode Indic text requires transforming the stored logical-order characters into visual order and forming ligatures (also known as conjuncts) out of components. Some local scholars argued in favor of assigning Unicode code points to these ligatures, going against the practice for other writing systems, though Unicode contains some Arabic and other ligatures for backward-compatibility purposes only.<ref>{{Cite web |title=Arabic Presentation Forms-A |url=https://www.unicode.org/charts/PDF/UFB50.pdf |access-date=20 March 2010}}</ref><ref>{{Cite web |title=Arabic Presentation Forms-B |url=https://www.unicode.org/charts/PDF/UFE70.pdf |access-date=20 March 2010}}</ref><ref>{{Cite web |title=Alphabetic Presentation Forms |url=https://www.unicode.org/charts/PDF/UFB00.pdf |access-date=20 March 2010}}</ref> No new ligatures will be encoded, in part because the set of ligatures is font-dependent, and Unicode is an encoding independent of font variations. The same kind of issue arose for the [[Tibetan script]] in 2003 when the [[Standardization Administration of China]] proposed encoding 956 precomposed Tibetan syllables,<ref>{{Cite web |date=2 December 2002 |title=Proposal on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP |url=https://www.unicode.org/L2/L2002/02455-n2558-tibetan.pdf}}</ref> but these were rejected for encoding by the relevant ISO committee ([[ISO/IEC JTC 1/SC 2]]).<ref>{{Cite web |first1=V. S. |last1=Umamaheswaran |date=7 November 2003 |title=Resolutions of WG 2 meeting 44 |url=https://www.unicode.org/L2/L2003/03390r-n2654.pdf |at=Resolution M44.20}}</ref>

[[Thai alphabet]] support has been criticized for its ordering of Thai characters. The vowels เ, แ, โ, ใ, ไ that are written to the left of the preceding consonant are in visual order instead of phonetic order, unlike the Unicode representations of other Indic scripts. This complication is due to Unicode inheriting the [[TIS-620|Thai Industrial Standard 620]], which worked in the same way and was the way in which Thai had always been written on keyboards. This ordering problem complicates the Unicode collation process slightly, requiring table lookups to reorder Thai characters for collation.<ref name="dw2001" /> Even if Unicode had adopted encoding according to spoken order, it would still be problematic to collate words in dictionary order. For example, the word {{Wikt-lang|th|แสดง}} {{IPA|th|sa dɛːŋ|}} "perform" starts with a consonant cluster "สด" (with an inherent vowel for the consonant "ส"); the vowel แ- would come after the ด in spoken order, but in a dictionary, the word is collated as it is written, with the vowel following the ส.

=== Combining characters ===
{{Main|Combining character}}
{{See also|Unicode normalization#Normalization}}

Characters with diacritical marks can generally be represented either as a single precomposed character or as a decomposed sequence of a base letter plus one or more non-spacing marks. For example, ḗ (precomposed e with macron and acute above) and ḗ (e followed by the combining macron above and combining acute above) should be rendered identically, both appearing as an [[e]] with a [[Macron (diacritic)|macron]] (◌̄) and [[acute accent]] (◌́), but in practice, their appearance may vary depending upon what rendering engine and fonts are being used to display the characters. Similarly, [[dot (diacritic)|underdots]], as needed in the [[romanization]] of [[Indo-Aryan languages|Indic languages]], will often be placed incorrectly.{{Citation needed|date=July 2011}} Unicode characters that map to precomposed glyphs can be used in many cases, thus avoiding the problem, but where no precomposed character has been encoded, the problem can often be solved by using a specialist Unicode font such as [[Charis SIL]] that uses [[Graphite (SIL)|Graphite]], [[OpenType]] ('gsub'), or [[Apple Advanced Typography|AAT]] technologies for advanced rendering features.
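The canonical equivalence of the two spellings, and the way normalization resolves it, can be demonstrated with Python's unicodedata module:

<syntaxhighlight lang="python">
import unicodedata

precomposed = "\u1e17"         # ḗ, LATIN SMALL LETTER E WITH MACRON AND ACUTE
decomposed  = "e\u0304\u0301"  # e + COMBINING MACRON + COMBINING ACUTE ACCENT

print(precomposed == decomposed)  # False -- different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
</syntaxhighlight>

Comparing normalized forms, rather than raw code point sequences, is the standard way for software to treat the two spellings as equal.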
=== Anomalies ===
{{Main|Unicode alias names and abbreviations}}

''The Unicode Standard'' has imposed rules intended to guarantee stability.<ref>{{Cite web |url=https://www.unicode.org/policies/stability_policy.html |title=Character Encoding Stability |website=Unicode |url-status=live |archive-url=https://web.archive.org/web/20240101053402/https://www.unicode.org/policies/stability_policy.html |archive-date=Jan 1, 2024 }}</ref> Depending on the strictness of a rule, a change can be prohibited or allowed. For example, a "name" given to a code point cannot and will not change, but a "script" property is more flexible, by Unicode's own rules. In version 2.0, Unicode changed many code point "names" from version 1. At the same time, Unicode stated that, thenceforth, an assigned name to a code point would never change. This implies that when mistakes are published, these mistakes cannot be corrected, even if they are trivial (as happened in one instance with the spelling {{sc2|{{typo|BRAKCET}}}} for {{sc2|BRACKET}} in a character name). In 2006 a list of anomalies in character names was first published, and, as of June 2021, there were 104 characters with identified issues,<ref name="tn27">{{Cite web |date=14 June 2021 |title=Unicode Technical Note #27: Known Anomalies in Unicode Character Names |url=https://unicode.org/notes/tn27/ |website=Unicode}}</ref> for example:
* {{unichar|034F|COMBINING GRAPHEME JOINER|nlink=Combining grapheme joiner}}: Does not join graphemes.<ref name="tn27" />
* {{unichar|2118|script capital p|nlink=Weierstrass p}}: This is a small letter. The capital is {{unichar|1D4AB|MATHEMATICAL SCRIPT CAPITAL P}}.<ref>{{Cite web |url=https://www.unicode.org/charts/PDF/U2100.pdf |title=Unicode chart: "actually this has the form of a lowercase calligraphic p, despite its name"}}</ref>
* {{unichar|A015|YI SYLLABLE WU|nlink=Yi language}}: This is not a Yi syllable, but a Yi iteration mark.
* {{unichar|FE18|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR {{typo|BRAKCET}}}}: ''bracket'' is spelled incorrectly.<ref>{{Cite web |url=https://www.unicode.org/charts/PDF/UFE10.pdf |title=Misspelling of BRACKET in character name is a known defect}}</ref> (Spelling errors are resolved by using [[Unicode alias names and abbreviations|Unicode alias names]], as illustrated below.)

While Unicode defines the script designator (name) to be "{{tt|[[ʼPhags-pa script|Phags_Pa]]}}", in that script's character names a hyphen is added: {{Unichar|A840|PHAGS-PA LETTER KA}}.<ref name=USA24>{{Cite web |year=2021 |title=Unicode Standard Annex #24: Unicode Script Property |url=https://www.unicode.org/reports/tr24/ |access-date=29 April 2022 |publisher=The Unicode Consortium |at=2.2 Relation to ISO 15924 Codes}}</ref><ref>{{Cite web |year=2023 |title=Scripts-15.1.0.txt |url=https://www.unicode.org/Public/UNIDATA/Scripts.txt |access-date=12 September 2023 |publisher=The Unicode Consortium}}</ref> This, however, is not an anomaly, but the rule: hyphens are replaced by underscores in script designators.<ref name=USA24 />
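Both the frozen, misspelled name and its corrected formal alias are visible through Python's unicodedata module (alias lookup requires Python 3.3 or later):

<syntaxhighlight lang="python">
import unicodedata

ch = "\ufe18"
print(unicodedata.name(ch))
# PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET

# The corrected alias from NameAliases.txt resolves to the same character:
alias = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
print(unicodedata.lookup(alias) == ch)  # True
</syntaxhighlight>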