===Merger of all equivalent characters===

There has not been any push for full semantic unification of all semantically linked characters, though the idea would treat all users of East Asian languages alike, whether they write in Korean, Simplified Chinese, Traditional Chinese, [[Kyūjitai]] Japanese, [[Shinjitai]] Japanese or Vietnamese. Instead of some variants receiving distinct code points while other groups of variants share a single code point, all variants would be reliably expressible only with metadata tags (e.g., CSS formatting in webpages). The burden would fall on everyone who uses differing versions of {{Lang|zh|直}}, {{Lang|zh|別}}, {{Lang|zh|兩}} or {{Lang|zh|兔}}, whether that difference arises from simplification, international variance or intra-national variance.

However, some platforms (e.g., smartphones) may ship with only one pre-installed font. The system font must choose a default glyph for each code point, and these glyphs can differ greatly, indicating different underlying graphemes. Relying on language markup across the board is therefore beset with two major issues. First, there are contexts where language markup is not available (code commits, plain text). Second, any solution would require every operating system to come pre-installed with many glyphs for semantically identical characters that have many variants. In addition to the standard character sets of Simplified Chinese, Traditional Chinese, Korean, Vietnamese, Kyūjitai Japanese and Shinjitai Japanese, there also exist "ancient" forms of characters that are of interest to historians, linguists and philologists.

Unicode's Unihan database already catalogs the connections between variant characters that have distinct code points. For characters that share a code point, however, the reference glyph image is usually biased toward the Traditional Chinese version.
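The first issue, contexts without language markup, can be illustrated directly: in plain text, a unified code point carries no information about which regional glyph is intended. A minimal sketch in Python (the character and byte values below are standard Unicode/UTF-8 facts; the variable names are illustrative only):

```python
# U+76F4 直 is a single code point shared by Chinese and Japanese text.
# A plain-text encoding carries no language tag that could select
# between the visually distinct regional glyph variants.
zh_text = "直"  # intended to be rendered with a Chinese glyph
ja_text = "直"  # intended to be rendered with a Japanese glyph

assert zh_text == ja_text  # the strings are indistinguishable
assert zh_text.encode("utf-8") == b"\xe7\x9b\xb4"  # identical bytes
print(hex(ord(zh_text)))  # 0x76f4
```

Only out-of-band metadata (an HTML `lang` attribute, a font choice) can recover the intended variant, which is exactly what is unavailable in code commits or plain-text files.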
Also, the decision of whether to classify pairs as semantic variants or [[z-variant]]s is not always consistent or clear, despite rationalizations in the handbook.<ref name="uax38">{{cite web|url=https://www.unicode.org/reports/tr38/|title=UAX #38: Unicode Han Database (Unihan)|website=www.unicode.org}}</ref> Unicode gives the so-called semantic variants {{Lang|zh-Hant|丟}} (U+4E1F) and {{Lang|zh-Hans|丢}} (U+4E22) as an example of a pair that differs significantly in abstract shape, while it lists {{Lang|zh|佛}} and {{Lang|ja|仏}} as z-variants, differing only in font styling. Paradoxically, Unicode considers {{Lang|zh-Hant|兩}} and {{Lang|ja|両}} to be near-identical z-variants while at the same time classifying them as significantly different semantic variants. Some pairs of characters are simultaneously semantic variants, specialized semantic variants and simplified variants: {{Lang|zh-Hant|個}} (U+500B) and {{Lang|zh-Hans|个}} (U+4E2A).

There are also cases of non-mutual equivalence. The Unihan database entry for {{Lang|ja|亀}} (U+4E80) lists {{Lang|zh-Hant|龜}} (U+9F9C) as its z-variant, but the entry for {{Lang|zh-Hant|龜}} does not list {{Lang|ja|亀}} as a z-variant, even though {{Lang|zh-Hant|龜}} was obviously already in the database when the entry for {{Lang|ja|亀}} was written.

Some clerical errors led to the doubling of completely identical characters, such as {{Lang|zh|﨣}} (U+FA23) and {{Lang|zh|𧺯}} (U+27EAF). If a font has glyphs encoded at both code points, so that one font serves both, they should appear identical. These cases are listed as z-variants despite having no variance at all. Other characters were duplicated intentionally, to facilitate [[Round-trip format conversion|bit-for-bit round-trip conversion]] with pre-existing national standards. Because round-trip conversion was an early selling point of Unicode, a character unnecessarily duplicated in a national standard had to be duplicated in Unicode as well.
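The practical consequence of z-variant status can be checked directly: z-variants, even the accidentally identical pair U+FA23 and U+27EAF, are not unified by any Unicode normalization form. A small illustrative check using only Python's standard library:

```python
import unicodedata

# U+FA23 and U+27EAF are listed as z-variants of each other, yet no
# normalization form (canonical or compatibility) maps one to the
# other: z-variant status carries no weight in normalization.
a, b = "\uFA23", "\U00027EAF"
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, a) == a
    assert unicodedata.normalize(form, b) == b
assert a != b  # the two code points remain distinct in every form
```

Unifying such pairs in text therefore requires action outside the normalization machinery, e.g., an application-level mapping table.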
Unicode calls these intentional duplications "[[Unicode compatibility characters|compatibility variants]]": 漢 (U+FA9A), for example, lists {{Lang|zh|漢}} (U+6F22) as its compatibility variant. As long as an application uses the same font for both, they should appear identical. Sometimes, as in the case of {{Lang|zh|車}} at U+8ECA and U+F902, the added compatibility character lists the pre-existing version of {{Lang|zh|車}} as both its compatibility variant and its z-variant. The compatibility variant field overrides the z-variant field, forcing normalization under all forms, including canonical equivalence. Despite the name, these "compatibility variants" are canonically equivalent characters, not compatibility characters, and are united under every Unicode normalization scheme, not only under compatibility normalization. This is similar to how {{unichar|212B|ANGSTROM SIGN}} is canonically equivalent to the precomposed {{unichar|00C5|LATIN CAPITAL LETTER A WITH RING ABOVE}}. Much software (such as the MediaWiki software that hosts Wikipedia) replaces discouraged canonically equivalent characters (e.g. the angstrom sign) with the recommended equivalent.

漢 (U+FA9A) was added to the database later than {{Lang|zh|漢}} (U+6F22), and its entry informs the user of the compatibility information; the entry for {{Lang|zh|漢}} (U+6F22), on the other hand, does not list this equivalence. Unicode requires that entries, once admitted, never change their compatibility or equivalence, so that normalization rules for already existing characters do not change.

Some pairs of Traditional and Simplified characters are also considered semantic variants. By Unicode's definitions, it makes sense that all simplifications (other than those that merge wholly different characters for their homophony) are a form of semantic variant.
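This canonical equivalence is observable with any Unicode-aware normalizer. For instance, Python's standard `unicodedata` module replaces the compatibility ideographs even under the canonical forms NFC and NFD:

```python
import unicodedata

# CJK "compatibility variants" are canonically equivalent, so even the
# canonical forms NFC/NFD replace them with the unified code points.
assert unicodedata.normalize("NFC", "\uFA9A") == "\u6F22"  # 漢 → 漢
assert unicodedata.normalize("NFD", "\uF902") == "\u8ECA"  # 車 → 車
# The same mechanism unifies the angstrom sign with precomposed Å:
assert unicodedata.normalize("NFC", "\u212B") == "\u00C5"
```

This is why a compatibility ideograph typed into MediaWiki does not survive saving: the software normalizes the text, and normalization erases the distinction.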
Unicode classifies {{Lang|zh-Hant|丟}} and {{Lang|zh-Hans|丢}} as each other's respective traditional and simplified variants and also as each other's semantic variants. However, while Unicode classifies {{Lang|zh-Hant|億}} (U+5104) and {{Lang|zh-Hans|亿}} (U+4EBF) as each other's respective traditional and simplified variants, it does not consider them semantic variants of each other.

Unicode claims that "Ideally, there would be no pairs of z-variants in the Unicode Standard."<ref name="uax38"/> This would suggest that the goal is at least to unify all minor variants, compatibility redundancies and accidental redundancies, leaving the differentiation to fonts and to language tags. This conflicts with Unicode's stated goal of removing that overhead and allowing any number of the world's scripts in the same document with one encoding system.{{synthesis inline|date=September 2018}} Chapter One of the handbook states that "With Unicode, the information technology industry has replaced proliferating character sets with data stability, global interoperability and data interchange, simplified software, and reduced development costs. While taking the ASCII character set as its starting point, the Unicode Standard goes far beyond ASCII's limited ability to encode only the upper- and lowercase letters A through Z. It provides the capacity to encode all characters used for the written languages of the world – more than 1 million characters can be encoded. No escape sequence or control code is required to specify any character in any language. The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility."<ref name="tusch01"/>

This leaves the option of settling on one unified reference grapheme for all z-variants, which is contentious, since few outside Japan would recognize {{Lang|ja|佛}} and {{Lang|ja|仏}} as equivalent. Even within Japan, the variants stand on different sides of a major simplification, Shinjitai. By comparison, Unicode would effectively make the PRC's simplification of {{Lang|zh-Hant|侶}} (U+4FB6) to {{Lang|zh-Hans|侣}} (U+4FA3) a monumental difference. Such a plan would also eliminate the very visually distinct variations for characters like {{Lang|zh|直}} (U+76F4) and {{Lang|zh|雇}} (U+96C7).

One would expect all simplified characters also to be z-variants or semantic variants of their traditional counterparts, but many are neither. The seemingly strange case of pairs that are simultaneously semantic variants and specialized semantic variants is easier to explain given Unicode's definition that specialized semantic variants share the same meaning only in certain contexts. Languages use them differently: a pair whose characters are full drop-in replacements for each other in Japanese may not be so flexible in Chinese. Thus, any comprehensive merger of recommended code points would have to maintain some variants that differ only slightly in appearance, even if the meaning is identical in all contexts in one language, because in another language the two characters may not be full drop-in replacements.