Han unification
===Graphemes versus glyphs===
{{main|Allograph}}
[[File:LatinAgraphemeVariations.svg|thumb|right|The Latin lowercase "[[a]]" has widely differing glyphs that all represent concrete instances of the same abstract grapheme. Although a native reader of any language using the Latin script recognizes these two glyphs as the same grapheme, to others they might appear to be completely unrelated.]]
A grapheme is the smallest abstract unit of meaning in a writing system. Any grapheme has many possible glyph expressions, but all are recognized as the same grapheme by those with reading and writing knowledge of a particular writing system. Although Unicode typically assigns characters to code points to express the graphemes within a system of writing, the Unicode Standard ([https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G2212 section 3.4 D7]) cautions:
{{blockquote | An abstract character does not necessarily correspond to what a user thinks of as a "character" and should not be confused with a ''grapheme''. |source= [https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G2212 The Unicode® Standard Version 16.0 – Core Specification §3.4 Characters and Encoding]}}
However, this quote refers to the fact that some graphemes are composed of several graphic elements or "characters". For example, the character {{unichar|0061|LATIN SMALL LETTER A}} combined with {{unichar|030A|COMBINING RING ABOVE|cwith=◌}} (generating the combination "å") might be understood by a user as a single grapheme while being composed of multiple Unicode abstract characters. In addition, Unicode assigns code points to a small number of formatting characters, whitespace characters, and other abstract characters (apart from those included for compatibility reasons) that are not graphemes but are instead used to control the breaks between lines, words, graphemes and grapheme clusters.
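The "å" example can be checked with a short script. The following is a minimal sketch using Python's standard <code>unicodedata</code> module; the variable names are illustrative only. It shows that the sequence of two abstract characters and the single precomposed character U+00E5 are canonically equivalent, so a reader sees one grapheme in both cases even though the number of code points differs.

```python
import unicodedata

# Two abstract characters: LATIN SMALL LETTER A + COMBINING RING ABOVE
decomposed = "\u0061\u030A"
# One abstract character: LATIN SMALL LETTER A WITH RING ABOVE
precomposed = "\u00E5"

# List the code points and names that make up the decomposed sequence
for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Canonical normalization converts between the two representations
nfc = unicodedata.normalize("NFC", decomposed)   # compose
nfd = unicodedata.normalize("NFD", precomposed)  # decompose

print(len(decomposed), len(nfc))  # number of code points in each form
print(nfc == precomposed, nfd == decomposed)
```

The strings compare unequal code point by code point, which is why text processing that counts "characters" as code points can disagree with what a user perceives as a single grapheme.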
With the unified Han ideographs, the Unicode Standard departs from prior practice by assigning abstract characters not as graphemes, but according to the underlying meaning of the grapheme: what linguists sometimes call [[sememe]]s. This departure is therefore not fully explained by the oft-quoted distinction between an abstract character and a glyph; it is rooted in the difference between an abstract character assigned as a grapheme and an abstract character assigned as a sememe. In contrast, consider [[ASCII]]'s unification of [[punctuation]] and [[diacritic]]s, where graphemes with widely different meanings (for example, an [[apostrophe]] and a single quotation mark) are unified because the glyphs are the same. For Unihan the characters are not unified by their appearance, but by their definition or meaning.

For a grapheme to be represented by various glyphs means that the grapheme has glyph variations, which are usually determined by selecting one font or another or by using glyph substitution features where multiple glyphs are included in a single font. Unicode considers such glyph variations a feature of rich text protocols, outside the scope of its plain text goals. However, when the change from one glyph to another constitutes a change from one grapheme to another (where a glyph can no longer be read as, for example, the small letter "a"), Unicode separates them into distinct code points. For Unihan the same thing is done whenever the abstract meaning changes. However, rather than considering the abstract meaning of a grapheme (the letter "a"), the unification of Han ideographs assigns a single code point to each distinct meaning, even if that meaning is expressed by distinct graphemes in different languages.
Although a grapheme such as "ö" might mean something different in English (as used in the word "coördinated") than it does in German (as used in the word "schön"), it is still the same grapheme and can be easily unified so that English and German can share a common abstract Latin writing system (along with Latin itself). This example also points to another reason that an "abstract character" and a grapheme, as an abstract unit in a written language, do not necessarily map one-to-one. In English the [[Diaeresis (diacritic)|combining diaeresis]], "¨", and the "o" it modifies may be seen as two separate graphemes, whereas in languages such as Swedish, the letter "ö" may be seen as a single grapheme. Similarly, in English [[tittle|the dot]] on an "i" is understood as a part of the "i" grapheme, whereas in other languages, such as Turkish, the dot may be seen as a separate grapheme added to the [[Dotless I|dotless "ı"]].

To deal with the use of different graphemes for the same Unihan sememe, Unicode has relied on several mechanisms, especially for rendering text. One has been to treat it as simply a font issue, so that different fonts might be used to render Chinese, Japanese or Korean. In addition, font formats such as OpenType allow alternate glyphs to be mapped according to language, so that a text rendering system can consult the user's environment settings to determine which glyph to use.
The problem with these approaches is that they fail to meet the goals of Unicode to define a consistent way of encoding multilingual text.<ref name="tusch01">{{cite web|url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-1|title=Chapter 1: Introduction|work=The Unicode Standard|publisher=Unicode Consortium}}</ref> So rather than treat the issue as a rich text problem of glyph alternates, Unicode added the concept of [[Variation Selectors (Unicode block)|variation selectors]], first introduced in version 3.2 and supplemented in version 4.0.<ref name="UnicodeVariationSelectors">{{Cite web|url=https://www.unicode.org/ivd/|title=Ideographic Variation Database|publisher=Unicode Consortium}}</ref> While variation selectors are treated as combining characters, they have no associated diacritic or mark. Instead, by combining with a base character, they signal that the two-character sequence selects a variation of the base character (typically in terms of grapheme, but also in terms of underlying meaning, as in the case of a location name or other proper noun). This is therefore not a selection of an alternate glyph, but the selection of a grapheme variation or a variation of the base abstract character. Such a two-character sequence, however, can be easily mapped to a separate single glyph in modern fonts. Since Unicode has assigned 256 separate variation selectors, it is capable of assigning 256 variations for any Han ideograph. Such variations can be specific to one language or another and enable the encoding of plain text that includes such grapheme variations.
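The structure of a variation sequence can be sketched in a few lines of Python. This is an illustration only: the base ideograph U+845B and the selector U+E0100 are chosen as an example, and the script does not check whether a given sequence is actually registered in the Ideographic Variation Database or supported by any font. The 256 selectors are taken from the two assigned ranges, U+FE00–U+FE0F and U+E0100–U+E01EF.

```python
import unicodedata

# VS1-VS16 (U+FE00..U+FE0F) and VS17-VS256 (U+E0100..U+E01EF)
selectors = [chr(cp) for cp in range(0xFE00, 0xFE10)]
selectors += [chr(cp) for cp in range(0xE0100, 0xE01F0)]
print(len(selectors))  # total number of variation selectors

# A base Han ideograph followed by one selector forms a variation sequence
base = "\u845B"          # example base character (assumption for illustration)
selector = "\U000E0100"  # VARIATION SELECTOR-17
sequence = base + selector

print(unicodedata.name(base))
print(unicodedata.name(selector))

# The sequence is plain text: two code points, unchanged by normalization
print(len(sequence))
print(unicodedata.normalize("NFC", sequence) == sequence)

# Removing the selector recovers the unified base character
stripped = "".join(ch for ch in sequence if ch not in selectors)
print(stripped == base)
```

Because the selector travels with the base character as an ordinary code point, the distinction survives copy and paste and plain-text storage; software that does not recognize the sequence can ignore the selector and fall back to the unified character.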