=== General Category property ===
Each code point is assigned a classification, listed as the code point's [[Character property (Unicode)#General Category|General Category]] property. At the uppermost level, code points are categorized as one of Letter, Mark, Number, Punctuation, Symbol, Separator, or Other. Each category is then divided into subcategories. In most cases, other properties must be used to adequately describe all the characteristics of any given code point.

{{General Category (Unicode)}}

The {{val|1024}} code points in the range {{tt|U+D800}}–{{tt|U+DBFF}} are known as ''high-surrogate'' code points, and the {{val|1024}} code points in the range {{tt|U+DC00}}–{{tt|U+DFFF}} are known as ''low-surrogate'' code points. A high-surrogate code point followed by a low-surrogate code point forms a ''surrogate pair'' in UTF-16, which represents a code point greater than {{tt|U+FFFF}}. In principle, these code points cannot otherwise be used, though in practice this rule is often ignored, especially when not using UTF-16.

A small set of code points is guaranteed never to be assigned to characters, although third parties may make independent use of them at their discretion. There are 66 of these ''noncharacters'': {{tt|U+FDD0}}–{{tt|U+FDEF}} and the last two code points in each of the 17 planes (e.g. {{tt|U+FFFE}}, {{tt|U+FFFF}}, {{tt|U+1FFFE}}, {{tt|U+1FFFF}}, ..., {{tt|U+10FFFE}}, {{tt|U+10FFFF}}). The set of noncharacters is stable, and no new noncharacters will ever be defined.<ref name="stability-policy">{{Cite web |title=Unicode Character Encoding Stability Policy |url=https://unicode.org/policies/stability_policy.html |access-date=16 March 2010}}</ref> Like surrogates, the rule that these cannot be used is often ignored, although the operation of the [[byte order mark]] assumes that {{tt|U+FFFE}} will never be the first code point in a text.
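The surrogate and noncharacter ranges described above reduce to simple integer arithmetic. The following Python sketch is illustrative only; the function names are hypothetical and are not taken from ''The Unicode Standard''. It assumes the usual UTF-16 scheme in which the 20-bit offset above {{tt|U+FFFF}} is split into two 10-bit halves.

```python
HIGH_START = 0xD800  # first high-surrogate code point
LOW_START = 0xDC00   # first low-surrogate code point


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = cp - 0x10000  # 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Count noncharacters across the whole codespace (17 planes of 65,536).
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
```

Under these assumptions, {{tt|U+1F600}} maps to the pair {{tt|U+D83D}}, {{tt|U+DE00}}, and the count of noncharacters comes out to 66 (32 in the contiguous range plus 2 × 17).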
The exclusion of surrogates and noncharacters leaves {{val|1111998}} code points available for use.

''Private use'' code points are considered to be assigned, but they intentionally have no interpretation specified by ''The Unicode Standard'',<ref>{{Cite web |title=Properties |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G43463 |access-date=13 September 2024}}</ref> so any interchange of such code points requires an independent agreement between the sender and receiver as to their interpretation. There are three private use areas in the Unicode codespace:
* Private Use Area: {{tt|U+E000}}–{{tt|U+F8FF}} ({{val|6400}} characters),
* Supplementary Private Use Area-A: {{tt|U+F0000}}–{{tt|U+FFFFD}} ({{val|65534}} characters),
* Supplementary Private Use Area-B: {{tt|U+100000}}–{{tt|U+10FFFD}} ({{val|65534}} characters).

''Graphic'' characters are those defined by ''The Unicode Standard'' to have particular semantics, either having a visible [[glyph]] shape or representing a visible space. As of Unicode 16.0, there are {{val|154826}} graphic characters.

''Format'' characters are characters that do not have a visible appearance but may have an effect on the appearance or behavior of neighboring characters. For example, {{unichar|200C|Zero width non-joiner|nlink=}} and {{unichar|200D|Zero width joiner|nlink=}} may be used to change the default shaping behavior of adjacent characters (e.g. to inhibit ligatures or request ligature formation). There are 172 format characters in Unicode 16.0.

65 code points, the ranges {{tt|U+0000}}–{{tt|U+001F}} and {{tt|U+007F}}–{{tt|U+009F}}, are reserved as ''control codes'', corresponding to the [[C0 and C1 control codes]] as defined in [[ISO/IEC 6429]]. {{tt|U+0009}} {{smallcaps|CHARACTER TABULATION}}, {{tt|U+000A}} {{smallcaps|LINE FEED}}, and {{tt|U+000D}} {{smallcaps|CARRIAGE RETURN}} are widely used in texts using Unicode.
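The counts quoted above follow directly from the range boundaries. The Python sketch below recomputes them; the constant and function names are hypothetical, chosen only for illustration, and each range is written as an inclusive (first, last) pair.

```python
CODESPACE_SIZE = 0x110000                # U+0000..U+10FFFF, 17 planes
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1    # high and low surrogates together
NONCHARACTER_COUNT = 32 + 2 * 17         # U+FDD0..U+FDEF plus two per plane

# Inclusive (first, last) bounds of the three private use areas.
PRIVATE_USE_AREAS = {
    "Private Use Area": (0xE000, 0xF8FF),
    "Supplementary Private Use Area-A": (0xF0000, 0xFFFFD),
    "Supplementary Private Use Area-B": (0x100000, 0x10FFFD),
}

# Inclusive bounds of the C0 controls (with DELETE) and the C1 controls.
CONTROL_RANGES = [(0x0000, 0x001F), (0x007F, 0x009F)]


def span(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


available = CODESPACE_SIZE - SURROGATE_COUNT - NONCHARACTER_COUNT
private_use_sizes = {name: span(*r) for name, r in PRIVATE_USE_AREAS.items()}
control_count = sum(span(*r) for r in CONTROL_RANGES)
```

Evaluated as written, <code>available</code> is 1,111,998, the three private use areas hold 6,400, 65,534, and 65,534 code points, and <code>control_count</code> is 65.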
In a phenomenon known as [[mojibake]], the C1 code points are improperly decoded according to the [[Windows-1252]] code page, which was previously widely used in Western European contexts.

Graphic, format, control code, and private use characters are collectively referred to as ''assigned characters''.

''Reserved'' code points are those code points that are valid and available for use, but have not yet been assigned. As of Unicode 16.0, there are {{val|819467}} reserved code points.
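The classification in this section can be approximated with Python's standard <code>unicodedata</code> module, which exposes each character's two-letter General Category value. This is a rough sketch rather than a normative mapping: the helper names are hypothetical, and the grouping of <code>Cf</code>, <code>Zl</code>, and <code>Zp</code> as format characters is an assumption based on the standard's table of code point types. Results for recently added characters depend on the Unicode version bundled with the interpreter (<code>unicodedata.unidata_version</code>).

```python
import unicodedata

# First letter of the General Category value -> top-level category name.
TOP_LEVEL = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}


def top_level_category(ch: str) -> str:
    """Top-level General Category of a single character."""
    return TOP_LEVEL[unicodedata.category(ch)[0]]


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_type(ch: str) -> str:
    """Map a character to the code point types named in this section."""
    gc = unicodedata.category(ch)
    if gc == "Cs":
        return "surrogate"
    if gc == "Co":
        return "private use"
    if gc == "Cc":
        return "control"
    if gc in ("Cf", "Zl", "Zp"):  # assumption: these three make up "format"
        return "format"
    if gc == "Cn":  # unassigned: either a noncharacter or a reserved code point
        return "noncharacter" if is_noncharacter(ord(ch)) else "reserved"
    return "graphic"


# Cross-check of the reserved total, using only counts quoted in this section.
private_use = 6400 + 65534 + 65534
reserved = 0x110000 - 2048 - 66 - private_use - 65 - 154826 - 172
```

For example, <code>code_point_type("\u200c")</code> yields <code>"format"</code> and <code>top_level_category("A")</code> yields <code>"Letter"</code>. The final tally subtracts surrogates, noncharacters, private use, control, graphic, and format code points from the {{val|1114112}}-code-point codespace and arrives at {{val|819467}}.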