== Architecture and terminology ==
{{See also|Universal Character Set characters}}{{Anchor|Upluslink}}<!-- Template:U+ links to this paragraph -->

=== Codespace and code points ===
''The Unicode Standard'' defines a ''codespace'':<ref name="Glossary">{{Cite web |title=Glossary of Unicode Terms |url=https://unicode.org/glossary/ |access-date=16 March 2010}}</ref> a sequence of integers called ''[[code point]]s''<ref name=":0">{{Cite book |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G25564 |title=The Unicode Standard Version 16.0 – Core Specification |year=2024 |chapter=2.4 Code Points and Characters}}</ref> in the range from 0 to {{val|1114111}}, notated according to the standard as {{tt|U+0000}}–{{tt|U+10FFFF}}.<ref>{{Cite book |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G2212 |title=The Unicode Standard, Version 16.0 |year=2024 |chapter=3.4 Characters and Encoding}}</ref> The codespace is a systematic, architecture-independent representation of ''The Unicode Standard''; actual text is processed as binary data via one of several Unicode encodings, such as [[UTF-8]].

In this normative notation, the two-character prefix <code>U+</code> always precedes a written code point,<ref>{{Cite mailing list |url=https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html |title=Re: Origin of the U+nnnn notation |date=8 November 2005 |mailing-list=Unicode Mail List Archive}}</ref> and the code points themselves are written as [[hexadecimal]] numbers. At least four hexadecimal digits are always written, with [[leading zero]]s prepended as needed.
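This notation is easy to produce programmatically. A minimal Python sketch of the convention (the helper name <code>u_plus</code> is ours, not part of any standard library):

```python
def u_plus(cp: int) -> str:
    """Format a code point in U+ notation: hexadecimal digits,
    at least four of them, with leading zeros prepended as needed."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace U+0000..U+10FFFF")
    return f"U+{cp:04X}"

print(u_plus(0xF7))     # U+00F7  (padded with two leading zeros)
print(u_plus(0x13254))  # U+13254 (already five digits, no padding)
```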
For example, the code point {{unichar|F7|Division sign}} is padded with two leading zeros, but {{unichar|13254|Egyptian hieroglyph O004}} ([[File:Hiero O4.png|class=skin-invert-image|text-bottom|15px]]) is not padded.<ref>{{Cite web |date=September 2024 |title=Appendix A: Notational Conventions |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/appendix-a/ |website=The Unicode Standard |publisher=Unicode Consortium}}</ref>

There are a total of {{val|1112064}} valid code points within the codespace.<ref>{{cite book |title=The Unicode Standard |publisher=[[The Unicode Consortium]] |isbn=978-1-936213-01-6 |edition=6.0 |location=Mountain View, California, US |at=3.9 Unicode Encoding Forms |chapter=Conformance |quote=Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF |chapter-url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G7404}}</ref> This number arises from the limitations of the [[UTF-16]] character encoding, which can encode the 2<sup>16</sup> code points in the range {{tt|U+0000}} through {{tt|U+FFFF}} except for the 2<sup>11</sup> code points in the range {{tt|U+D800}} through {{tt|U+DFFF}}, which are used as surrogate pairs to encode the 2<sup>20</sup> code points in the range {{tt|U+10000}} through {{tt|U+10FFFF}}.

=== Code planes and blocks ===
{{Main|Plane (Unicode)}}
The Unicode codespace is divided into 17 ''planes'', numbered 0 to 16. Plane 0 is the [[Basic Multilingual Plane]] (BMP), and contains the most commonly used characters. All code points in the BMP are accessed as a single code unit in UTF-16 encoding and can be encoded in one, two or three bytes in UTF-8. Code points in planes 1 through 16 (the ''supplementary planes'') are accessed as surrogate pairs in [[UTF-16]] and encoded in four bytes in [[UTF-8]].

Within each plane, characters are allocated within named ''[[Block (Unicode)|blocks]]'' of related characters.
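The plane of a code point and its UTF-16 form both follow directly from this arithmetic. A Python sketch (the function names are illustrative, not standard):

```python
def plane(cp: int) -> int:
    """Plane number 0-16: the bits of the code point above the low 16."""
    return cp >> 16

def utf16_code_units(cp: int) -> list[int]:
    """BMP code points occupy a single 16-bit unit; supplementary-plane
    code points are split into a high/low surrogate pair."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000                      # 20 bits to distribute
    return [0xD800 | (v >> 10),           # high surrogate: top 10 bits
            0xDC00 | (v & 0x3FF)]         # low surrogate: bottom 10 bits

print(plane(0x0041))    # 0 (BMP)
print(plane(0x13254))   # 1 (a supplementary plane)
print([hex(u) for u in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
```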
The size of a block is always a multiple of 16, and is often a multiple of 128, but is otherwise arbitrary. Characters required for a given script may be spread out over several different, potentially disjunct blocks within the codespace.

=== General Category property ===
Each code point is assigned a classification, listed as the code point's [[Character property (Unicode)#General Category|General Category]] property. At the uppermost level, code points are categorized as one of Letter, Mark, Number, Punctuation, Symbol, Separator, or Other. Under each category, each code point is then further subcategorized. In most cases, other properties must be used to adequately describe all the characteristics of any given code point.

{{General Category (Unicode)}}

The {{val|1024}} code points in the range {{tt|U+D800}}–{{tt|U+DBFF}} are known as ''high-surrogate'' code points, and the {{val|1024}} code points in the range {{tt|U+DC00}}–{{tt|U+DFFF}} are known as ''low-surrogate'' code points. A high-surrogate code point followed by a low-surrogate code point forms a ''surrogate pair'' in UTF-16 in order to represent code points greater than {{tt|U+FFFF}}. In principle, these code points cannot otherwise be used, though in practice this rule is often ignored, especially when not using UTF-16.

A small set of code points are guaranteed never to be assigned to characters, although third parties may make independent use of them at their discretion. There are 66 of these ''noncharacters'': {{tt|U+FDD0}}–{{tt|U+FDEF}} and the last two code points in each of the 17 planes (e.g. {{tt|U+FFFE}}, {{tt|U+FFFF}}, {{tt|U+1FFFE}}, {{tt|U+1FFFF}}, ..., {{tt|U+10FFFE}}, {{tt|U+10FFFF}}).
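Python's standard <code>unicodedata</code> module exposes the General Category property directly; note that a lone surrogate reports category <code>Cs</code>:

```python
import unicodedata

# Two-letter codes: major category (L, M, N, P, S, Z, C) + subcategory.
print(unicodedata.category("A"))       # 'Lu' - Letter, uppercase
print(unicodedata.category("5"))       # 'Nd' - Number, decimal digit
print(unicodedata.category(" "))       # 'Zs' - Separator, space
print(unicodedata.category("\u0301"))  # 'Mn' - Mark, nonspacing
print(unicodedata.category("\ud800"))  # 'Cs' - Other, surrogate
```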
The set of noncharacters is stable, and no new noncharacters will ever be defined.<ref name="stability-policy">{{Cite web |title=Unicode Character Encoding Stability Policy |url=https://unicode.org/policies/stability_policy.html |access-date=16 March 2010}}</ref> Like surrogates, the rule that these cannot be used is often ignored, although the operation of the [[byte order mark]] assumes that {{tt|U+FFFE}} will never be the first code point in a text. The exclusion of surrogates and noncharacters leaves {{val|1111998}} code points available for use.

''Private use'' code points are considered to be assigned, but they intentionally have no interpretation specified by ''The Unicode Standard''<ref>{{Cite web |title=Properties |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G43463 |access-date=13 September 2024}}</ref> such that any interchange of such code points requires an independent agreement between the sender and receiver as to their interpretation. There are three private use areas in the Unicode codespace:
* Private Use Area: {{tt|U+E000}}–{{tt|U+F8FF}} ({{val|6400}} characters),
* Supplementary Private Use Area-A: {{tt|U+F0000}}–{{tt|U+FFFFD}} ({{val|65534}} characters),
* Supplementary Private Use Area-B: {{tt|U+100000}}–{{tt|U+10FFFD}} ({{val|65534}} characters).

''Graphic'' characters are those defined by ''The Unicode Standard'' to have particular semantics, either having a visible [[glyph]] shape or representing a visible space. As of Unicode 16.0, there are {{val|154826}} graphic characters.

''Format'' characters are characters that do not have a visible appearance but may have an effect on the appearance or behavior of neighboring characters. For example, {{unichar|200C|Zero width non-joiner|nlink=}} and {{unichar|200D|Zero width joiner|nlink=}} may be used to change the default shaping behavior of adjacent characters (e.g. to inhibit ligatures or request ligature formation). There are 172 format characters in Unicode 16.0.
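The noncharacter and private-use ranges given above can be checked mechanically. A Python sketch (the predicate names are ours):

```python
def is_noncharacter(cp: int) -> bool:
    """The 66 noncharacters: U+FDD0..U+FDEF, plus the last two
    code points (xxFFFE and xxFFFF) of each of the 17 planes."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_private_use(cp: int) -> bool:
    """The BMP Private Use Area and the two supplementary PUAs."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

# Exactly 66 noncharacters in the codespace (32 + 2 per plane):
print(sum(is_noncharacter(cp) for cp in range(0x110000)))  # 66
# 6400 + 65534 + 65534 private-use code points:
print(sum(is_private_use(cp) for cp in range(0x110000)))   # 137468
```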
Sixty-five code points, the ranges {{tt|U+0000}}–{{tt|U+001F}} and {{tt|U+007F}}–{{tt|U+009F}}, are reserved as ''control codes'', corresponding to the [[C0 and C1 control codes]] as defined in [[ISO/IEC 6429]]. Of these, {{tt|U+0009}} {{smallcaps|CHARACTER TABULATION}}, {{tt|U+000A}} {{smallcaps|LINE FEED}}, and {{tt|U+000D}} {{smallcaps|CARRIAGE RETURN}} are widely used in texts using Unicode. In a phenomenon known as [[mojibake]], the C1 code points are improperly decoded according to the [[Windows-1252]] codepage, previously widely used in Western European contexts.

Graphic, format, control code, and private use characters are collectively referred to as ''assigned characters''. ''Reserved'' code points are those code points that are valid and available for use, but have not yet been assigned. As of Unicode 15.1, there are {{val|819467}} reserved code points.

=== Abstract characters <span class="anchor" id="Alias"></span>===
{{Further|Universal Character Set characters#Characters, grapheme clusters and glyphs}}
The set of graphic and format characters defined by Unicode does not correspond directly to the repertoire of ''abstract characters'' representable under Unicode. Unicode encodes characters by associating an abstract character with a particular code point.<ref>{{Cite web |title=Unicode Character Encoding Model |url=https://unicode.org/reports/tr17/ |access-date=12 September 2023}}</ref> However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. For example, a Latin small letter "i" with an [[ogonek]], a [[dot above]], and an [[acute accent]], which is required in [[Lithuanian language|Lithuanian]], is represented by the character sequence {{tt|U+012F}}; {{tt|U+0307}}; {{tt|U+0301}}.
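Mojibake of this kind is easy to reproduce: bytes produced under one encoding are decoded under another. A minimal Python illustration:

```python
# 'é' is the two bytes C3 A9 in UTF-8; decoded as Windows-1252,
# each byte becomes a separate character, producing mojibake.
utf8_bytes = "é".encode("utf-8")
print(utf8_bytes)                         # b'\xc3\xa9'
print(utf8_bytes.decode("windows-1252"))  # 'Ã©'
```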
Unicode maintains a list of uniquely named character sequences for abstract characters that are not directly encoded in Unicode.<ref>{{Cite web |title=Unicode Named Sequences |url=https://unicode.org/Public/UNIDATA/NamedSequences.txt |access-date=16 September 2022}}</ref>

All assigned characters have a unique and immutable name by which they are identified. This immutability has been guaranteed since version 2.0 of ''The Unicode Standard'' by its Name Stability policy.<ref name="stability-policy" /> In cases where a name is seriously defective and misleading, or has a serious typographical error, a formal '''alias''' may be defined that applications are encouraged to use in place of the official character name. For example, {{unichar|A015|YI SYLLABLE WU}} has the formal alias {{sc2|YI SYLLABLE ITERATION MARK}}, and {{unichar|FE18|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRA'''KC'''ET|note=[[sic]]}} has the formal alias {{sc2|PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRA'''CK'''ET}}.<ref>{{Cite web |title=Unicode Name Aliases |url=https://unicode.org/Public/UNIDATA/NameAliases.txt |access-date=16 March 2010}}</ref>

=== Ready-made versus composite characters ===
Unicode includes a mechanism for modifying characters that greatly extends the supported repertoire of glyphs. This covers the use of [[combining diacritical mark]]s that may be added after the base character by the user. Multiple combining diacritics may be simultaneously applied to the same character. Unicode also contains [[precomposed character|precomposed]] versions of most letter/diacritic combinations in normal use. These make the conversion to and from legacy encodings simpler, and allow applications to use Unicode as an internal text format without having to implement combining characters.
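Character names can be queried from Python's standard <code>unicodedata</code> module; since Python 3.3, <code>unicodedata.lookup()</code> also accepts formal aliases such as the corrected BRACKET spelling:

```python
import unicodedata

print(unicodedata.name("\u00E9"))  # 'LATIN SMALL LETTER E WITH ACUTE'
print(unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE"))  # 'é'

# name() returns the immutable (misspelled) official name of U+FE18...
print(unicodedata.name("\uFE18"))
# ...while lookup() also accepts its formal alias:
alias = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
print(unicodedata.lookup(alias) == "\uFE18")  # True
```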
For example, <code>é</code> can be represented in Unicode as {{unichar|65|LATIN SMALL LETTER E}} followed by {{unichar|301|COMBINING ACUTE ACCENT|cwith=◌}}, and equivalently as the precomposed character {{unichar|E9|LATIN SMALL LETTER E WITH ACUTE}}. Thus, users often have multiple equivalent ways of encoding the same character. The mechanism of [[canonical equivalence]] within ''The Unicode Standard'' ensures the practical interchangeability of these equivalent encodings.

An example of this arises with the Korean alphabet [[Hangul]]: Unicode provides a mechanism for composing Hangul syllables from their individual [[Hangul Jamo]] subcomponents. However, it also provides {{val|11172}} precomposed syllables made from the most common combinations of jamo.

[[CJK characters]] presently only have codes for uncomposable radicals and precomposed forms. Most Han characters have either been intentionally composed from, or reconstructed as compositions of, simpler orthographic elements called [[Radical (Chinese characters)|radicals]], so in principle Unicode could have enabled their composition as it did with Hangul. While this could have greatly reduced the number of required code points, as well as allowing the algorithmic synthesis of many arbitrary new characters, the complexities of character etymologies and the post-hoc nature of radical systems add immense complexity to the proposal. Indeed, attempts to design CJK encodings on the basis of composing radicals have been met with difficulties resulting from the reality that Chinese characters do not decompose as simply or as regularly as Hangul does.

The [[CJK Radicals Supplement]] block is assigned to the range {{tt|U+2E80}}–{{tt|U+2EFF}}, and the [[Kangxi radicals]] are assigned to {{tt|U+2F00}}–{{tt|U+2FDF}}.
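Canonical equivalence is implemented by Unicode normalization. In Python, <code>unicodedata.normalize()</code> converts between the composed (NFC) and decomposed (NFD) forms, and NFC likewise composes conjoining Hangul jamo into precomposed syllables:

```python
import unicodedata

decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

print(decomposed == precomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# Hangul jamo compose the same way: U+1100 + U+1161 -> U+AC00 '가'
print(unicodedata.normalize("NFC", "\u1100\u1161") == "\uAC00")  # True
```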
The [[Ideographic Description Sequences]] block covers the range {{tt|U+2FF0}}–{{tt|U+2FFB}}, but ''The Unicode Standard'' warns against using its characters as an alternate representation for characters encoded elsewhere:

{{blockquote|This process is different from a formal ''encoding'' of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideographs; there is no equivalence defined for described ideographs. Conceptually, ideographic descriptions are more akin to the English phrase "an 'e' with an acute accent on it" than to the character sequence <U+0065, U+0301>.}}

=== Ligatures ===
<div class='skin-invert-image'>{{Multiple image
 |total_width = 300
 |image1 = JanaSanskritSans ddhrya.svg
 |caption1 = The [[Devanagari|Devanāgarī]] ''{{IAST|ddhrya}}''-ligature (द् + ध् + र् + य = द्ध्र्य) of JanaSanskritSans<ref>{{Cite web |title=JanaSanskritSans |url=http://tdil.mit.gov.in/download/janasanskrit.htm |url-status=dead |archive-url=https://web.archive.org/web/20110716160603/http://tdil.mit.gov.in/download/janasanskrit.htm |archive-date=16 July 2011}}</ref>
 |image2 = 23a-Lam-Alif.svg
 |caption2 = The [[Arabic script in Unicode|Arabic]] {{lang|ar-Latn|[[lām]]-[[aleph#Arabic|alif]]}} ligature ({{lang|ar|ل}} ‎+‎ {{lang|ar|ا}} ‎=‎ {{lang|ar|لا}})
}}</div>
Many scripts, including [[Arabic script in Unicode|Arabic]] and [[Devanagari|Devanāgarī]], have special orthographic rules that require certain combinations of letterforms to be combined into special [[ligature (typography)|ligature forms]].
The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such as ACE (Arabic Calligraphic Engine, developed by DecoType in the 1980s and used to generate all the Arabic examples in the printed editions of ''The Unicode Standard''), which became the [[proof of concept]] for [[OpenType]] (by Adobe and Microsoft), [[Graphite (SIL)|Graphite]] (by [[SIL International]]), and [[Apple Advanced Typography|AAT]] (by Apple). Instructions are also embedded in fonts to tell the operating system how to properly output different character sequences.

A simple solution to the placement of combining marks or diacritics is assigning the marks a width of zero and placing the glyph itself to the left or right of the left sidebearing (depending on the direction of the script they are intended to be used with). A mark handled this way will appear over whatever character precedes it, but will not adjust its position relative to the width or height of the base glyph; it may be visually awkward and it may overlap some glyphs. Real stacking is impossible but can be approximated in limited cases (for example, Thai top-combining vowels and tone marks can just be at different heights to start with). Generally, this approach is only effective in monospaced fonts but may be used as a fallback rendering method when more complex methods fail.

=== Standardized subsets ===
Several subsets of Unicode are standardized: Microsoft Windows since [[Windows NT 4.0]] supports [[WGL-4]] with 657 characters, which is considered to support all contemporary European languages using the Latin, Greek, or Cyrillic script.
Other standardized subsets of Unicode include the Multilingual European Subsets:<ref>[https://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf CWA 13873:2000 – Multilingual European Subsets in ISO/IEC 10646-1] [[European Committee for Standardization|CEN]] Workshop Agreement 13873</ref> MES-1 (Latin scripts only; 335 characters), MES-2 (Latin, Greek, and Cyrillic; 1062 characters)<ref>{{Cite web |last = Kuhn |first = Markus |author-link = Markus Kuhn (computer scientist) |date = 1998 |title=Multilingual European Character Set 2 (MES-2) Rationale |url=https://www.cl.cam.ac.uk/~mgk25/ucs/mes-2-rationale.html |access-date=20 March 2023 |publisher=University of Cambridge}}</ref> and MES-3A & MES-3B (two larger subsets, not shown here). MES-2 includes every character in MES-1 and WGL-4.

The standard [[DIN 91379]]<ref>{{Cite web |title=DIN 91379:2022-08: Characters and defined character sequences in Unicode for the electronic processing of names and data exchange in Europe, with CD-ROM |url=https://www.beuth.de/en/standard/din-91379/353496133 |access-date=21 August 2022 |publisher=Beuth Verlag}}</ref> specifies a subset of Unicode letters, special characters, and sequences of letters and diacritic signs to allow the correct representation of names and to simplify data exchange in Europe. This standard supports all of the official languages of all European Union countries, as well as the German minority languages and the official languages of Iceland, Liechtenstein, Norway, and Switzerland. To allow the transliteration of names in other writing systems to the Latin script according to the relevant ISO standards, all necessary combinations of base letters and diacritic signs are provided.

{| class="wikitable"
|+ {{nobold|'''WGL-4''', ''MES-1'' and MES-2}}
|-
! Row !! Cells !! Range(s)
|-
!rowspan="2"| 00
| '''''20–7E'''''
| [[Basic Latin (Unicode block)|Basic Latin]] (00–7F)
|-
| '''''A0–FF'''''
| [[Latin-1 Supplement (Unicode block)|Latin-1 Supplement]] (80–FF)
|-
!rowspan="2"| 01
| '''''00–13,'' 14–15, ''16–2B,'' 2C–2D, ''2E–4D,'' 4E–4F, ''50–7E,'' 7F'''
| [[Latin Extended-A]] (00–7F)
|-
| 8F, '''92,''' B7, DE-EF, '''FA–FF'''
| [[Latin Extended-B]] (80–FF <span title="U+024F">...</span>)
|-
!rowspan="3"| 02
| 18–1B, 1E–1F
| Latin Extended-B (<span title="U+00180">...</span> 00–4F)
|-
| 59, 7C, 92
| [[IPA Extensions]] (50–AF)
|-
| BB–BD, '''C6, ''C7,'' C9,''' D6, '''''D8–DB,'' DC, ''DD,''''' DF, EE
| [[Spacing Modifier Letters]] (B0–FF)
|-
! 03
| 74–75, 7A, 7E, '''84–8A, 8C, 8E–A1, A3–CE,''' D7, DA–E1
| [[Greek and Coptic|Greek]] (70–FF)
|-
! 04
| '''00–5F, 90–91,''' 92–C4, C7–C8, CB–CC, D0–EB, EE–F5, F8–F9
| [[Cyrillic (Unicode block)|Cyrillic]] (00–FF)
|-
! 1E
| 02–03, 0A–0B, 1E–1F, 40–41, 56–57, 60–61, 6A–6B, '''80–85,''' 9B, '''F2–F3'''
| [[Latin Extended Additional]] (00–FF)
|-
! 1F
| 00–15, 18–1D, 20–45, 48–4D, 50–57, 59, 5B, 5D, 5F–7D, 80–B4, B6–C4, C6–D3, D6–DB, DD–EF, F2–F4, F6–FE
| [[Greek Extended]] (00–FF)
|-
!rowspan="3"| 20
| '''13–14, ''15,'' 17, ''18–19,'' 1A–1B, ''1C–1D,'' 1E, 20–22, 26, 30, 32–33, 39–3A, 3C, 3E, 44,''' 4A
| [[General Punctuation]] (00–6F)
|-
| '''7F''', 82
| [[Superscripts and Subscripts]] (70–9F)
|-
| '''A3–A4, A7, ''AC,''''' AF
| [[Currency Symbols (Unicode block)|Currency Symbols]] (A0–CF)
|-
!rowspan="3"| 21
| '''05, 13, 16, ''22, 26,'' 2E'''
| [[Letterlike Symbols]] (00–4F)
|-
| '''''5B–5E'''''
| [[Number Forms]] (50–8F)
|-
| '''''90–93,'' 94–95, A8'''
| [[Arrows (Unicode block)|Arrows]] (90–FF)
|-
! 22
| 00, '''02,''' 03, '''06,''' 08–09, '''0F, 11–12, 15, 19–1A, 1E–1F,''' 27–28, '''29,''' 2A, '''2B, 48,''' 59, '''60–61, 64–65,''' 82–83, 95, 97
| [[Mathematical Operators]] (00–FF)
|-
! 23
| '''02, 0A, 20–21,''' 29–2A
| [[Miscellaneous Technical]] (00–FF)
|-
!rowspan="3"| 25
| '''00, 02, 0C, 10, 14, 18, 1C, 24, 2C, 34, 3C, 50–6C'''
| [[Box Drawing]] (00–7F)
|-
| '''80, 84, 88, 8C, 90–93'''
| [[Block Elements]] (80–9F)
|-
| '''A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6'''
| [[Geometric Shapes (Unicode block)|Geometric Shapes]] (A0–FF)
|-
! 26
| '''3A–3C, 40, 42, 60, 63, 65–66, ''6A,'' 6B'''
| [[Miscellaneous Symbols]] (00–FF)
|-
! F0
| (01–02)<!--in WGL-4, but not in MES-2-->
| [[Private Use Area (Unicode block)|Private Use Area]] (00–FF ...)
|-
! FB
| '''01–02'''
| [[Alphabetic Presentation Forms]] (00–4F)
|-
! FF
| FD
| [[Specials (Unicode block)|Specials]]
|}

Rendering software that cannot process a Unicode character appropriately often displays it as an open rectangle, or as {{tt|U+FFFD}} to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. Apple's [[Last Resort font]] will display a substitute glyph indicating the Unicode range of the character, and SIL International's [[Unicode fallback font]] will display a box showing the hexadecimal scalar value of the character.

=== {{anchor|UTF|UCS}}Mapping and encodings ===
Several mechanisms have been specified for storing a series of code points as a series of bytes. <!-- [[Unicode Transformation Format]] redirects here -->

Unicode defines two mapping methods: the '''Unicode Transformation Format''' (UTF) encodings, and the '''[[Universal Coded Character Set]]''' (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode ''code points'' to sequences of values in some fixed-size range, termed ''code units''.
All UTF encodings map code points to a unique sequence of bytes.<ref>{{Cite web |title=UTF-8, UTF-16, UTF-32 & BOM |url=https://unicode.org/faq/utf_bom.html |access-date=12 December 2016 |website=Unicode.org FAQ}}</ref> The numbers in the names of the encodings indicate the number of bits per code unit (for UTF encodings) or the number of bytes per code unit (for UCS encodings and [[UTF-1]]). UTF-8 and UTF-16 are the most commonly used encodings. [[Universal Coded Character Set|UCS-2]] is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.

UTF encodings include:
* [[UTF-8]], which uses one to four 8-bit units per [[code point]],<ref group=note>a [[code point]] is an abstract representation of a UCS character by an integer between 0 and 1,114,111 (1,114,112 = 2<sup>20</sup> + 2<sup>16</sup> or 17 × 2<sup>16</sup> = 0x110000 code points)</ref> and has maximal compatibility with [[ASCII]]
* [[UTF-16]], which uses one 16-bit unit per code point below {{tt|U+010000}}, and a [[Universal Character Set characters#Surrogates|surrogate pair]] of two 16-bit units per code point in the range {{tt|U+010000}} to {{tt|U+10FFFF}}
* [[UTF-32]], which uses one 32-bit unit per code point
* [[UTF-EBCDIC]], not specified as part of ''The Unicode Standard'', which uses one to five 8-bit units per code point, intended to maximize compatibility with [[EBCDIC]]

UTF-8 uses one to four 8-bit units (''bytes'') per code point and, being compact for Latin scripts and ASCII-compatible, provides the de facto standard encoding for the interchange of Unicode text. It is used by [[FreeBSD]] and most recent [[Linux distributions]] as a direct replacement for legacy encodings in general text handling.

The UCS-2 and UTF-16 encodings specify the Unicode [[byte order mark]] (BOM) for use at the beginnings of text files, which may be used for byte-order detection (or [[endianness|byte endianness]] detection).
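The per-code-point sizes of these encoding forms can be observed directly in Python (the "-le" codec variants avoid prepending a BOM, so the byte counts reflect the code units alone):

```python
# Bytes per code point in each encoding form.
for ch in ("A", "é", "€", "😀"):
    print(ch,
          len(ch.encode("utf-8")),           # 1 to 4 bytes
          len(ch.encode("utf-16-le")) // 2,  # 1 unit, or 2 (surrogate pair)
          len(ch.encode("utf-32-le")) // 4)  # always exactly 1 unit
```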
The BOM, encoded as {{unichar|FEFF|Byte order mark}}, has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used; {{tt|U+FFFE}} (the result of byte-swapping {{tt|U+FEFF}}) does not equate to a legal character, and {{tt|U+FEFF}} in places other than the beginning of text conveys the zero-width non-break space. The same character converted to UTF-8 becomes the byte sequence <code>EF BB BF</code>. ''The Unicode Standard'' states that the BOM "can serve as a signature for UTF-8 encoded text where the character set is unmarked".<ref>{{Cite book |title=The Unicode Standard, Version 6.2 |publisher=The Unicode Consortium |year=2013 |isbn=978-1-936213-08-5 |page=561}}</ref> Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit [[code page]]s. However, {{IETF RFC|3629}}, the UTF-8 standard, recommends that byte order marks be forbidden in protocols using UTF-8, but discusses the cases where this may not be possible. In addition, the strong restrictions on the byte patterns possible in UTF-8 (for instance, there cannot be any lone bytes with the high bit set) mean that it should be possible to distinguish UTF-8 from other character encodings without relying on the BOM.

In UTF-32 and UCS-4, one [[32-bit computing|32-bit]] code unit serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code unit manifests as a byte sequence). In the other encodings, each code point may be represented by a variable number of code units. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system that uses the [[GNU Compiler Collection|GCC]] compilers to generate software uses it as the standard "[[wide character]]" encoding.
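Python's standard <code>codecs</code> module exposes the BOM constants, and its <code>utf-8-sig</code> codec writes the UTF-8 signature on encode and strips it on decode:

```python
import codecs

print(codecs.BOM_UTF8)                              # b'\xef\xbb\xbf'
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)  # True

# 'utf-8-sig' prepends the signature on encode, strips it on decode:
data = "hi".encode("utf-8-sig")
print(data)                     # b'\xef\xbb\xbfhi'
print(data.decode("utf-8-sig"))  # 'hi'

# The UTF-16 BOMs differ by endianness:
print(codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)  # b'\xff\xfe' b'\xfe\xff'
```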
Some programming languages, such as [[Seed7]], use UTF-32 as an internal representation for strings and characters. Recent versions of the [[Python (programming language)|Python]] programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for Unicode strings, which has helped spread the use of the encoding in [[high-level programming language|high-level]] software.

[[Punycode]], another encoding form, enables the encoding of Unicode strings into the limited character set supported by the [[ASCII]]-based [[Domain Name System]] (DNS). The encoding is used as part of [[IDNA]], which is a system enabling the use of [[Internationalized Domain Names]] in all scripts that are supported by Unicode. Earlier and now historical proposals include [[UTF-5]] and [[UTF-6]].

[[GB 18030|GB18030]] is another encoding form for Unicode, from the [[Standardization Administration of China]]. It is the official [[character set]] of the People's Republic of China (PRC). [[Binary Ordered Compression for Unicode|BOCU-1]] and [[Standard Compression Scheme for Unicode|SCSU]] are Unicode compression schemes. The [[April Fools' Day RFC]] of 2005 specified two parody UTF encodings, [[UTF-9]] and [[UTF-18]].
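Python ships both a raw <code>punycode</code> codec and an <code>idna</code> codec (the latter implementing the older IDNA 2003 rules), which illustrate the DNS encoding described above:

```python
# Raw Punycode: the ASCII code points, a delimiter, then the
# encoded positions and values of the non-ASCII code points.
print("bücher".encode("punycode"))      # b'bcher-kva'

# The IDNA codec applies Punycode per label, adding the 'xn--' prefix:
print("bücher.example".encode("idna"))  # b'xn--bcher-kva.example'
```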