Character encoding
==Unicode encoding model==
[[Unicode]] and its parallel standard, the ISO/IEC 10646 [[Universal Character Set]], together constitute a unified standard for character encoding. Rather than mapping characters directly to [[byte]]s, Unicode separately defines a coded character set that maps characters to unique natural numbers ([[code point]]s), how those code points are mapped to a series of fixed-size natural numbers (code units), and finally how those units are encoded as a stream of octets (bytes). The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways. To describe this model precisely, Unicode uses its own terminology:<ref name="utr17">{{cite web |last1=Whistler |first1=Ken |last2=Freytag |first2=Asmus |title=UTR#17: Unicode Character Encoding Model |url=https://www.unicode.org/reports/tr17/ |publisher=Unicode Consortium |access-date=12 August 2023 |date=2022-11-11}}</ref>

An '''abstract character repertoire''' (ACR) is the full set of abstract characters that a system supports. Unicode has an open repertoire, meaning that new characters will be added to the repertoire over time.

A '''coded character set''' (CCS) is a [[function (mathematics)|function]] that maps characters to ''[[code point]]s'' (each code point represents one character). For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on. Multiple coded character sets may share the same character repertoire; for example, [[ISO/IEC 8859-1]] and IBM code pages 037 and [[Code page 500|500]] all cover the same repertoire but map its characters to different code points.

A '''character encoding form''' (CEF) is the mapping of code points to ''code units'' to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e. practically any computer system).
For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (say, 65,536 to 1.4 million) could be represented by using multiple 16-bit units. This correspondence is defined by a CEF.

A '''character encoding scheme''' (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network. Simple character encoding schemes include [[UTF-8]], [[UTF-16BE]], [[UTF-32BE]], [[UTF-16LE]], and [[UTF-32LE]]; compound character encoding schemes, such as [[UTF-16]], [[UTF-32]] and [[ISO/IEC 2022]], switch between several simple schemes by using a [[byte order mark]] or [[escape sequence]]s; compressing schemes, such as [[Standard Compression Scheme for Unicode|SCSU]] and [[Binary Ordered Compression for Unicode|BOCU]], try to minimize the number of bytes used per code unit.

Although [[UTF-32BE]] and [[UTF-32LE]] are simpler CESes, most systems working with Unicode use either [[UTF-8]], which is [[backward compatibility|backward compatible]] with fixed-length ASCII and maps Unicode code points to variable-length sequences of octets, or [[UTF-16BE]],{{cn|date=August 2023}} which is [[backward compatibility|backward compatible]] with fixed-length UCS-2BE and maps Unicode code points to variable-length sequences of 16-bit words. See [[comparison of Unicode encodings]] for a detailed discussion.

Finally, there may be a '''higher-level protocol''' which supplies additional information to select the particular variant of a [[Unicode]] character, particularly where there are regional variants that have been 'unified' in Unicode as the same character. An example is the [[XML]] attribute xml:lang.
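
The separation between the CEF and CES layers can be illustrated with a minimal Python sketch. The function names <code>utf16_code_units</code> and <code>serialize</code> are hypothetical helpers chosen for this illustration, not part of any standard API. The first maps a code point to 16-bit code units (the CEF step for UTF-16); the second writes those units as octets in a chosen byte order (the CES step).

```python
import struct


def utf16_code_units(code_point: int) -> list[int]:
    """CEF step: map one code point to one or two 16-bit code units."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Code points in the BMP fit in a single 16-bit unit.
        return [code_point]
    # Supplementary code points are split into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def serialize(units: list[int], big_endian: bool) -> bytes:
    """CES step: write each 16-bit code unit as two octets in a fixed byte order."""
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)


units = utf16_code_units(0x10400)        # same code units in every scheme
be = serialize(units, big_endian=True)   # UTF-16BE octets
le = serialize(units, big_endian=False)  # UTF-16LE octets
# A compound scheme can prefix a byte order mark (U+FEFF) to signal the order.
with_bom = serialize([0xFEFF] + units, big_endian=True)

print([hex(u) for u in units], be.hex(), le.hex(), with_bom.hex())
```

The code units are identical in all three cases; only the octet order, and the optional byte order mark, differ between the schemes.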

The Unicode model uses the term "character map" for other systems which directly assign a sequence of characters to a sequence of bytes, covering all of the CCS, CEF and CES layers.<ref name="utr17" />

===Unicode code points===
In Unicode, a character can be referred to as 'U+' followed by its code point value in hexadecimal. The range of valid code points (the codespace) for the Unicode standard is U+0000 to U+10FFFF, inclusive, divided into 17 [[Plane (Unicode)|planes]], identified by the numbers 0 to 16. Characters in the range U+0000 to U+FFFF are in plane 0, called the [[Plane (Unicode)#Basic Multilingual Plane|Basic Multilingual Plane]] (BMP). This plane contains the most commonly used characters. Characters in the range U+10000 to U+10FFFF in the other planes are called [[supplementary characters]].

The following table shows examples of code point values:

{| class="wikitable MsoNormalTable"
! Character
! Unicode code point
! Glyph
|-
| Latin A
| U+0041
| A
|-
| Latin sharp S
| U+00DF
| ß
|-
| Han for East
| U+6771
| 東
|-
| Ampersand
| U+0026
| &
|-
| Inverted exclamation mark
| U+00A1
| ¡
|-
| Section sign
| U+00A7
| §
|}

===Example===
Consider a [[String (computer science)|string]] of the letters "ab̲c𐐀", that is, a string containing a Unicode combining character ({{unichar|0332}}) as well as a supplementary character ({{unichar|10400}}).
This string has several Unicode representations which are logically equivalent, yet each is suited to a different set of circumstances or range of requirements:
* Four [[Character (computing)|composed characters]]:
*:{{code|a}}, {{code|b̲}}, {{code|c}}, {{code|𐐀}}
* Five [[grapheme]]s:
*:{{code|a}}, {{code|b}}, {{code|_}}, {{code|c}}, {{code|𐐀}}
* Five Unicode [[code point]]s:
*:{{code|U+0061}}, {{code|U+0062}}, {{code|U+0332}}, {{code|U+0063}}, {{code|U+10400}}
* Five UTF-32 code units (32-bit integer values):
*:{{code|0x00000061}}, {{code|0x00000062}}, {{code|0x00000332}}, {{code|0x00000063}}, {{code|0x00010400}}
* Six UTF-16 code units (16-bit integers):
*:{{code|0x0061}}, {{code|0x0062}}, {{code|0x0332}}, {{code|0x0063}}, {{code|0xD801}}, {{code|0xDC00}}
* Nine UTF-8 code units (8-bit values, or [[byte]]s):
*:{{code|0x61}}, {{code|0x62}}, {{code|0xCC}}, {{code|0xB2}}, {{code|0x63}}, {{code|0xF0}}, {{code|0x90}}, {{code|0x90}}, {{code|0x80}}

Note in particular that 𐐀 is represented with either one 32-bit value (UTF-32), two 16-bit values (UTF-16), or four 8-bit values (UTF-8). Although each of those forms uses the same total number of bits (32) to represent the glyph, it is not obvious how the actual numeric byte values are related.
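
The code-unit listings above can be reproduced with a short Python sketch that relies only on the built-in <code>str.encode</code> codecs. The helper <code>code_units</code> and the variable names are hypothetical, introduced only for this illustration. The last part packs the bits of U+10400 into four UTF-8 bytes by hand, as one way to see how the byte values relate to the code point.

```python
text = "ab\u0332c\U00010400"  # a, b, combining low line, c, U+10400

# One entry per code point, in the U+XXXX notation used above.
code_points = [f"U+{ord(ch):04X}" for ch in text]


def code_units(data: bytes, width: int) -> list[str]:
    """Split big-endian encoded bytes into fixed-width code units, shown in hex."""
    return [f"0x{data[i:i + width].hex().upper()}" for i in range(0, len(data), width)]


utf32 = code_units(text.encode("utf-32-be"), 4)  # 32-bit code units
utf16 = code_units(text.encode("utf-16-be"), 2)  # 16-bit code units
utf8 = code_units(text.encode("utf-8"), 1)       # 8-bit code units

# Four-byte UTF-8 form of a supplementary code point: the 21 payload bits are
# distributed over the pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
cp = 0x10400
utf8_manual = bytes([
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
])

print(code_points)
print(utf32)
print(utf16)
print(utf8)
print(utf8_manual.hex())
```

The sketch counts code points and code units only; segmenting the string into composed characters requires additional text-segmentation rules that are not shown here.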