==Unicode variable-width encodings==
The [[Unicode]] standard has two variable-width encodings, [[UTF-8]] and [[UTF-16]], alongside one fixed-width encoding, [[UTF-32]]. Originally, both the Unicode and [[ISO 10646]] standards were meant to be fixed-width, with Unicode being 16-bit and ISO 10646 being 32-bit.{{Citation needed|date=April 2013}}

ISO 10646 provided a variable-width encoding called [[UTF-1]], in which singletons had the range 00–9F, lead units the range A0–FF and trail units the ranges A0–FF and 21–7E. Because its trail-unit values overlap those of both singletons and lead units, a weakness it shares with [[Shift JIS]] and [[Big5]], the designers of the [[Plan 9 from Bell Labs|Plan 9]] operating system, the first to implement Unicode throughout, abandoned it in favor of a new variable-width encoding for Unicode without that overlap: UTF-8. In UTF-8, singletons have the range 00–7F, lead units have the range C0–FD (now C2–F4, to avoid overlong sequences and to match the encoding capacity of UTF-16; see the [[UTF-8]] article), and trail units have the range 80–BF. The lead unit also tells how many trail units follow: one after C2–DF, two after E0–EF and three after F0–F4.{{efn|In the original version of UTF-8, from its 1992 publication until its code space was restricted to that of UTF-16 in 2003, the range of lead units encoding three-unit trailing sequences was larger (F0–F7); additionally, the lead units F8–FB were followed by four trail units, and FC–FD by five. FE–FF were never valid lead or trail units in any version of UTF-8.}}

UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the ranges 0000–D7FF (55,296 code points) and E000–FFFF (8192 code points, 63,488 in total), lead units the range D800–DBFF (1024 code points) and trail units the range DC00–DFFF (1024 code points, 2048 in total).
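
The unit ranges given above for UTF-8 and UTF-16 amount to simple range checks. The following Python sketch is illustrative only: the function names are hypothetical, the UTF-8 checks use the current C2–F4 lead range rather than the original C0–FD range, and a real decoder would need further validation (for example, of which trail units may follow E0, ED, F0 and F4).

```python
def utf8_unit_kind(byte: int) -> str:
    """Classify one UTF-8 code unit (hypothetical helper, current C2-F4 lead range)."""
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("a UTF-8 code unit must fit in 8 bits")
    if byte <= 0x7F:
        return "singleton"
    if byte <= 0xBF:            # 80-BF
        return "trail"
    if 0xC2 <= byte <= 0xF4:
        return "lead"
    return "invalid"            # C0, C1 and F5-FF never appear in current UTF-8


def utf8_trail_count(lead: int) -> int:
    """Return how many trail units follow the given UTF-8 lead unit."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    raise ValueError(f"{lead:#04x} is not a UTF-8 lead unit")


def utf16_unit_kind(unit: int) -> str:
    """Classify one UTF-16 code unit as singleton, lead or trail."""
    if not 0x0000 <= unit <= 0xFFFF:
        raise ValueError("a UTF-16 code unit must fit in 16 bits")
    if 0xD800 <= unit <= 0xDBFF:
        return "lead"           # high surrogate
    if 0xDC00 <= unit <= 0xDFFF:
        return "trail"          # low surrogate
    return "singleton"          # 0000-D7FF and E000-FFFF
```

For instance, the euro sign (U+20AC) is encoded in UTF-8 as the bytes E2 82 AC: <code>utf8_trail_count(0xE2)</code> gives 2, and both 82 and AC classify as trail units.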
The lead and trail units, called ''high surrogates'' and ''low surrogates'' respectively in Unicode terminology, together map 1024×1024 = 1,048,576 supplementary code points. This gives 1,112,064 encodable code points (63,488 BMP code points plus 1,048,576 code points represented by high and low surrogate pairs), which Unicode calls ''scalar values''; the surrogates themselves are not encodable.
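
The surrogate arithmetic can be sketched in Python as follows. The text above gives only the ranges and counts; the offset-and-split formula used here is the standard UTF-16 mapping, and the function and constant names are hypothetical, chosen for illustration.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high surrogate, low surrogate)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value, 0 .. 0xFFFFF
    high = 0xD800 + (offset >> 10)       # top 10 bits select the lead unit
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits select the trail unit
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into its code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Counts quoted in the text, recomputed from the ranges.
BMP_SCALAR_VALUES = (0xD7FF - 0x0000 + 1) + (0xFFFF - 0xE000 + 1)  # 55,296 + 8192
SUPPLEMENTARY_CODE_POINTS = 1024 * 1024
TOTAL_SCALAR_VALUES = BMP_SCALAR_VALUES + SUPPLEMENTARY_CODE_POINTS
```

For example, U+1F600 has offset F600 after subtracting 10000 (hexadecimal), so <code>to_surrogate_pair(0x1F600)</code> yields the pair D83D, DE00, and <code>TOTAL_SCALAR_VALUES</code> evaluates to 1,112,064.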