Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Variable-width encoding
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
==CJK multibyte encodings== The first use of multibyte encodings was for the encoding of Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to the limit of 7 bits. The [[ISO/IEC 2022|ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings]] used the range 21β7E (hexadecimal) for both lead units and trail units, and marked them off from the singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94Γ94) characters could be encoded at first, and further sets of 94Γ94 characters with switching. The ISO 2022 encoding schemes for CJK are still in use on the Internet. The stateful nature of these encodings and the large overlap make them very awkward to process. On [[Unix]] platforms, the ISO 2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range 80βFF (hexadecimal), while the singletons were in the range 00β7F alone. The lead units and trail units were in the range A1 to FE (hexadecimal), that is, the same as their range in the ISO 2022 encodings, but with the high bit set to 1. These encodings were reasonably easy to work with provided all your delimiters were [[ASCII]] characters and you avoided truncating strings to fixed lengths, but a break in the middle of a multibyte character could still cause major corruption. On the PC ([[DOS]] and [[Microsoft Windows]] platforms), two encodings became established for Japanese and Traditional Chinese in which all of singletons, lead units and trail units overlapped: [[Shift-JIS]] and [[Big5]] respectively. In Shift-JIS, lead units had the range 81β9F and E0βFC, trail units had the range 40β7E and 80βFC, and singletons had the range 21β7E and A1βDF. In Big5, lead units had the range A1βFE, trail units had the range 40β7E and A1βFE, and singletons had the range 21β7E (all values in hexadecimal). This overlap again made processing tricky, though at least most of the symbols had unique byte values (though strangely the backslash does not).<!-- FIXME: GBK and code page 949 should probably also be mentioned here. -->
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)