Editing
Extended Unix Code
(section)
==Encoding structure==
[[File:Ecma43 versus EUC.svg|thumb|right|Relationship between packed EUC and other 8-bit {{nowrap|ISO 2022}} profiles]]
The structure of EUC is based on the {{nowrap|[[ISO/IEC 2022]]}} standard, which specifies a system of graphical character sets that can be represented with a sequence of the 94 7-bit bytes [[hexadecimal|0x]]21–7E, or alternatively 0xA1–FE if an eighth bit is available. Using one, two, or three such bytes per character, this allows for sets of 94 graphical characters, 8836 (94<sup>2</sup>) characters, or 830584 (94<sup>3</sup>) characters respectively. Although initially 0x20 and 0x7F were always the [[space character|space]] and {{ctrl|DEL|delete character}} and 0xA0 and 0xFF were unused, later editions of {{nowrap|ISO/IEC 2022}} allowed the use of the bytes 0xA0 and 0xFF (or 0x20 and 0x7F) within sets under certain circumstances, allowing the inclusion of 96-character sets. The ranges 0x00–1F and 0x80–9F are used for [[C0 and C1 control codes]].

EUC is a family of 8-bit profiles of {{nowrap|ISO/IEC 2022}}, as opposed to 7-bit profiles such as [[ISO-2022-JP]]. As such, only {{nowrap|ISO 2022}} compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. The G0 set is an {{nowrap|[[ISO/IEC 646]]}} compliant coded character set such as [[ASCII]], {{nowrap|ISO 646:KR}} ({{nowrap|KS X 1003}}) or {{nowrap|[[JISCII|ISO 646:JP]]}} (the lower half of {{nowrap|JIS X 0201}}) and is invoked over GL (i.e. 0x21–0x7E, with the most significant bit cleared).<ref name="cdra" /> If ASCII is used, this makes the code an [[extended ASCII]] encoding; the most common deviation from ASCII is that 0x5C ([[backslash]] in ASCII) is often used to represent a [[yen sign]] in EUC-JP (see below) and a [[won sign]] in EUC-KR. The other code sets are invoked over GR (i.e. with the most significant bit set).
Hence, to get the EUC form of a character, the most significant bit of each coding byte is set (equivalent to adding 128 to each 7-bit coding byte, or adding 160 to each number in the [[kuten]] code); this allows the software to easily distinguish whether a particular byte in a [[character string]] belongs to the {{nowrap|ISO 646}} code or the extended code. Characters in code sets 2 and 3 are prefixed with the control codes {{ctrl|SS2}} (0x8E) and {{ctrl|SS3}} (0x8F) respectively, and invoked over GR. Besides the initial shift code, any byte outside of the range 0xA0–0xFF appearing in a character from code sets 1 through 3 is not a valid EUC code.<ref name="cdra" />

The EUC code itself does not make use of the announcement and designation sequences from {{nowrap|ISO 2022}}.<ref name="cdra" /> However, the code specification is equivalent to the following sequence of four {{nowrap|ISO 2022}} announcement sequences, with meanings breaking down as follows.<ref name="cdra">{{cite web |url=https://www.ibm.com/downloads/cas/G01BQVRV#page=157 |pages=157–162 |title=Character Data Representation Architecture (CDRA) |author=IBM |website=[[IBM]] |author-link=IBM}}</ref>

{|class=wikitable
|-
!Individual sequence!!Hexadecimal!!Feature of EUC denoted
|-
|<code>ESC SP C</code>||<code>1B 20 43</code>||ISO-8 (8-bit, G0 in GL, G1 in GR)
|-
|<code>ESC SP Z</code>||<code>1B 20 5A</code>||G2 accessed using SS2
|-
|<code>ESC SP [</code>||<code>1B 20 5B</code>||G3 accessed using SS3
|-
|<code>ESC SP \</code>||<code>1B 20 5C</code>||Single-shifts invoke over GR
|}

===Fixed-length format===
[[File:CsEucFixWidJapanese.svg|right|thumb|Layout of the fixed-length format for Japanese]]
The ISO-2022-based [[variable-width encoding|variable-length encoding]] described above is sometimes referred to as the ''EUC packed format'', which is the encoding format usually labeled as EUC.
However, internal processing of EUC data may make use of a fixed-length transformation format called the '''EUC complete two-byte format'''. This represents:<ref name="lunde" />
* Code set 0 as two bytes in the range 0x21–0x7E (except that the first may be 0x00).
* Code set 1 as two bytes in the range 0xA0–0xFF (except that the first may be 0x80).
* Code set 2 as a byte in the range 0x21–0x7E (or 0x00) followed by a byte in the range 0xA0–0xFF.
* Code set 3 as a byte in the range 0xA0–0xFF (or 0x80) followed by a byte in the range 0x21–0x7E.
Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format.<ref name="lunde" /> These fixed-length encoding formats are suited to internal processing and are not usually encountered in interchange.

EUC-JP is registered with the IANA in both formats, the packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese".<ref>{{cite web | url=https://www.iana.org/assignments/character-sets/character-sets.xhtml | publisher=IANA | title=Character Sets}}</ref> Only the packed format is included in the [[WHATWG]] Encoding Standard used by [[HTML5]].<ref>{{cite web | url=https://encoding.spec.whatwg.org/#names-and-labels | title=4.2. Names and labels | publisher=WHATWG | work=Encoding Standard}}</ref>
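
The byte-level rules described in this section can be illustrated with a minimal Python sketch. It is not taken from any standard or library: the function names (<code>kuten_to_euc</code>, <code>split_packed</code>, <code>to_fixed_two_byte</code>) are hypothetical, and the per-code-set byte counts in <code>EUC_JP_WIDTHS</code> are an assumption that holds for EUC-JP but differs in other EUC variants. For simplicity the sketch rejects C1 control bytes other than SS2 and SS3, and it applies the 0xA0–0xFF validity range stated above.

```python
SS2, SS3 = 0x8E, 0x8F

# Assumption: bytes per character in each code set (excluding the SS2/SS3
# prefix) as used by EUC-JP. Other EUC variants use different widths.
EUC_JP_WIDTHS = {0: 1, 1: 2, 2: 1, 3: 2}


def kuten_to_euc(row, cell):
    """Turn a kuten (row, cell) pair into two GR bytes by adding 160 (0xA0)."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    return bytes([row + 0xA0, cell + 0xA0])


def split_packed(data, widths=EUC_JP_WIDTHS):
    """Split packed-format EUC bytes into (code_set, payload) tuples.

    The payload excludes any SS2/SS3 prefix byte.
    """
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:        # high bit clear: code set 0 (G0 over GL)
            code_set, start = 0, i
        elif lead == SS2:      # single shift 2 prefixes code set 2
            code_set, start = 2, i + 1
        elif lead == SS3:      # single shift 3 prefixes code set 3
            code_set, start = 3, i + 1
        elif lead >= 0xA0:     # unprefixed GR bytes: code set 1
            code_set, start = 1, i
        else:                  # other C1 controls: rejected in this sketch
            raise ValueError(f"unexpected byte 0x{lead:02X} at offset {i}")
        end = start + widths[code_set]
        payload = data[start:end]
        if len(payload) != widths[code_set]:
            raise ValueError(f"truncated character at offset {i}")
        if code_set != 0 and any(b < 0xA0 for b in payload):
            raise ValueError(f"invalid trailing byte at offset {i}")
        chars.append((code_set, payload))
        i = end
    return chars


def to_fixed_two_byte(chars):
    """Re-encode (code_set, payload) tuples in the complete two-byte format.

    Payloads must be one or two bytes long. One-byte payloads are padded
    with an initial 0x00 (code sets 0 and 2) or 0x80 (code sets 1 and 3).
    """
    out = bytearray()
    for code_set, payload in chars:
        if not 1 <= len(payload) <= 2:
            raise ValueError("payload must be one or two bytes")
        if code_set == 0:      # both bytes in GL, or 0x00 then GL
            pair = payload.rjust(2, b"\x00")
        elif code_set == 1:    # both bytes in GR, or 0x80 then GR
            pair = payload.rjust(2, b"\x80")
        elif code_set == 2:    # first byte GL (or 0x00), second byte GR
            if len(payload) == 1:
                pair = b"\x00" + payload
            else:
                pair = bytes([payload[0] & 0x7F, payload[1]])
        else:                  # code set 3: first GR (or 0x80), second GL
            if len(payload) == 1:
                pair = bytes([0x80, payload[0] & 0x7F])
            else:
                pair = bytes([payload[0], payload[1] & 0x7F])
        out += pair
    return bytes(out)
```

For example, <code>kuten_to_euc(4, 2)</code> gives the bytes A4 A2, and <code>split_packed</code> applied to the bytes 41 A4 A2 8E B1 8F B0 A1 yields one character from each of code sets 0, 1, 2 and 3. Passing that result to <code>to_fixed_two_byte</code> produces 00 41, A4 A2, 00 B1 and B0 21, matching the four cases listed above.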