
==Encoding==
The original Big5 character set is sorted first by usage frequency, second by stroke count, lastly by [[List of Kangxi radicals|Kangxi radical]].

The original Big5 character set lacked many commonly used characters. To solve this problem, each vendor developed its own extension. The [[ETen Chinese System|ETen]] extension became part of the current Big5 standard owing to its popularity.

The structure of Big5 does not conform to the [[ISO 2022]] standard, but rather bears a certain similarity to the {{nowrap|[[Shift JIS]]}} encoding. It is a [[double-byte character set|double-byte character set (DBCS)]] with the following structure:
{| border=1 style="border-collapse: collapse" class="wikitable plainrowheaders"
|-
! scope="row"| First byte ("lead byte") 
| {{mono|0x81}} to {{mono|0xfe}} (or {{mono|0xa1}} to {{mono|0xf9}} for non-user-defined characters) 
|-
! scope="row"| Second byte 
| {{mono|0x40}} to {{mono|0x7e}}, {{mono|0xa1}} to {{mono|0xfe}}
|}
(the prefix 0x signifying hexadecimal numbers).

Standard assignments (excluding vendor or user-defined extensions) do not use the bytes {{mono|0x7F}} through {{mono|0xA0}}, nor {{mono|0xFF}}, as either lead (first) or trail (second) bytes. Bytes {{mono|0xA1}} through {{mono|0xFE}} are used for both lead and trail bytes for double-byte (Big5) codes. Bytes {{mono|0x40}} through {{mono|0x7E}} are used as trail bytes following a lead byte, or for single-byte codes otherwise. If the second byte is not in either range, [[unspecified behavior|behavior is unspecified]] (i.e., varies from system to system). Additionally, certain variants of the Big5 character set, for example the [[HKSCS]], use an expanded range for the lead byte, including values in the {{mono|0x81}} to {{mono|0xA0}} range (similar to {{nowrap|Shift JIS}}), whereas others use reduced lead byte ranges (for instance, the Apple Macintosh variant uses {{mono|0xFD}} through {{mono|0xFF}} as single-byte codes, limiting the lead byte range to {{mono|0xA1}} through {{mono|0xFC}}).<ref name="mactradchinese">{{citation|mode=cs1|url=https://unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT|title=Map (external version) from Mac OS Chinese Traditional encoding to Unicode 3.0 and later.|author=Apple, Inc|author-link=Apple, Inc|publisher=[[Unicode Consortium]]|date=2005-04-04|orig-year=1996-06-31|access-date=2021-02-24|archive-date=2021-05-14|archive-url=https://web.archive.org/web/20210514182521/https://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT|url-status=live}}</ref>
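
The byte ranges above can be expressed as a small validity check. The following is a minimal sketch in Python, not a reference implementation; the function names and the <code>extended</code> flag (which switches the lead byte range from {{mono|0xA1}}–{{mono|0xF9}} to {{mono|0x81}}–{{mono|0xFE}}) are hypothetical and chosen only for illustration.

```python
def is_lead_byte(b, extended=False):
    """Return True if b can start a double-byte Big5 code.

    extended=False: 0xA1-0xF9 (non-user-defined characters).
    extended=True:  0x81-0xFE (full range, including user-defined zones).
    The 'extended' flag is an illustrative assumption, not part of any standard API.
    """
    lo, hi = (0x81, 0xFE) if extended else (0xA1, 0xF9)
    return lo <= b <= hi


def is_trail_byte(b):
    """Return True if b lies in one of the two trail byte ranges."""
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


def is_valid_pair(lead, trail, extended=False):
    """Return True if (lead, trail) is structurally a Big5 double-byte code."""
    return is_lead_byte(lead, extended) and is_trail_byte(trail)
```

The check is purely structural: it says whether a byte pair falls inside the ranges in the table, not whether a character is actually assigned to it in any particular variant.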

The numerical values of individual Big5 codes are frequently given as 4-digit hexadecimal numbers, which describe the two bytes that make up the Big5 code as if they were a [[big endian]] representation of a 16-bit number. For example, the Big5 code for a full-width space, which consists of the bytes {{mono|0xa1}} {{mono|0x40}}, is usually written as {{mono|0xa140}} or just A140.
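
As an illustration of this notation, the following Python sketch converts between a byte pair and its 4-digit hexadecimal form. The helper names <code>big5_code</code> and <code>split_code</code> are hypothetical.

```python
def big5_code(lead, trail):
    """Combine two bytes into a 16-bit number, lead byte first (big endian)."""
    return int.from_bytes(bytes([lead, trail]), "big")


def split_code(code):
    """Split a 16-bit Big5 code back into its (lead, trail) bytes."""
    lead, trail = code.to_bytes(2, "big")
    return lead, trail


# The full-width space: bytes 0xA1 0x40, conventionally written A140.
code = big5_code(0xA1, 0x40)
label = format(code, "04X")
```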

Strictly speaking, the Big5 encoding contains only DBCS characters. However, in practice, the Big5 codes are always used together with an unspecified, system-dependent [[SBCS|single-byte character set (SBCS)]] (such as [[ASCII]] or [[code page 437]]), so that Big5-encoded text contains a mix of double-byte characters and single-byte characters. Bytes in the range {{mono|0x00}} to {{mono|0x7f}} that are not part of a double-byte character are assumed to be single-byte characters. (For a more detailed description of this problem, please see the discussion on "The Matching SBCS" below.)

The meaning of non-ASCII single bytes that fall outside the permitted values and are not part of a double-byte character varies from system to system. In old MS-DOS-based systems, they are likely to be displayed as 8-bit characters; in modern systems, they are likely to either give unpredictable results or generate an error.
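
The mixing of single-byte and double-byte characters can be sketched as a simple scanner that splits a byte string into units. This is only one possible approach under stated assumptions: the function name <code>split_units</code> is hypothetical, the default lead byte range is {{mono|0xA1}}–{{mono|0xF9}}, and a lead byte that is not followed by a valid trail byte is passed through as a stray single byte, which is a choice made here because the behavior in that case is unspecified.

```python
def split_units(data, lead_lo=0xA1, lead_hi=0xF9):
    """Split Big5-encoded bytes into single-byte and double-byte units.

    Assumption: a lead byte without a valid trail byte is emitted as a
    stray single byte; real systems differ on this point.
    """
    def is_trail(b):
        return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE

    units = []
    i = 0
    while i < len(data):
        b = data[i]
        has_next = i + 1 < len(data)
        if lead_lo <= b <= lead_hi and has_next and is_trail(data[i + 1]):
            units.append(data[i:i + 2])   # double-byte character
            i += 2
        else:
            units.append(data[i:i + 1])   # single byte (SBCS or stray)
            i += 1
    return units
```

Note that the scanner has to start from a known character boundary: a byte such as {{mono|0x40}} is a single-byte character on its own but a trail byte when it follows a lead byte.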

===A more detailed look at the organization===
In the original Big5, the encoding is compartmentalized into different zones:
{| class="wikitable"
|-
| {{mono|0x8140}} to {{mono|0xA0FE}}|| Reserved for user-defined characters 造字
|-
| {{mono|0xA140}} to {{mono|0xA3BF}}|| "Graphical characters" 圖形碼
|-
| {{mono|0xA3C0}} to {{mono|0xA3FE}}|| Reserved, ''not'' for user-defined characters 
|-
| {{mono|0xA440}} to {{mono|0xC67E}}|| Frequently used characters 常用字
|-
| {{mono|0xC6A1}} to {{mono|0xC8FE}}|| Reserved for user-defined characters
|-
| {{mono|0xC940}} to {{mono|0xF9D5}}|| Less frequently used characters 次常用字
|-
| {{mono|0xF9D6}} to {{mono|0xFEFE}}|| Reserved for user-defined characters
|}
The "graphical characters" actually comprise punctuation marks, partial punctuation marks (e.g., half of a dash, half of an ellipsis; see below), [[dingbat]]s, foreign characters, and other special characters (e.g., presentational "full width" forms, digits for [[Suzhou numerals]], [[bopomofo|zhuyin fuhao]], etc.).
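
The zone table can be turned into a lookup. The sketch below is illustrative only: the <code>ZONES</code> list copies the ranges from the table above, the English zone labels and the function name <code>zone_of</code> are hypothetical, and codes whose second byte is not a valid trail byte are reported as belonging to no zone.

```python
# (first code, last code, label), copied from the zone table above.
ZONES = [
    (0x8140, 0xA0FE, "user-defined"),
    (0xA140, 0xA3BF, "graphical characters"),
    (0xA3C0, 0xA3FE, "reserved"),
    (0xA440, 0xC67E, "frequently used characters"),
    (0xC6A1, 0xC8FE, "user-defined"),
    (0xC940, 0xF9D5, "less frequently used characters"),
    (0xF9D6, 0xFEFE, "user-defined"),
]


def zone_of(code):
    """Return the zone label for a 16-bit Big5 code, or None if it has none."""
    trail = code & 0xFF
    # A numeric range such as 0xA140-0xA3BF also spans values like 0xA17F,
    # whose second byte is not a valid trail byte, so check that first.
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return None
    for first, last, label in ZONES:
        if first <= code <= last:
            return label
    return None
```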

In most vendor extensions, extended characters are placed in the various zones reserved for user-defined characters, each of which is normally regarded as associated with the preceding zone. For example, additional "graphical characters" (e.g., punctuation marks) would be expected to be placed in the {{mono|0xa3c0}}–{{mono|0xa3fe}} range, and additional logograms would be placed in either the {{mono|0xc6a1}}–{{mono|0xc8fe}} or the {{mono|0xf9d6}}–{{mono|0xfefe}} range. Sometimes, this is not possible because of the large number of extended characters to be added; for example, [[Cyrillic]] letters and Japanese [[kana]] have been placed in the zone associated with "frequently-used characters".

===Duplicates===
Big5 encodes two characters twice: "兀" at 0xA461 (U+5140) and 0xC94A (U+FA0C), and "嗀" at 0xDCD1 (U+55C0) and 0xDDFC (U+FA0D).
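
The relationship between the duplicate codes can be sketched in Python. The table below contains only the four codes listed above; the names <code>DUPLICATES</code> and <code>canonical_groups</code> are hypothetical. The sketch assumes that U+FA0C and U+FA0D are compatibility ideographs that Unicode normalization folds into U+5140 and U+55C0 respectively, which is how Python's <code>unicodedata</code> module is expected to treat them.

```python
import unicodedata

# Big5 code -> Unicode character, taken from the four codes listed above.
DUPLICATES = {
    0xA461: "\u5140",
    0xC94A: "\uFA0C",
    0xDCD1: "\u55C0",
    0xDDFC: "\uFA0D",
}


def canonical_groups(table):
    """Group Big5 codes whose characters normalize (NFC) to the same character."""
    groups = {}
    for code, char in table.items():
        key = unicodedata.normalize("NFC", char)
        groups.setdefault(key, []).append(code)
    # Keep only characters reachable from more than one Big5 code.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}
```

Under that assumption, text that is normalized after conversion to Unicode can no longer be converted back to the original Big5 bytes with certainty, since both codes of each pair end up as the same character.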

Some encoding mappings also map the three Suzhou numerals, "〸", "〹" and "〺", in the graphical section to ideograph characters (U+5341, U+5344 and U+5345 respectively)<ref>{{Cite web|title=Unicode CP950 mapping file|url=https://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT|website=Unicode|publisher=[[Unicode Consortium]]|access-date=2023-05-11|archive-date=2023-06-27|archive-url=https://web.archive.org/web/20230627235611/https://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT|url-status=live}}</ref><ref>{{Cite web|title=Unicode Big5 mapping file|url=https://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT|website=Unicode|publisher=[[Unicode Consortium]]|access-date=2023-05-11|archive-date=2023-06-27|archive-url=https://web.archive.org/web/20230627235404/https://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT|url-status=live}}</ref> instead of [[CJK Symbols and Punctuation]] (U+3038, U+3039 and U+303A respectively).<ref>{{Cite web|title=Mozilla 系列與 Big5 中文字碼（Big5-2003）|url=https://moztw.org/docs/big5/table/big5_2003-b2u.txt|website=Mozilla 台湾社群|lang=zh-TW|access-date=2020-07-01|archive-date=2023-06-27|archive-url=https://web.archive.org/web/20230627234452/https://moztw.org/docs/big5/table/big5_2003-b2u.txt|url-status=live}}</ref><ref>The ETEN mapping file provided by Mozilla Taiwan community maps the three characters to both the symbol and ideograph codepoint. {{Cite web|title=Mozilla 系列與 Big5 中文字碼（ETEN）|url=https://moztw.org/docs/big5/table/eten.txt|website=Mozilla 台湾社群|lang=zh-TW|access-date=2020-07-01|archive-date=2023-06-27|archive-url=https://web.archive.org/web/20230627234353/https://moztw.org/docs/big5/table/eten.txt|url-status=live}}</ref>

===What a Big5 code actually encodes===
An individual Big5 code does not always represent a complete semantic unit. The Big5 codes of logograms are always logograms, but codes in the "graphical characters" section are not always complete "graphical characters". What Big5 encodes are particular graphical representations of characters, or of parts of characters, that happen to fit in the space taken by two monospaced ASCII characters. This is a property of [[CJK characters|CJK]] double-byte character sets in general, and is not a problem unique to Big5.

(The above might need some explanation by putting it in historical perspective, as it is ''theoretically'' incorrect: Back when text-mode personal computing was still the norm, characters were normally represented as single bytes and each character took one position on the screen. There was therefore a practical reason to insist that double-byte characters must take up two positions on the screen, namely that off-the-shelf, American-made software would then be usable without modification in a DBCS-based system. If a character could take an arbitrary number of screen positions, software that assumed that one ''byte'' of text takes one screen position would produce incorrect output. Of course, if a computer never had to deal with the text screen, the manufacturer would not enforce this artificial restriction; the Apple Macintosh is an example. Nevertheless, the encoding itself must be designed so that it works correctly on text-screen-based systems.)

To illustrate this point, consider the Big5 code {{mono|0xa14b}} (…). To English speakers this looks like an ellipsis and the Unicode standard identifies it as such; however, in Chinese, the ellipsis consists of six dots that fit in the space of two Chinese characters (……), so in fact there is no Big5 code for the Chinese ellipsis, and the Big5 code {{mono|0xa14b}} just represents half of a Chinese ellipsis. It represents only half of an ellipsis because the whole ellipsis should take the space of two Chinese characters, and in many DBCS systems one DBCS character must take exactly the space of one Chinese character.

Characters encoded in Big5 do not always represent things that can be readily used in plain text files; an example is the "citation mark" ({{mono|0xa1ca}}, ﹋), which is, when used, required to be typeset under the title of literary works. Another example is the Suzhou numerals, which are a form of [[scientific notation]] that requires the number to be laid out in a 2-D form consisting of at least two rows.

===The Matching SBCS===
In practice, Big5 cannot be used without a matching SBCS, mostly for compatibility reasons. However, as in the case of other CJK DBCS character sets, the SBCS to use has never been specified. Big5 has always been defined as a DBCS, though when used it must be paired with a suitable, ''unspecified'' SBCS and is therefore used as what some people call an [[Variable-width encoding|MBCS]]; nevertheless, Big5 by itself, as defined, is strictly a DBCS.

The SBCS to use being unspecified implies that the SBCS used can theoretically vary from system to system. Nowadays, ASCII is the only possible SBCS one would use. However, in old [[MS-DOS|DOS]]-based systems, [[code page 437]]—with its extra special symbols in the control code area including position 127—was much more common. Yet, on a Macintosh system with the Chinese Language Kit, or on a Unix system running the cxterm terminal emulator, the SBCS paired with Big5 would not be code page 437.

Outside the valid range of Big5, the old DOS-based systems would routinely interpret things according to the SBCS that is paired with Big5 on that system. In such systems, characters 127 to 160, for example, were very likely not avoided because they would produce invalid Big5, but used because they would be valid characters in code page 437.
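
The effect of the paired SBCS can be illustrated with a decoder that takes the single-byte character set as a parameter. This is a rough sketch rather than a description of how any historical system worked: the function name <code>decode_mixed</code> is hypothetical, it relies on the <code>big5</code>, <code>cp437</code> and <code>ascii</code> codecs that ship with Python, and it assumes a lead byte range of {{mono|0xA1}}–{{mono|0xF9}}.

```python
def decode_mixed(data, sbcs="ascii"):
    """Decode Big5 text, interpreting non-Big5 bytes with the given SBCS codec.

    'sbcs' is any single-byte codec name known to Python, e.g. "ascii"
    or "cp437". Undecodable bytes become U+FFFD (errors="replace").
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        nxt = data[i + 1] if i + 1 < len(data) else None
        is_pair = (
            0xA1 <= b <= 0xF9
            and nxt is not None
            and (0x40 <= nxt <= 0x7E or 0xA1 <= nxt <= 0xFE)
        )
        if is_pair:
            out.append(data[i:i + 2].decode("big5", errors="replace"))
            i += 2
        else:
            # Outside the Big5 ranges: the result depends entirely on the SBCS.
            out.append(data[i:i + 1].decode(sbcs, errors="replace"))
            i += 1
    return "".join(out)
```

With <code>sbcs="cp437"</code>, the byte {{mono|0x80}} decodes to a visible character, whereas with <code>sbcs="ascii"</code> the same byte is rejected and replaced, which mirrors the difference between old DOS-based systems and an ASCII-only interpretation.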

The modern characterization of Big5 as an MBCS consisting of the DBCS of Big5 plus the SBCS of ASCII is therefore historically incorrect and potentially flawed, as the choice of the matching SBCS was, and theoretically still is, quite independent of the flavour of Big5 being used.