=== {{anchor|UTF|UCS}}Mapping and encodings ===
Several mechanisms have been specified for storing a series of code points as a series of bytes.

<!-- [[Unicode Transformation Format]] redirects here -->
Unicode defines two mapping methods: the '''Unicode Transformation Format''' (UTF) encodings, and the '''[[Universal Coded Character Set]]''' (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode ''code points'' to sequences of values in some fixed-size range, termed ''code units''. All UTF encodings map code points to a unique sequence of bytes.<ref>{{Cite web |title=UTF-8, UTF-16, UTF-32 & BOM |url=https://unicode.org/faq/utf_bom.html |access-date=12 December 2016 |website=Unicode.org FAQ}}</ref> The numbers in the names of the encodings indicate the number of bits per code unit (for UTF encodings) or the number of bytes per code unit (for UCS encodings and [[UTF-1]]). UTF-8 and UTF-16 are the most commonly used encodings. [[Universal Coded Character Set|UCS-2]] is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.

UTF encodings include:

* [[UTF-8]], which uses one to four 8-bit units per [[code point]],<ref group=note>A [[code point]] is an abstract representation of a UCS character by an integer between 0 and 1,114,111 (1,114,112 = 2<sup>20</sup> + 2<sup>16</sup> or 17 × 2<sup>16</sup> = 0x110000 code points)</ref> and has maximal compatibility with [[ASCII]]
* [[UTF-16]], which uses one 16-bit unit per code point below {{tt|U+010000}}, and a [[Universal Character Set characters#Surrogates|surrogate pair]] of two 16-bit units per code point in the range {{tt|U+010000}} to {{tt|U+10FFFF}}
* [[UTF-32]], which uses one 32-bit unit per code point
* [[UTF-EBCDIC]], not specified as part of ''The Unicode Standard'', which uses one to five 8-bit units per code point, intended to maximize compatibility with [[EBCDIC]]
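
The code-unit counts listed above can be checked with a short Python sketch. It relies only on Python's built-in codecs; the helper names <code>code_unit_counts</code> and <code>utf16_surrogates</code> are illustrative and are not taken from ''The Unicode Standard''. The surrogate arithmetic shown is the usual one (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```python
def code_unit_counts(ch):
    """Return how many code units one code point needs in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


def utf16_surrogates(cp):
    """Split a code point in U+010000..U+10FFFF into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000                      # 20-bit value
    high = 0xD800 + (v >> 10)             # top 10 bits
    low = 0xDC00 + (v & 0x3FF)            # bottom 10 bits
    return high, low


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):06X}", code_unit_counts(ch))

high, low = utf16_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")
```

Under these assumptions, only code points at or above {{tt|U+010000}} need two UTF-16 units, while every code point needs exactly one UTF-32 unit.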

UTF-8 uses one to four 8-bit units (''bytes'') per code point and, being compact for Latin scripts and ASCII-compatible, provides the de facto standard encoding for the interchange of Unicode text. It is used by [[FreeBSD]] and most recent [[Linux distributions]] as a direct replacement for legacy encodings in general text handling.

The UCS-2 and UTF-16 encodings specify the Unicode [[byte order mark]] (BOM) for use at the beginnings of text files, which may be used to detect the byte order ([[endianness]]) of the text. The BOM, encoded as {{unichar|FEFF|Byte order mark}}, remains unambiguous when its bytes are swapped, regardless of the Unicode encoding used: {{tt|U+FFFE}} (the result of byte-swapping {{tt|U+FEFF}}) does not equate to a legal character, and {{tt|U+FEFF}} in places other than the beginning of text conveys the zero-width no-break space.
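
Byte-order detection from a leading BOM can be sketched as follows. This is a minimal illustration, assuming the input is already known to be UTF-16; the function name <code>utf16_byte_order</code> is hypothetical, while the constants come from Python's standard <code>codecs</code> module.

```python
import codecs


def utf16_byte_order(data):
    """Guess UTF-16 byte order from a leading BOM, or return None.

    Assumption: the caller already knows the data is UTF-16. The bytes
    FF FE also begin the little-endian UTF-32 BOM (FF FE 00 00), which
    this sketch does not try to tell apart.
    """
    if data[:2] == codecs.BOM_UTF16_BE:   # bytes FE FF
        return "big"
    if data[:2] == codecs.BOM_UTF16_LE:   # bytes FF FE
        return "little"
    return None                           # no BOM present


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(utf16_byte_order(sample))
```

Because {{tt|U+FFFE}} is not a legal character, a reader that sees the bytes in the "wrong" order can conclude that the text uses the opposite byte order rather than misreading the mark as a character.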

The same character converted to UTF-8 becomes the byte sequence <code>EF BB BF</code>. ''The Unicode Standard'' allows that the BOM "can serve as a signature for UTF-8 encoded text where the character set is unmarked".<ref>{{Cite book |title=The Unicode Standard, Version 6.2 |publisher=The Unicode Consortium |year=2013 |isbn=978-1-936213-08-5 |page=561}}</ref> Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit [[code page]]s. However, {{IETF RFC|3629}}, the UTF-8 standard, recommends that byte order marks be forbidden in protocols using UTF-8, while discussing the cases where this may not be possible. In addition, the tight restrictions on possible byte patterns in UTF-8 (for instance, there cannot be any lone bytes with the high bit set) mean that it should be possible to distinguish UTF-8 from other character encodings without relying on the BOM.
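
Both points can be illustrated with a small sketch: recognising the optional <code>EF BB BF</code> signature, and testing whether a byte string is well-formed UTF-8 without it. The function names are illustrative. The check is only a heuristic, since pure ASCII text is valid UTF-8 and is also valid in most 8-bit code pages.

```python
import codecs


def strip_utf8_signature(data):
    """Remove a leading EF BB BF signature if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def looks_like_utf8(data):
    """Return True if the bytes form well-formed UTF-8.

    A strict decode rejects lone bytes with the high bit set, truncated
    sequences, and other patterns that UTF-8 does not permit.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "caf\u00e9"
print(looks_like_utf8(text.encode("utf-8")))    # well-formed UTF-8
print(looks_like_utf8(text.encode("latin-1")))  # lone high-bit byte E9
```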

In UTF-32 and UCS-4, one [[32-bit computing|32-bit]] code unit serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code unit manifests as a byte sequence). In the other encodings, each code point may be represented by a variable number of code units. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system that uses the [[GNU Compiler Collection|GCC]] compilers to generate software uses it as the standard "[[wide character]]" encoding. Some programming languages, such as [[Seed7]], use UTF-32 as an internal representation for strings and characters. Versions of the [[Python (programming language)|Python]] programming language beginning with 2.2 could also be configured at build time to use UTF-32 as the representation for Unicode strings (an option that was dropped in version 3.3), which helped spread the encoding in software written in [[high-level programming language|high-level]] languages.
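
The effect of endianness on a UTF-32 code unit can be seen in a brief sketch. The helper name <code>utf32_bytes</code> is hypothetical; it simply writes the code point as a 4-byte integer in the requested byte order.

```python
def utf32_bytes(ch, byteorder):
    """Serialise one code point as a single 32-bit code unit.

    byteorder is "big" or "little", as accepted by int.to_bytes.
    """
    return ord(ch).to_bytes(4, byteorder)


euro = "\u20ac"  # U+20AC EURO SIGN
print(utf32_bytes(euro, "big").hex(" "))
print(utf32_bytes(euro, "little").hex(" "))
```

The code unit has the same value (0x000020AC) in both cases; only its layout as a byte sequence differs.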

[[Punycode]], another encoding form, enables the encoding of Unicode strings into the limited character set supported by the [[ASCII]]-based [[Domain Name System]] (DNS). The encoding is used as part of [[IDNA]], which is a system enabling the use of [[Internationalized Domain Names]] in all scripts that are supported by Unicode. Earlier and now historical proposals include [[UTF-5]] and [[UTF-6]].
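
As an illustration, Python's standard library ships a <code>punycode</code> codec that can be used to sketch how a non-ASCII label is turned into its ASCII form. The helper name <code>to_ace_label</code> is hypothetical, and the sketch skips the normalisation and validation steps that a full IDNA implementation performs; it only adds the <code>xn--</code> prefix used for encoded labels.

```python
def to_ace_label(label):
    """Encode one domain label with Punycode and add the "xn--" prefix.

    Simplifying assumption: the label is already lower-case and
    normalised. Pure-ASCII labels are returned unchanged.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


print(to_ace_label("b\u00fccher"))                 # label containing U+00FC
print(b"bcher-kva".decode("punycode"))             # decoding reverses the mapping
```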

[[GB 18030|GB18030]] is another encoding form for Unicode, from the [[Standardization Administration of China]]. It is the official [[character set]] of the People's Republic of China (PRC). [[Binary Ordered Compression for Unicode|BOCU-1]] and [[Standard Compression Scheme for Unicode|SCSU]] are Unicode compression schemes. The [[April Fools' Day RFC]] of 2005 specified two parody UTF encodings, [[UTF-9]] and [[UTF-18]].