== Description ==

UTF-8 encodes code points in one to four bytes, depending on the value of the code point. In the following table, the characters {{mono|u}} to {{mono|z}} are replaced by the bits of the code point, from the positions {{mono|U+uvwxyz}}:

{| class="wikitable"
|+ Code point ↔ UTF-8 conversion
|-
! First code point
! Last code point
! Byte 1
! Byte 2
! Byte 3
! Byte 4
|-
| style="text-align: right" | {{tt|U+0000}}
| style="text-align: right" | {{tt|U+007F}}
| {{mono|0yyyzzzz}}
| style="background: darkgray" colspan=3 |
|-
| style="text-align: right" | {{tt|U+0080}}
| style="text-align: right" | {{tt|U+07FF}}
| {{mono|110xxxyy}}
| {{mono|10yyzzzz}}
| style="background: darkgray" colspan=2 |
|-
| style="text-align: right" | {{tt|U+0800}}
| style="text-align: right" | {{tt|U+FFFF}}
| {{mono|1110wwww}}
| {{mono|10xxxxyy}}
| {{mono|10yyzzzz}}
| style="background: darkgray" |
|-
| style="text-align: right" | {{tt|U+010000}}
| style="text-align: right" | {{tt|U+10FFFF}}
| {{mono|11110uvv}}
| {{mono|10vvwwww}}
| {{mono|10xxxxyy}}
| {{mono|10yyzzzz}}
|}

The first 128 code points (ASCII) need one byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all [[Latin-script alphabet]]s, and also [[International Phonetic Alphabet|IPA extensions]], [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic alphabet|Coptic]], [[Armenian alphabet|Armenian]], [[Hebrew alphabet|Hebrew]], [[Arabic alphabet|Arabic]], [[Syriac alphabet|Syriac]], [[Thaana]] and [[N'Ko script|N'Ko]] alphabets, as well as [[Combining Diacritical Marks]]. Three bytes are needed for the remaining 61,440 code points of the [[Basic Multilingual Plane]] (BMP), including most [[CJK characters|Chinese, Japanese and Korean characters]].
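These byte lengths can be checked with any conforming encoder; the following sketch uses Python's built-in UTF-8 codec, with illustrative characters chosen from each range:

```python
# One example character from each row of the conversion table,
# mapped to the number of bytes its UTF-8 encoding occupies.
examples = {
    "A": 1,    # U+0041, ASCII
    "é": 2,    # U+00E9, two-byte range
    "€": 3,    # U+20AC, rest of the BMP
    "😀": 4,   # U+1F600, outside the BMP
}
for ch, expected_len in examples.items():
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected_len
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```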
Four bytes are needed for the 1,048,576 non-BMP code points, which include [[emoji]], less common [[CJK characters]], and other useful characters.<ref name="problems_of_only_BMP">{{Cite web |last=Lunde |first=Ken |date=2022-01-09 |title=2022 Top Ten List: Why Support Beyond-BMP Code Points? |url=https://ken-lunde.medium.com/2022-top-ten-list-why-support-beyond-bmp-code-points-6a946d7735f9 |website=Medium |language=en|access-date=2024-01-07}}</ref> UTF-8 is a ''[[prefix code]]'': a decoder never needs to read past the last byte of a code point to decode it. Unlike many earlier multi-byte text encodings such as [[Shift-JIS]], it is ''[[Self-synchronizing code|self-synchronizing]]'', so searches for short strings or characters are possible, and the start of a code point can be found from a random position by backing up at most three bytes. The values chosen for the lead bytes mean that sorting a list of UTF-8 strings puts them in the same order as sorting [[UTF-32]] strings.

=== Overlong encodings ===
{{anchor|overlong encodings}}
Using a row in the above table to encode a code point less than "First code point" (thus using more bytes than necessary) is termed an ''overlong encoding''.
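For example, the two-byte sequence {{mono|C0 AF}} is an overlong encoding of the slash {{tt|U+002F}}, whose only valid encoding is the single byte {{mono|2F}}. A strict decoder such as Python's built-in codec rejects it (a sketch):

```python
# 0xC0 0xAF is an overlong two-byte encoding of "/" (U+002F).
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)  # CPython flags 0xC0 as an invalid start byte

# The shortest form is the only valid one:
assert "/".encode("utf-8") == b"\x2f"
```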
These are a security problem because they allow character sequences such as malicious JavaScript and <code>[[directory traversal attack|../]]</code> to bypass security validations, which has been reported in numerous high-profile products such as Microsoft's [[Internet Information Services|IIS]] web server<ref name=MS00-078>{{ cite report | first = Marvin |last = Marin | date = 2000-10-17 | title = Windows NT UNICODE vulnerability analysis | department = Web server folder traversal | id = MS00-078 | series = Malware FAQ | website=SANS Institute | url=https://www.sans.org/resources/malwarefaq/wnt-unicode.php | url-status=dead | archive-url=https://web.archive.org/web/20140827001204/http://www.sans.org/security-resources/malwarefaq/wnt-unicode.php | archive-date=Aug 27, 2014 }}</ref> and Apache's Tomcat servlet container.<ref name=CVE-2008-2938>{{ cite web | title = CVE-2008-2938 | year = 2008 | website = National Vulnerability Database (nvd.nist.gov) | publisher = U.S. [[National Institute of Standards and Technology]] | url = https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2008-2938 }}</ref> Overlong encodings should therefore be considered an error and never decoded. [[#Modified UTF-8|Modified UTF-8]] allows an overlong encoding of {{tt|U+0000}}.

=== Byte map ===
The chart below gives the detailed meaning of each byte in a stream encoded in UTF-8.

{{UTF-8 byte map}}

=== Error handling ===
Not all sequences of bytes are valid UTF-8.
A UTF-8 decoder should be prepared for:
* Bytes that never appear in UTF-8: {{tt|0xC0}}, {{tt|0xC1}}, {{tt|0xF5}}{{ndash}}{{tt|0xFF}}
* A "continuation byte" ({{tt|0x80}}{{ndash}}{{tt|0xBF}}) at the start of a character
* A non-continuation byte (or the string ending) before the end of a character
* An overlong encoding ({{tt|0xE0}} followed by less than {{tt|0xA0}}, or {{tt|0xF0}} followed by less than {{tt|0x90}})
* A 4-byte sequence that decodes to a value greater than {{tt|U+10FFFF}} ({{tt|0xF4}} followed by {{tt|0x90}} or greater)

Many of the first UTF-8 decoders would decode these, ignoring incorrect bits. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as {{mono|NUL}}, slash, or quotes, leading to security vulnerabilities. It is also common to throw an exception or truncate the string at an error,<ref>{{ cite web | title = DataInput | series = Java Platform SE 8 | website = docs.oracle.com | url = https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html | access-date = 2021-03-24 }}</ref> but this turns what would otherwise be harmless errors (i.e. "file not found") into a [[denial of service]]; for instance, early versions of Python 3.0 would exit immediately if the command line or [[environment variable]]s contained invalid UTF-8.<ref name=PEP383>{{ cite web | title = Non-decodable bytes in system character interfaces | date = 2009-04-22 | website = python.org | url = https://www.python.org/dev/peps/pep-0383/ | access-date = 2014-08-13 }}</ref> {{nobr|RFC 3629}} states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."<ref name="rfc3629">{{cite IETF |title=UTF-8, a transformation format of ISO 10646 |rfc=3629 |std=63 |last1=Yergeau |first1=F. |date=November 2003 |publisher=[[Internet Engineering Task Force|IETF]] |access-date=August 20, 2020}}</ref> ''The Unicode Standard'' requires decoders to: "... treat any ill-formed code unit sequence as an error condition.
This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."<!-- anyone have a copy of ISO/IEC 10646-1:2000 annex D for comparison? --> The standard now recommends replacing each error with the [[replacement character]] "�" ({{tt|U+FFFD}}) and continuing to decode. Some decoders consider the sequence {{mono|E1,A0,20}} (a truncated 3-byte code followed by a space) as a single error. This is not a good idea, as a search for a space character would find the one hidden in the error. Since Unicode 6 (October 2010),<ref>{{ cite report | title = Unicode 6.0.0 | date = October 2010 | website = unicode.org | url = https://www.unicode.org/versions/Unicode6.0.0/ }}</ref> the standard (chapter 3) has recommended a "best practice" where the error is either one continuation byte, or ends at the first byte that is disallowed, so {{mono|E1,A0,20}} is a two-byte error followed by a space. This means an error is no more than three bytes long and never contains the start of a valid character, and there are {{val|21952|fmt=commas}} different possible errors. Technically this makes UTF-8 no longer a [[prefix code]] (the decoder has to read one byte past some errors to figure out they are an error), but searching still works if the searched-for string does not contain any errors. Making each byte be an error, in which case {{mono|E1,A0,20}} is ''two'' errors followed by a space, also still allows searching for a valid string. This means there are only 128 different errors, which makes it practical to store the errors in the output string,<ref name=PEP383/> or replace them with characters from a legacy encoding. Only a small subset of possible byte strings are error-free UTF-8: several bytes cannot appear; a byte with the high bit set cannot be alone; and in a truly random string a byte with a high bit set has only a {{frac|1|15}} chance of starting a valid UTF-8 character.
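The recommended "best practice" can be seen in CPython's decoder, which replaces each maximal ill-formed subpart with one replacement character when asked to (a sketch):

```python
# A truncated three-byte sequence E1 A0 followed by a space (0x20).
# Under the Unicode best practice, E1 A0 is a single error, so the
# result is one U+FFFD followed by a space that a search can still find.
data = b"\xe1\xa0\x20"
decoded = data.decode("utf-8", errors="replace")
print(repr(decoded))      # '\ufffd '
assert decoded == "\ufffd "
assert " " in decoded     # the space is not hidden inside the error
```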
This has the consequence of making it easy to detect if a legacy text encoding is accidentally used instead of UTF-8, making conversion of a system to UTF-8 easier and avoiding the need to require a byte-order mark or any other metadata.

=== Surrogates ===
Since RFC 3629 (November 2003), the high and low surrogates used by [[UTF-16]] ({{tt|U+D800}} through {{tt|U+DFFF}}) are not legal Unicode values, and their UTF-8 encodings must be treated as an invalid byte sequence.<ref name="rfc3629"/> These encodings all start with {{tt|0xED}} followed by {{tt|0xA0}} or higher. This rule is often ignored, as surrogates are allowed in Windows filenames and this means there must be a way to store them in a string.<ref name="PEP 529">{{ cite web | title = Change Windows filesystem encoding to UTF-8 | id = PEP 529 | website = Python.org |language = en | url = https://www.python.org/dev/peps/pep-0529/ | access-date = 2022-05-10 }}</ref> UTF-8 that allows these surrogate halves has been (informally) called ''{{visible anchor|WTF-8}}'',<ref name="wtf-8">{{cite web | title = The WTF-8 encoding | url = https://simonsapin.github.io/wtf-8/}}</ref> while another variation that also encodes all non-BMP characters as two surrogates (6 bytes instead of 4) is called ''[[CESU-8]]''.

=== Byte-order mark ===
If the Unicode [[byte-order mark]] {{tt|U+FEFF}} is at the start of a UTF-8 file, the first three bytes will be {{mono|0xEF}}, {{mono|0xBB}}, {{mono|0xBF}}. The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file trans-coded from another encoding.<ref>{{citation | chapter-url = https://www.unicode.org/versions/Unicode15.0.0/ch02.pdf | title = The Unicode Standard – Version 15.0.0 | chapter = Chapter 2 | page = 39 }}</ref> While ASCII text encoded using UTF-8 is backward compatible with ASCII, this is not true when Unicode Standard recommendations are ignored and a BOM is added.
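Both behaviors can be illustrated with Python, whose strict codec rejects surrogates but offers a {{mono|surrogatepass}} error handler that produces the WTF-8-style bytes described above (a sketch):

```python
# The strict UTF-8 codec refuses surrogate code points (U+D800-U+DFFF).
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    print("strict codec refuses surrogates")

# "surrogatepass" encodes U+D800 anyway, yielding ED A0 80.
assert "\ud800".encode("utf-8", "surrogatepass") == b"\xed\xa0\x80"

# The UTF-8 form of the byte-order mark U+FEFF is EF BB BF.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
```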
A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8, e.g. programming languages that permit non-ASCII bytes in [[string literal]]s but not at the start of the file. Nevertheless, there was and still is software that always inserts a BOM when writing UTF-8, and refuses to correctly interpret UTF-8 unless the first character is a BOM (or the file only contains ASCII).<ref>{{Cite web |title=UTF-8 and Unicode FAQ for Unix/Linux |url=https://www.cl.cam.ac.uk/~mgk25/unicode.html}}</ref>
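When reading files produced by such BOM-inserting software, one workaround is a codec that tolerates and strips a leading BOM, such as Python's {{mono|utf-8-sig}} (a sketch):

```python
# "utf-8-sig" strips a leading BOM if present; plain "utf-8" keeps it
# as an ordinary character at the start of the decoded string.
with_bom = b"\xef\xbb\xbfhello"
assert with_bom.decode("utf-8-sig") == "hello"
assert with_bom.decode("utf-8") == "\ufeffhello"
```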