
=== Error handling ===
Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:

* Bytes that never appear in UTF-8: {{tt|0xC0}}, {{tt|0xC1}}, {{tt|0xF5}}{{ndash}}{{tt|0xFF}}
* A "continuation byte" ({{tt|0x80}}{{ndash}}{{tt|0xBF}}) at the start of a character
* A non-continuation byte (or the string ending) before the end of a character
* An overlong encoding ({{tt|0xE0}} followed by less than {{tt|0xA0}}, or {{tt|0xF0}} followed by less than {{tt|0x90}})
* A 4-byte sequence that decodes to a value greater than {{tt|U+10FFFF}} ({{tt|0xF4}} followed by {{tt|0x90}} or greater)
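
The checks in the list above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function name <code>first_utf8_error</code> and the wording of the reasons are invented for this example, and the check for UTF-16 surrogates ({{tt|0xED}} followed by {{tt|0xA0}} or greater) is an addition that the list above does not mention but that {{nobr|RFC 3629}} also forbids.

```python
def first_utf8_error(data: bytes):
    """Return (offset, reason) for the first ill-formed sequence, or None.

    Hypothetical helper written for illustration only.
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII byte, always valid
            i += 1
            continue
        if b in (0xC0, 0xC1) or b >= 0xF5:
            return (i, "byte never appears in UTF-8")
        if b < 0xC0:                      # 0x80-0xBF where a lead byte is expected
            return (i, "unexpected continuation byte")
        # Lead byte: number of continuation bytes, and the range allowed
        # for the first of them.
        lo, hi = 0x80, 0xBF
        if b < 0xE0:
            need = 1
        elif b < 0xF0:
            need = 2
            if b == 0xE0:
                lo = 0xA0                 # anything lower would be overlong
            if b == 0xED:
                hi = 0x9F                 # anything higher is a surrogate (assumption, see lead-in)
        else:
            need = 3
            if b == 0xF0:
                lo = 0x90                 # anything lower would be overlong
            if b == 0xF4:
                hi = 0x8F                 # anything higher exceeds U+10FFFF
        for k in range(1, need + 1):
            if i + k >= n:
                return (i, "string ends before the end of a character")
            c = data[i + k]
            if not 0x80 <= c <= 0xBF:
                return (i, "non-continuation byte before the end of a character")
            if k == 1 and c < lo:
                return (i, "overlong encoding")
            if k == 1 and c > hi:
                reason = "UTF-16 surrogate" if b == 0xED else "value greater than U+10FFFF"
                return (i, reason)
        i += need + 1
    return None
```

For example, <code>first_utf8_error(b"\xf4\x90\x80\x80")</code> reports an error at offset 0 because the sequence would decode to a value above {{tt|U+10FFFF}}, while any correctly encoded string yields <code>None</code>.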

Many early UTF-8 decoders would decode such sequences anyway, ignoring the incorrect bits. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as {{mono|NUL}}, slash, or quotes, leading to security vulnerabilities. It is also common to throw an exception or truncate the string at an error,<ref>{{ cite web | title = DataInput | series = Java Platform SE 8 | website = docs.oracle.com | url = https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html | access-date = 2021-03-24 }}</ref> but this turns what would otherwise be harmless errors (e.g. "file not found") into a [[denial of service]]. For instance, early versions of Python 3.0 would exit immediately if the command line or [[environment variable]]s contained invalid UTF-8.<ref name=PEP383>{{ cite web | title = Non-decodable bytes in system character interfaces | date = 2009-04-22 | website = python.org | url = https://www.python.org/dev/peps/pep-0383/ | access-date = 2014-08-13 }}</ref>

{{nobr|RFC 3629}} states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."<ref name="rfc3629">{{cite IETF |title=UTF-8, a transformation format of ISO 10646 |rfc=3629 |std=63 |last1=Yergeau |first1=F. |date=November 2003 |publisher=[[Internet Engineering Task Force|IETF]] |access-date=August 20, 2020}}</ref> ''The Unicode Standard'' requires decoders to: "...&nbsp;treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."<!-- anyone have a copy of ISO/IEC 10646-1:2000 annex D for comparison?  --> The standard now recommends replacing each error with the [[replacement character]] "�" ({{tt|U+FFFD}}) and continuing to decode.
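
As an illustration of the two policies, the sketch below uses Python's built-in UTF-8 codec: the default strict mode raises an exception at the first error, while <code>errors="replace"</code> substitutes {{tt|U+FFFD}} and carries on. The sample bytes are made up for this example.

```python
# "café" encoded correctly, followed by a stray 0xFF byte (never valid in UTF-8).
data = b"caf\xc3\xa9 \xff ok"

# Strict decoding (the default) stops at the first error with an exception.
try:
    data.decode("utf-8")
    error_offset = None
except UnicodeDecodeError as exc:
    error_offset = exc.start          # byte offset of the offending sequence

# Replacement decoding keeps going and marks the error with U+FFFD.
replaced = data.decode("utf-8", errors="replace")

print(error_offset)                   # offset of the 0xFF byte
print(replaced)
```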

Some decoders treat the sequence {{mono|E1,A0,20}} (a truncated 3-byte code followed by a space) as a single error. This is not a good idea, as a search for a space character would find the one hidden in the error. Since Unicode&nbsp;6 (October&nbsp;2010)<ref>{{ cite report |  title = Unicode 6.0.0 |  date = October 2010 |  website = unicode.org |  url = https://www.unicode.org/versions/Unicode6.0.0/ }}</ref> the standard (chapter&nbsp;3) has recommended a "best practice" where the error is either one continuation byte, or ends at the first byte that is disallowed, so {{mono|E1,A0,20}} is a two-byte error followed by a space. This means an error is no more than three bytes long and never contains the start of a valid character, and there are {{val|21952|fmt=commas}}&nbsp;different possible errors. Technically this makes UTF-8 no longer a [[prefix code]] (the decoder has to read one byte past some errors to determine that they are errors), but searching still works if the searched-for string does not contain any errors.
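
The "best practice" error boundaries can be sketched as follows. This is a simplified illustration rather than the standard's own algorithm text: the names <code>error_spans</code> and <code>_second_byte_range</code> are hypothetical, and the allowed ranges for the byte after the lead byte are assumed to follow the well-formed byte sequence table in chapter&nbsp;3 of the standard.

```python
def _second_byte_range(lead):
    """Allowed range for the byte directly after a multi-byte lead byte."""
    lo, hi = 0x80, 0xBF
    if lead == 0xE0:
        lo = 0xA0        # lower values would be overlong
    elif lead == 0xED:
        hi = 0x9F        # higher values would be surrogates
    elif lead == 0xF0:
        lo = 0x90        # lower values would be overlong
    elif lead == 0xF4:
        hi = 0x8F        # higher values would exceed U+10FFFF
    return lo, hi


def error_spans(data: bytes):
    """Yield (start, end) byte offsets of each error in `data`."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                     # ASCII
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF4:
            need = 3
        else:
            # Lone continuation byte or a byte that never appears:
            # a one-byte error.
            yield (i, i + 1)
            i += 1
            continue
        lo, hi = _second_byte_range(b)
        j = i + 1
        # Accept continuation bytes until the character is complete
        # or the first disallowed byte is reached.
        while j < n and j - i <= need:
            if not lo <= data[j] <= hi:
                break
            lo, hi = 0x80, 0xBF          # later bytes use the plain range
            j += 1
        if j - i != need + 1:
            yield (i, j)                 # error ends before the disallowed byte
        i = j
```

With this sketch, <code>list(error_spans(b"\xe1\xa0 "))</code> gives <code>[(0, 2)]</code>: a two-byte error that leaves the following space intact.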

Treating each byte as a separate error, in which case {{mono|E1,A0,20}} is ''two'' errors followed by a space, also allows searching for a valid string. There are then only 128 different errors, which makes it practical to store the errors in the output string,<ref name="PEP383"/> or to replace them with characters from a legacy encoding.
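
Python's <code>surrogateescape</code> error handler, introduced by PEP&nbsp;383, is one example of storing byte-level errors in the output string: each undecodable byte is mapped to a lone surrogate in the range {{tt|U+DC80}}{{ndash}}{{tt|U+DCFF}}, so the original bytes can be recovered. A minimal sketch, with sample bytes chosen to match the example above:

```python
raw = b"\xe1\xa0 "                                  # truncated 3-byte code, then a space

# Each invalid byte becomes its own code point: 0xE1 -> U+DCE1, 0xA0 -> U+DCA0.
text = raw.decode("utf-8", errors="surrogateescape")

# The space is still an ordinary character and can be found by searching.
space_index = text.find(" ")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

print([hex(ord(ch)) for ch in text])
print(space_index, round_trip == raw)
```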

Only a small subset of possible byte strings are error-free UTF-8: several bytes cannot appear; a byte with the high bit set cannot be alone; and in a truly random string a byte with the high bit set has only a {{frac|1|15}} chance of starting a valid UTF-8 character. This makes it easy to detect when a legacy text encoding has accidentally been used instead of UTF-8, which makes converting a system to UTF-8 easier and avoids the need to require a byte order mark or any other metadata.
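
A common way to exploit this property is to attempt a strict UTF-8 decode first and fall back to a legacy encoding only on failure. The sketch below assumes the legacy encoding is Latin-1, which is chosen purely for illustration (it accepts every byte, so the fallback cannot fail); the function name <code>decode_guessing</code> is hypothetical.

```python
def decode_guessing(data: bytes, legacy: str = "latin-1"):
    """Return (text, encoding_used), preferring strict UTF-8.

    `legacy` is an assumed fallback; a real system would use whatever
    legacy encoding its data actually came from.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Legacy-encoded text with non-ASCII characters is very unlikely
        # to be valid UTF-8 by accident, so the failure is a strong signal.
        return data.decode(legacy), legacy


print(decode_guessing("café".encode("utf-8")))
print(decode_guessing("café".encode("latin-1")))
```

Pure ASCII input decodes identically under both encodings, so the guess only matters when bytes with the high bit set are present.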