=== Error handling ===
Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:
* Bytes that never appear in UTF-8: {{tt|0xC0}}, {{tt|0xC1}}, {{tt|0xF5}}{{ndash}}{{tt|0xFF}}
* A "continuation byte" ({{tt|0x80}}{{ndash}}{{tt|0xBF}}) at the start of a character
* A non-continuation byte (or the string ending) before the end of a character
* An overlong encoding ({{tt|0xE0}} followed by a byte less than {{tt|0xA0}}, or {{tt|0xF0}} followed by a byte less than {{tt|0x90}})
* A 4-byte sequence that decodes to a value greater than {{tt|U+10FFFF}} ({{tt|0xF4}} followed by {{tt|0x90}} or greater)

Many of the first UTF-8 decoders would decode these, ignoring incorrect bits. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as {{mono|NUL}}, slash, or quotes, leading to security vulnerabilities. It is also common to throw an exception or truncate the string at an error,<ref>{{ cite web | title = DataInput | series = Java Platform SE 8 | website = docs.oracle.com | url = https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html | access-date = 2021-03-24 }}</ref> but this turns what would otherwise be harmless errors (e.g. "file not found") into a [[denial of service]]. For instance, early versions of Python 3.0 would exit immediately if the command line or [[environment variable]]s contained invalid UTF-8.<ref name=PEP383>{{ cite web | title = Non-decodable bytes in system character interfaces | date = 2009-04-22 | website = python.org | url = https://www.python.org/dev/peps/pep-0383/ | access-date = 2014-08-13 }}</ref>

{{nobr|RFC 3629}} states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."<ref name="rfc3629">{{cite IETF |title=UTF-8, a transformation format of ISO 10646 |rfc=3629 |std=63 |last1=Yergeau |first1=F. |date=November 2003 |publisher=[[Internet Engineering Task Force|IETF]] |access-date=August 20, 2020}}</ref> ''The Unicode Standard'' requires decoders to: "...
treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."<!-- anyone have a copy of ISO/IEC 10646-1:2000 annex D for comparison? --> The standard now recommends replacing each error with the [[replacement character]] "οΏ½" ({{tt|U+FFFD}}) and continue decoding. Some decoders consider the sequence {{mono|E1,A0,20}} (a truncated 3-byte code followed by a space) as a single error. This is not a good idea as a search for a space character would find the one hidden in the error. Since Unicode 6 (October 2010)<ref>{{ cite report | title = Unicode 6.0.0 | date = October 2010 | website = unicode.org | url = https://www.unicode.org/versions/Unicode6.0.0/ }}</ref> the standard (chapter 3) has recommended a "best practice" where the error is either one continuation byte, or ends at the first byte that is disallowed, so {{mono|E1,A0,20}} is a two-byte error followed by a space. This means an error is no more than three bytes long and never contains the start of a valid character, and there are {{val|21952|fmt=commas}} different possible errors. Technically this makes UTF-8 no longer a [[prefix code]] (the decoder has to read one byte past some errors to figure out they are an error), but searching still works if the searched-for string does not contain any errors. Making each byte be an error, in which case {{mono|E1,A0,20}} is ''two'' errors followed by a space, also still allows searching for a valid string. This means there are only 128 different errors which makes it practical to store the errors in the output string,<ref name="pep383"/> or replace them with characters from a legacy encoding. Only a small subset of possible byte strings are error-free UTF-8: several bytes cannot appear; a byte with the high bit set cannot be alone; and in a truly random string a byte with a high bit set has only a {{frac|1|15}} chance of starting a valid UTF-8 character. 
As a consequence, it is easy to detect whether a legacy text encoding has accidentally been used instead of UTF-8, which makes converting a system to UTF-8 easier and avoids the need to require a [[byte order mark]] or any other metadata.
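
The error-handling strategies described in this section can be illustrated with a short sketch. It assumes CPython 3 and its built-in <code>utf-8</code> codec, and uses the standard error handlers <code>strict</code>, <code>replace</code> and <code>surrogateescape</code> (the last is the mechanism introduced by PEP 383). The helper <code>is_valid_utf8</code> and the sample byte strings are hypothetical names chosen for this example. Other decoders may group the bytes of an error differently.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 with no errors (hypothetical helper)."""
    try:
        data.decode("utf-8")  # errors="strict" is the default: raise on any error
    except UnicodeDecodeError:
        return False
    return True


# One sample for each kind of invalid input listed in this section.
invalid_samples = [
    b"\xc0\xaf",          # 0xC0 never appears in UTF-8
    b"\x80",              # continuation byte at the start of a character
    b"\xe1\xa0\x20",      # non-continuation byte before the end of a character
    b"\xe0\x80\xaf",      # overlong: 0xE0 followed by a byte less than 0xA0
    b"\xf4\x90\x80\x80",  # above U+10FFFF: 0xF4 followed by 0x90 or greater
]
all_rejected = not any(is_valid_utf8(sample) for sample in invalid_samples)

# A truncated 3-byte code followed by a space (E1,A0,20).
truncated = b"\xe1\xa0\x20"

# "replace": the two-byte error becomes one U+FFFD and the space survives.
replaced = truncated.decode("utf-8", errors="replace")

# "surrogateescape": each invalid byte is stored separately in the output
# string as a lone surrogate (U+DC80..U+DCFF), so the bytes can be recovered.
escaped = truncated.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

# Text in a legacy encoding (Latin-1 here) is very unlikely to be valid UTF-8.
legacy = "caf\u00e9".encode("latin-1")  # b"caf\xe9"

if __name__ == "__main__":
    print(all_rejected)            # True
    print(repr(replaced))          # '\ufffd '
    print(repr(escaped))           # '\udce1\udca0 '
    print(round_trip == truncated)  # True
    print(is_valid_utf8(legacy))   # False
```

In this sketch the <code>replace</code> result keeps the space searchable, while <code>surrogateescape</code> shows how treating each byte as its own error lets the original bytes be restored exactly.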