== Description ==

UTF-8 encodes code points in one to four bytes, depending on the value of the code point. In the following table, the characters {{mono|u}} to {{mono|z}} are replaced by the bits of the code point, from the positions {{mono|U+uvwxyz}}:

{| class="wikitable"
|+ Code point ↔ UTF-8 conversion
|-
! First code point
! Last code point
! Byte 1
! Byte 2
! Byte 3
! Byte 4
|-
| style="text-align: right" | {{tt|U+0000}}
| style="text-align: right" | {{tt|U+007F}}
| {{mono|0yyyzzzz}}
| style="background: darkgray" colspan=3 |
|-
| style="text-align: right" | {{tt|U+0080}}
| style="text-align: right" | {{tt|U+07FF}}
| {{mono|110xxxyy}}
| {{mono|10yyzzzz}}
| style="background: darkgray" colspan=2 |
|-
| style="text-align: right" | {{tt|U+0800}}
| style="text-align: right" | {{tt|U+FFFF}}
| {{mono|1110wwww}}
| {{mono|10xxxxyy}}
| {{mono|10yyzzzz}}
| style="background: darkgray" |
|-
| style="text-align: right" | {{tt|U+010000}}
| style="text-align: right" | {{tt|U+10FFFF}}
| {{mono|11110uvv}}
| {{mono|10vvwwww}}
| {{mono|10xxxxyy}}
| {{mono|10yyzzzz}}
|}

The first 128 code points (ASCII) need one byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all [[Latin-script alphabet]]s, and also [[International Phonetic Alphabet|IPA extensions]], [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic alphabet|Coptic]], [[Armenian alphabet|Armenian]], [[Hebrew alphabet|Hebrew]], [[Arabic alphabet|Arabic]], [[Syriac alphabet|Syriac]], [[Thaana]] and [[N'Ko script|N'Ko]] alphabets, as well as [[Combining Diacritical Marks]]. Three bytes are needed for the remaining 61,440 code points of the [[Basic Multilingual Plane]] (BMP), including most [[CJK characters|Chinese, Japanese and Korean characters]].
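These byte lengths can be checked with any conforming encoder; the following sketch uses Python's built-in UTF-8 codec, with illustrative characters chosen from each range:

```python
# One example character from each row of the conversion table,
# mapped to the number of bytes its UTF-8 encoding occupies.
examples = {
    "A": 1,    # U+0041, ASCII
    "é": 2,    # U+00E9, two-byte range
    "€": 3,    # U+20AC, rest of the BMP
    "😀": 4,   # U+1F600, outside the BMP
}
for ch, expected_len in examples.items():
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected_len
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```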
Four bytes are needed for the 1,048,576 non-BMP code points, which include [[emoji]], less common [[CJK characters]], and other useful characters.<ref name="problems_of_only_BMP">{{Cite web |last=Lunde |first=Ken |date=2022-01-09 |title=2022 Top Ten List: Why Support Beyond-BMP Code Points? |url=https://ken-lunde.medium.com/2022-top-ten-list-why-support-beyond-bmp-code-points-6a946d7735f9 |website=Medium |language=en|access-date=2024-01-07}}</ref> UTF-8 is a ''[[prefix code]]'': a decoder never needs to read past the last byte of a code point to decode it. Unlike many earlier multi-byte text encodings such as [[Shift-JIS]], it is ''[[Self-synchronizing code|self-synchronizing]]'', so searches for short strings or characters are possible, and the start of a code point can be found from a random position by backing up at most three bytes. The values chosen for the lead bytes mean that sorting a list of UTF-8 strings puts them in the same order as sorting [[UTF-32]] strings.

=== Overlong encodings ===
{{anchor|overlong encodings}}
Using a row in the above table to encode a code point less than "First code point" (thus using more bytes than necessary) is termed an ''overlong encoding''.
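For example, the two-byte sequence {{mono|C0 AF}} is an overlong encoding of the slash {{tt|U+002F}}, whose only valid encoding is the single byte {{mono|2F}}. A strict decoder such as Python's built-in codec rejects it (a sketch):

```python
# 0xC0 0xAF is an overlong two-byte encoding of "/" (U+002F).
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)  # CPython flags 0xC0 as an invalid start byte

# The shortest form is the only valid one:
assert "/".encode("utf-8") == b"\x2f"
```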
These are a security problem because they allow character sequences such as malicious JavaScript and <code>[[directory traversal attack|../]]</code> to bypass security validations, which has been reported in numerous high-profile products such as Microsoft's [[Internet Information Services|IIS]] web server<ref name=MS00-078>{{ cite report | first = Marvin |last = Marin | date = 2000-10-17 | title = Windows NT UNICODE vulnerability analysis | department = Web server folder traversal | id = MS00-078 | series = Malware FAQ | website=SANS Institute | url=https://www.sans.org/resources/malwarefaq/wnt-unicode.php | url-status=dead | archive-url=https://web.archive.org/web/20140827001204/http://www.sans.org/security-resources/malwarefaq/wnt-unicode.php | archive-date=Aug 27, 2014 }}</ref> and Apache's Tomcat servlet container.<ref name=CVE-2008-2938>{{ cite web | title = CVE-2008-2938 | year = 2008 | website = National Vulnerability Database (nvd.nist.gov) | publisher = U.S. [[National Institute of Standards and Technology]] | url = https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2008-2938 }}</ref> Overlong encodings should therefore be considered an error and never decoded. [[#Modified UTF-8|Modified UTF-8]] allows an overlong encoding of {{tt|U+0000}}.

=== Byte map ===
The chart below gives the detailed meaning of each byte in a stream encoded in UTF-8.

{{UTF-8 byte map}}

=== Error handling ===
Not all sequences of bytes are valid UTF-8.
A UTF-8 decoder should be prepared for:
* Bytes that never appear in UTF-8: {{tt|0xC0}}, {{tt|0xC1}}, {{tt|0xF5}}{{ndash}}{{tt|0xFF}}
* A "continuation byte" ({{tt|0x80}}{{ndash}}{{tt|0xBF}}) at the start of a character
* A non-continuation byte (or the string ending) before the end of a character
* An overlong encoding ({{tt|0xE0}} followed by less than {{tt|0xA0}}, or {{tt|0xF0}} followed by less than {{tt|0x90}})
* A 4-byte sequence that decodes to a value greater than {{tt|U+10FFFF}} ({{tt|0xF4}} followed by {{tt|0x90}} or greater)

Many of the first UTF-8 decoders would decode these, ignoring incorrect bits. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as {{mono|NUL}}, slash, or quotes, leading to security vulnerabilities. It is also common to throw an exception or truncate the string at an error,<ref>{{ cite web | title = DataInput | series = Java Platform SE 8 | website = docs.oracle.com | url = https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html | access-date = 2021-03-24 }}</ref> but this turns what would otherwise be harmless errors (i.e. "file not found") into a [[denial of service]]; for instance, early versions of Python 3.0 would exit immediately if the command line or [[environment variable]]s contained invalid UTF-8.<ref name=PEP383>{{ cite web | title = Non-decodable bytes in system character interfaces | date = 2009-04-22 | website = python.org | url = https://www.python.org/dev/peps/pep-0383/ | access-date = 2014-08-13 }}</ref> {{nobr|RFC 3629}} states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."<ref name="rfc3629">{{cite IETF |title=UTF-8, a transformation format of ISO 10646 |rfc=3629 |std=63 |last1=Yergeau |first1=F. |date=November 2003 |publisher=[[Internet Engineering Task Force|IETF]] |access-date=August 20, 2020}}</ref> ''The Unicode Standard'' requires decoders to: "... treat any ill-formed code unit sequence as an error condition.
This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."<!-- anyone have a copy of ISO/IEC 10646-1:2000 annex D for comparison? --> The standard now recommends replacing each error with the [[replacement character]] "�" ({{tt|U+FFFD}}) and continuing to decode. Some decoders consider the sequence {{mono|E1,A0,20}} (a truncated 3-byte code followed by a space) as a single error. This is not a good idea, as a search for a space character would find the one hidden in the error. Since Unicode 6 (October 2010),<ref>{{ cite report | title = Unicode 6.0.0 | date = October 2010 | website = unicode.org | url = https://www.unicode.org/versions/Unicode6.0.0/ }}</ref> the standard (chapter 3) has recommended a "best practice" where the error is either one continuation byte, or ends at the first byte that is disallowed, so {{mono|E1,A0,20}} is a two-byte error followed by a space. This means an error is no more than three bytes long and never contains the start of a valid character, and there are {{val|21952|fmt=commas}} different possible errors. Technically this makes UTF-8 no longer a [[prefix code]] (the decoder has to read one byte past some errors to figure out they are an error), but searching still works if the searched-for string does not contain any errors. Making each byte be an error, in which case {{mono|E1,A0,20}} is ''two'' errors followed by a space, also still allows searching for a valid string. This means there are only 128 different errors, which makes it practical to store the errors in the output string,<ref name=PEP383/> or replace them with characters from a legacy encoding. Only a small subset of possible byte strings are error-free UTF-8: several bytes cannot appear; a byte with the high bit set cannot be alone; and in a truly random string a byte with a high bit set has only a {{frac|1|15}} chance of starting a valid UTF-8 character.
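The recommended "best practice" can be seen in CPython's decoder, which replaces each maximal ill-formed subpart with one replacement character when asked to (a sketch):

```python
# A truncated three-byte sequence E1 A0 followed by a space (0x20).
# Under the Unicode best practice, E1 A0 is a single error, so the
# result is one U+FFFD followed by a space that a search can still find.
data = b"\xe1\xa0\x20"
decoded = data.decode("utf-8", errors="replace")
print(repr(decoded))      # '\ufffd '
assert decoded == "\ufffd "
assert " " in decoded     # the space is not hidden inside the error
```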
This has the consequence of making it easy to detect if a legacy text encoding is accidentally used instead of UTF-8, making conversion of a system to UTF-8 easier and avoiding the need to require a byte-order mark or any other metadata.

=== Surrogates ===
Since RFC 3629 (November 2003), the high and low surrogates used by [[UTF-16]] ({{tt|U+D800}} through {{tt|U+DFFF}}) are not legal Unicode values, and their UTF-8 encodings must be treated as an invalid byte sequence.<ref name="rfc3629"/> These encodings all start with {{tt|0xED}} followed by {{tt|0xA0}} or higher. This rule is often ignored, as surrogates are allowed in Windows filenames and this means there must be a way to store them in a string.<ref name="PEP 529">{{ cite web | title = Change Windows filesystem encoding to UTF-8 | id = PEP 529 | website = Python.org |language = en | url = https://www.python.org/dev/peps/pep-0529/ | access-date = 2022-05-10 }}</ref> UTF-8 that allows these surrogate halves has been (informally) called ''{{visible anchor|WTF-8}}'',<ref name="wtf-8">{{cite web | title = The WTF-8 encoding | url = https://simonsapin.github.io/wtf-8/}}</ref> while another variation that also encodes all non-BMP characters as two surrogates (6 bytes instead of 4) is called ''[[CESU-8]]''.

=== Byte-order mark ===
If the Unicode [[byte-order mark]] {{tt|U+FEFF}} is at the start of a UTF-8 file, the first three bytes will be {{mono|0xEF}}, {{mono|0xBB}}, {{mono|0xBF}}. The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file trans-coded from another encoding.<ref>{{citation | chapter-url = https://www.unicode.org/versions/Unicode15.0.0/ch02.pdf | title = The Unicode Standard – Version 15.0.0 | chapter = Chapter 2 | page = 39 }}</ref> While ASCII text encoded using UTF-8 is backward compatible with ASCII, this is not true when Unicode Standard recommendations are ignored and a BOM is added.
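Both behaviors can be illustrated with Python, whose strict codec rejects surrogates but offers a {{mono|surrogatepass}} error handler that produces the WTF-8-style bytes described above (a sketch):

```python
# The strict UTF-8 codec refuses surrogate code points (U+D800-U+DFFF).
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    print("strict codec refuses surrogates")

# "surrogatepass" encodes U+D800 anyway, yielding ED A0 80.
assert "\ud800".encode("utf-8", "surrogatepass") == b"\xed\xa0\x80"

# The UTF-8 form of the byte-order mark U+FEFF is EF BB BF.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
```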
A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8, e.g. programming languages that permit non-ASCII bytes in [[string literal]]s but not at the start of the file. Nevertheless, there was and still is software that always inserts a BOM when writing UTF-8, and refuses to correctly interpret UTF-8 unless the first character is a BOM (or the file only contains ASCII).<ref>{{Cite web |title=UTF-8 and Unicode FAQ for Unix/Linux |url=https://www.cl.cam.ac.uk/~mgk25/unicode.html}}</ref>
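When reading files produced by such BOM-inserting software, one workaround is a codec that tolerates and strips a leading BOM, such as Python's {{mono|utf-8-sig}} (a sketch):

```python
# "utf-8-sig" strips a leading BOM if present; plain "utf-8" keeps it
# as an ordinary character at the start of the decoded string.
with_bom = b"\xef\xbb\xbfhello"
assert with_bom.decode("utf-8-sig") == "hello"
assert with_bom.decode("utf-8") == "\ufeffhello"
```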