Character encodings in HTML
==Permitted encodings==
The [[WHATWG]] Encoding Standard, referenced by recent HTML standards (the current WHATWG HTML Living Standard, as well as the formerly competing [[W3C]] HTML 5.0 and 5.1), specifies a list of encodings which browsers must support. The HTML standards forbid support of other encodings.<ref name="html51">{{Cite web |url=https://www.w3.org/TR/html51/syntax.html#character-encodings |title=8.2.2.3. Character encodings |website=HTML 5.1 Standard |publisher=W3C}}</ref><ref name="html50">{{Cite web |url=https://www.w3.org/TR/html5/syntax.html#character-encodings |title=8.2.2.3. Character encodings |website=HTML 5 Standard |publisher=W3C}}</ref><ref name="html5living">{{Cite web |url=https://html.spec.whatwg.org/multipage/parsing.html#character-encodings |title=12.2.3.3 Character encodings |website=HTML Living Standard |publisher=WHATWG}}</ref> The Encoding Standard further stipulates that new formats, new protocols (even when existing formats are used) and authors of new documents are required to use [[UTF-8]] exclusively.<ref name="namesandlabels"/>

Besides UTF-8, the following encodings are explicitly listed in the HTML standard itself, with reference to the Encoding Standard:<ref name="html5living"/>
{{columns-list|colwidth=12em|
* [[ISO-8859-2]]
* [[ISO-8859-7]]
* [[ISO-8859-8]]
* [[Windows-874]]{{efn|Also specified for <code>[[TIS-620]]</code>, <code>[[ISO-8859-11]]</code> and related labels.<ref name="namesandlabels"/>}}
* [[Windows-1250]]
* [[Windows-1251]]
* [[Windows-1252]]{{efn|Also specified for <code>[[ASCII]]</code>, <code>[[ISO-8859-1]]</code> and related labels.<ref name="namesandlabels"/>}}
* [[Windows-1254]]{{efn|Also specified for <code>[[ISO-8859-9]]</code> and related labels.<ref name="namesandlabels"/>}}
* [[Windows-1255]]
* [[Windows-1256]]
* [[Windows-1257]]
* [[Windows-1258]]
* [[GB 18030]]{{efn|Specified with 0xA3A0 as a duplicate encoding of the [[ideographic space]] (U+3000) for compatibility reasons, and as such excluding
U+E5E5 (a private use character).<ref name="gbenc"/><ref name="gbindex"/> Also specified with 0x80 accepted as an alternative encoding of the [[euro sign]] (U+20AC; see [[Windows-936]]).<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-decoder |title=10.2.1. gb18030 decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Otherwise, follows the mappings from the 2005 standard.<ref name="gbindex">{{cite web |url=https://encoding.spec.whatwg.org/#index-gb18030 |title=5. Indexes (§ index gb18030) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Big5]]{{efn|[[Hong Kong Supplementary Character Set]] variant,<ref name="encoding_rs"/> although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-big5-pointer |title=5. Indexes (§ index Big5 pointer) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Shift JIS]]{{efn|The specification includes [[IBM]] and [[NEC]] extensions,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0208 |title=5. Indexes (§ Index jis0208) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> and is more precisely [[Windows-31J]].<ref name="encoding_rs">{{cite web |url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming |title=Notable Differences from IANA Naming |work=Crate encoding_rs |publisher=docs.rs |author=Mozilla Foundation |author-link=Mozilla Foundation}}</ref>}}
* [[ISO-2022-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions.
[[Half-width kana]] is converted to fullwidth by the encoder,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-iso-2022-jp-katakana |title=5. Indexes (§ Index ISO-2022-JP katakana) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.<ref name="whatwgjisdecoder">{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-decoder |title=12.2.1. ISO-2022-JP decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> [[Shift Out]] and [[Shift In]] (0x0E and 0x0F) are excluded entirely to prevent attacks.<ref name="whatwgjisdecoder" /><ref>{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-encoder |title=12.2.2. ISO-2022-JP encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-KR]]{{efn|Actually [[Unified Hangul Code]] (Windows-949), a superset that covers the entire [[Hangul Syllables (Unicode block)|Hangul Syllables]] block.<ref name="encoding_rs"/><ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-euc-kr |title=5. Indexes (§ index EUC-KR) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16BE]]{{efn|Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc">{{cite web |url=https://encoding.spec.whatwg.org/#output-encodings |title=4.3. Output encodings |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16LE]]{{efn|For compatibility with deployed content, also specified for the plain <code>[[UTF-16]]</code> label,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#utf-16le |title=14.4.
UTF-16LE |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> although a [[byte order mark]] (BOM), if present, takes priority over any label.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#decode |title=6. Hooks for standards (§ decode) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc" />}}
* x-user-defined{{efn|Maps 0x00 through 0x7F to U+0000 through U+007F, and 0x80 through 0xFF to U+F780 through U+F7FF (a [[Private Use Area]] range), such that the low 8 bits of the code point always match the original byte.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#x-user-defined |title=14.5. x-user-defined |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
}}
{{notelist}}

The following additional encodings are listed in the Encoding Standard, and support for them is therefore also required:<ref name="namesandlabels">{{cite web |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren |title=4.2: Names and labels |url=https://encoding.spec.whatwg.org/#names-and-labels}}</ref>
{{columns-list|colwidth=12em|
* [[Code page 866]]
* [[ISO-8859-3]]
* [[ISO-8859-4]]
* [[ISO-8859-5]]
* [[ISO-8859-6]]
* [[ISO-8859-8-I|ISO-8859-8-{{serif|I}}]]{{efn|Uses the same encoder and decoder as ISO-8859-8, but is not subject to the visual-order behaviour which is used for documents labelled as ISO-8859-8.<ref>{{cite web |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren |title=9.
Legacy single-byte encodings (§ Note) |url=https://encoding.spec.whatwg.org/#ref-for-iso-8859-8%E2%91%A0}}</ref>}}
* [[ISO-8859-10]]
* [[ISO-8859-13]]
* [[ISO-8859-14]]
* [[ISO-8859-15]]
* [[ISO-8859-16]]
* [[KOI8-R]]
* [[KOI8-U]] / [[KOI8-RU]]{{efn|Titled KOI8-U and specified for both <code>KOI8-U</code> and <code>KOI8-RU</code> labels;<ref name="namesandlabels"/> follows [[KOI8-RU]] in positions 0xAE and 0xBE (i.e. includes [[Ў|Ў/ў]])<ref name="whatwg-koi8u">{{cite web |url=https://encoding.spec.whatwg.org/koi8-u.html |title=index KOI8-U visualization |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref><ref>{{cite web |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=17053 |title=Bug 17053: Support KOI8-RU mapping for KOI8-U |date=2015-08-19 |work=[[W3C]] Bugzilla}}</ref> but KOI8-U in positions 0x93–9F.<ref name="whatwg-koi8u"/>}}
* [[Mac OS Roman]]
* [[Windows-1253]]
* [[Mac OS Cyrillic encoding|Mac OS Cyrillic]]
* [[GBK (character encoding)|GBK]]{{efn|Also specified for <code>[[GB 2312|GB2312]]</code> and related labels. Handled the same as {{nowrap|GB 18030}} for decoding purposes.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gbk |title=10.1. GBK |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> For encoding purposes, labelling as GBK (or {{nowrap|GB 2312}}) excludes four-byte codes, and favours the one-byte 0x80 representation for U+20AC.<ref name="gbenc">{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-encoder |title=10.2.2. gb18030 encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach of the EUC code set 1), i.e. includes NEC extensions.
[[JIS X 0212]] is included for decoding only.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0212 |title=5. Indexes (§ Index jis0212) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
}}
{{notelist}}

The following encodings are listed as explicit examples of forbidden encodings:<ref name="html5living"/>
{{columns-list|colwidth=12em|
* [[CESU-8]]
* [[UTF-7]]
* [[Binary Ordered Compression for Unicode|BOCU-1]]
* [[Standard Compression Scheme for Unicode|SCSU]]
* [[EBCDIC]]
* [[UTF-32]]
}}

The standard also defines a "replacement" decoder, which maps all content labelled as certain encodings to the [[replacement character]] (�), refusing to process it at all. This is intended to prevent attacks (e.g. [[cross site scripting|cross-site scripting]]) which may exploit a difference between the client and server in what encodings are supported in order to mask malicious content.<ref>{{cite web |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren |url=https://encoding.spec.whatwg.org/#replacement |title=14.1: replacement}}</ref> Although the same security concern applies to [[ISO-2022-JP]] and [[UTF-16]], which also allow sequences of ASCII bytes to be interpreted differently, this approach was not seen as feasible for them because they are comparatively more common in deployed content.<ref>{{cite web |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren |url=https://encoding.spec.whatwg.org/#security-background |title=2: Security background}}</ref> The following encodings receive this treatment:<ref>{{cite web |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren |title=4.2: Names and labels (§ replacement) |url=https://encoding.spec.whatwg.org/#ref-for-replacement%E2%91%A1}}</ref>
{{columns-list|colwidth=12em|
* [[ISO-2022-KR]]
* [[ISO-2022-CN]]
* [[ISO-2022-CN|ISO-2022-CN-EXT]]
* [[HZ-GB-2312]]
}}
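
Two behaviours described in this section can be illustrated with a minimal Python sketch: resolving a label to the encoding it is specified for, and the x-user-defined byte mapping. The function names <code>resolve_label</code> and <code>decode_x_user_defined</code> and the table <code>LABEL_TO_ENCODING</code> are hypothetical, chosen for illustration only. The table holds just the handful of labels mentioned above, not the full list in the Encoding Standard, and the code is not taken from any browser implementation.

```python
from typing import Optional

# Illustrative subset of labels named in this section (assumption: the
# real table in the Encoding Standard is far larger).
LABEL_TO_ENCODING = {
    "utf-8": "UTF-8",
    "ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
    "tis-620": "windows-874",
    "iso-8859-11": "windows-874",
    "iso-8859-9": "windows-1254",
    "gb2312": "GBK",
    "koi8-ru": "KOI8-U",
    "utf-16": "UTF-16LE",
    # Labels routed to the "replacement" decoder.
    "iso-2022-kr": "replacement",
    "iso-2022-cn": "replacement",
    "iso-2022-cn-ext": "replacement",
    "hz-gb-2312": "replacement",
}

ASCII_WHITESPACE = "\t\n\x0c\r "


def resolve_label(label: str) -> Optional[str]:
    """Return the encoding name for a label, or None if it is not in the table.

    Leading and trailing ASCII whitespace is ignored and matching is
    case-insensitive for the ASCII labels used here.
    """
    return LABEL_TO_ENCODING.get(label.strip(ASCII_WHITESPACE).lower())


def decode_x_user_defined(data: bytes) -> str:
    """Decode bytes using the x-user-defined mapping.

    0x00-0x7F map to U+0000-U+007F; 0x80-0xFF map to U+F780-U+F7FF,
    so the low 8 bits of each code point equal the original byte.
    """
    return "".join(
        chr(byte) if byte < 0x80 else chr(0xF780 + byte - 0x80)
        for byte in data
    )


if __name__ == "__main__":
    print(resolve_label(" ISO-8859-1 "))
    print(resolve_label("HZ-GB-2312"))
    print([hex(ord(ch)) for ch in decode_x_user_defined(b"A\x80\xff")])
```

Because the private-use offset is 0xF700 for the upper half, the original byte can be recovered from any decoded character with <code>ord(ch) & 0xFF</code>.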