Editing
Unicode and HTML
(section)
==Character encoding determination==
In order to correctly process HTML, a web browser must ascertain which Unicode characters are represented by the encoded form of an HTML document. To do this, the web browser must know what encoding was used.

===Encoding information===
When a document is transmitted via a [[MIME]] message or a transport that uses MIME content types such as an [[HTTP]] response, the message may signal the encoding via a Content-Type header, such as <code>Content-Type: text/html; charset=UTF-8</code>. Other external means of declaring encoding are permitted but rarely used. If the document uses a [[Comparison of Unicode encodings|Unicode encoding]], the encoding information might also be present in the form of a [[byte order mark]] (BOM). Finally, the encoding can be declared via the HTML syntax. For the <code>text/html</code> serialisation, as long as the page is encoded in an extension of [[ASCII]] (such as [[UTF-8]], but not [[UTF-16]]), a <code>meta</code> element, like <code><meta http-equiv="content-type" content="text/html; charset=UTF-8"></code> or (starting with [[HTML5]]) <code><meta charset="UTF-8"></code>, can be used. For HTML pages serialized as XML, the declaration options are either to rely on the encoding default (which for XML documents is UTF-8) or to use an XML encoding declaration. The <code>meta</code> element plays no role in HTML served as XML.

===Encoding defaults===
An encoding default applies when there is no external or internal encoding declaration and also no byte order mark. While the encoding default for HTML pages served as XML is required to be UTF-8, the encoding default for a regular Web page (that is, for HTML pages serialized as <code>text/html</code>) varies depending on the localization of the browser. For a system set up mainly for Western European languages, it will generally be [[ISO 8859-1#Windows-1252|Windows-1252]].
For Cyrillic alphabet locales, the default is typically [[Windows-1251]]. For a browser from a location where ''legacy'' multi-byte character encodings are prevalent, some form of auto-detection is likely to be applied.

===Encoding trends===
Because of the legacy of 8-bit text representations in [[programming language]]s and [[operating system]]s and the desire to avoid burdening users with the need to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. Misunderstandings, such as the belief that the encoding declaration effects a change in the actual encoding (whereas it is actually just a label that could be inaccurate), are also a reason for this editor attitude. Another contributing factor is the arrival of UTF-8, which greatly diminishes the need for other encodings; modern editors thus tend to default to UTF-8, as recommended by the HTML5 specification.<ref>{{Cite web|url=http://www.w3.org/TR/html5/semantics.html#charset|title=HTML5|author=Ian Hickson|access-date=17 September 2011|year=2011|quote=Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629] Authoring tools should default to using UTF-8 for newly created documents. [RFC3629] }}</ref>

===Byte order mark/Unicode sniffing===
For both serializations of HTML (content-type "text/html" and content-type "application/xhtml+xml"), the byte order mark (BOM) is an effective way to transmit encoding information within an HTML document. For UTF-8, the BOM is optional, while it is required for the UTF-16 and UTF-32 encodings.
(Note: UTF-16 and UTF-32 without the BOM are formally known under different names; they are different encodings and thus need some form of encoding declaration. See [[UTF-16BE]], [[UTF-16LE]], [[UTF-32LE]] and [[UTF-32BE]].) The use of the BOM character (U+FEFF) means that the encoding automatically declares itself to any processing application. Processing applications need only look for an initial 0x0000FEFF, 0xFEFF or 0xEFBBBF in the byte stream to identify the document as UTF-32, UTF-16 or UTF-8 encoded respectively. No additional metadata mechanisms are required for these encodings, since the byte order mark includes all of the information necessary for processing applications. In most circumstances, editing applications handle the byte order mark character separately from the other characters, so there is little risk of an author removing or otherwise changing the byte order mark to indicate the wrong encoding (as can happen when the encoding is declared in English/Latin script). If the document lacks a byte order mark, the fact that the first non-blank printable character in an HTML document is supposed to be "<" (U+003C) can be used to determine a UTF-8/UTF-16/UTF-32 encoding.

===Encoding overriding===
Many HTML documents are served with inaccurate encoding information, or no encoding information at all. In order to determine the encoding in such cases, many browsers allow the user to manually select an encoding name from a list. They may also employ an encoding auto-detection algorithm that works in concert '''with''' or{{snd}} ''in the case of the BOM and in the case of HTML served as XML''{{snd}} '''against''' the manual override. For HTML documents which are <code>text/html</code> serialized, manual override may apply to all documents, or only to those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns.
The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. Note, however, that Internet Explorer, Chrome and Safari{{snd}} for both XML and <code>text/html</code> serializations{{snd}} do not permit the encoding to be overridden whenever the page includes the BOM.<ref>{{Cite web |title=12897 – In some parsers, UTF-8 BOM trumps the HTTP charset attribute (Encoding sniffing algorithm) |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 |access-date=2023-03-09 |website=www.w3.org}}</ref>

For HTML documents serialized with the preferred XML label{{snd}} <code>application/xhtml+xml</code>, manual encoding override is not permitted. To override the encoding of such an XML document would mean that the document stopped being XML, as it is a fatal error for XML documents to have an encoding declaration with detectable errors. Currently, Gecko browsers such as Firefox abide by this rule, whereas most of the other common browsers that support HTML as XML, such as WebKit browsers (Chrome/Safari),<ref>{{Cite web |title=66189 – XML parser doesn't emit FATAL ERROR for all, detectable encoding errors |url=https://bugs.webkit.org/show_bug.cgi?id=66189 |access-date=2023-03-09 |website=bugs.webkit.org}}</ref> do allow the encoding of XHTML documents to be manually overridden.
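
The sources of encoding information described in this section (byte order mark, Content-Type header, <code>meta</code> element, locale default) can be combined into a rough determination routine for <code>text/html</code> documents. The following Python sketch is illustrative only: the function names are hypothetical, the precedence order (BOM, then HTTP header, then <code>meta</code>, then default) is an assumption of the sketch, the <code>meta</code> scan is a simple regular expression rather than a real HTML prescan, and Windows-1252 is assumed as the fallback for a Western European locale. Manual override and auto-detection are not modelled.

```python
import codecs
import re

# BOM byte patterns mapped to the encoding family they announce.
# UTF-32 is listed before UTF-16 because the little-endian UTF-32 BOM
# (FF FE 00 00) begins with the little-endian UTF-16 BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32"),
    (codecs.BOM_UTF32_LE, "UTF-32"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16"),
    (codecs.BOM_UTF16_LE, "UTF-16"),
]

# Simplified patterns (an assumption of this sketch, not a full parser).
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)
_META_RE = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def sniff_bom(data):
    """Return the encoding family announced by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def charset_from_content_type(header):
    """Extract the charset parameter from a Content-Type header value."""
    if not header:
        return None
    match = _CHARSET_RE.search(header)
    return match.group(1) if match else None


def charset_from_meta(data):
    """Look for a meta charset declaration in the first 1024 bytes."""
    match = _META_RE.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def guess_encoding(data, content_type=None, default="windows-1252"):
    """Try each source of encoding information in turn (assumed order)."""
    return (
        sniff_bom(data)
        or charset_from_content_type(content_type)
        or charset_from_meta(data)
        or default
    )
```

For example, <code>guess_encoding(b"\xef\xbb\xbf<p>", "text/html; charset=windows-1251")</code> reports UTF-8 because the BOM is checked first, while a document with neither a BOM nor any declaration falls through to the assumed default.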