Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Character encodings in HTML
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
==Specifying the document's character encoding== There are two general ways to specify which character encoding is used in the document. First, the [[web server]] can include the character encoding or "<code>charset</code>" in the [[Hypertext Transfer Protocol]] (HTTP) <code>Content-Type</code> header, which would typically look like this:<ref>{{citation |chapter-url=http://tools.ietf.org/html/rfc7231#section-3.1.1.5|chapter=Content-Type |title=Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content|publisher=[[IETF]] |date=June 2014 |doi=10.17487/RFC7231 |access-date=2014-07-30|editor-last1=Fielding |editor-last2=Reschke |editor-first1=R |editor-first2=J |last1=Fielding |first1=R. |last2=Reschke |first2=J. |s2cid=14399078 }}</ref> <syntaxhighlight lang="http"> Content-Type: text/html; charset=utf-8 </syntaxhighlight> This method gives the HTTP server a convenient way to alter document's encoding according to [[content negotiation]]; certain HTTP server software can do it, for example Apache with the [[List of Apache modules|module]] <code>mod_charset_lite</code>.<ref>{{cite web| url = http://httpd.apache.org/docs/2.0/en/mod/mod_charset_lite.html| title = Apache Module mod_charset_lite}}</ref> Second, a declaration can be included within the document itself. For HTML it is possible to include this information inside the <code>head</code> element near the top of the document:<ref name=html5charset/> <!-- Please don't add a closing "/": that is incorrect here. --> <syntaxhighlight lang="html"> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </syntaxhighlight> [[HTML5]] also allows the following syntax to mean exactly the same:<ref name=html5charset>{{citation |chapter-url=http://www.w3.org/TR/html5/document-metadata.html#specifying-the-documents-character-encoding |chapter=Specifying the document's character encoding |title=HTML5 |publisher=[[World Wide Web Consortium]] |date=14 December 2017 |access-date=2018-05-28}}</ref> <!-- Please don't add a closing "/": that is unnecessary here. --> <syntaxhighlight lang="html"> <meta charset="utf-8"> </syntaxhighlight> [[XHTML]] documents have a third option: to express the character encoding via [[XML]] declaration, as follows:<ref>{{citation |chapter-url=http://www.w3.org/TR/REC-xml/#sec-prolog-dtd |chapter=Prolog and Document Type Declaration |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref> <syntaxhighlight lang="xml"> <?xml version="1.0" encoding="utf-8"?> </syntaxhighlight> With this second approach, because the character encoding cannot be known until the declaration is parsed, there is a problem knowing which character encoding is used in the document up to and including the declaration itself. If the character encoding is an [[ASCII extension]] then the content up to and including the declaration itself should be pure ASCII and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as [[UTF-16BE]] and [[UTF-16LE]], a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics. ===Encoding detection algorithm=== As of HTML5 the recommended charset is [[UTF-8]].<ref name=html5charset/> An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including: # Explicit user instruction # An explicit meta tag within the first 1024 bytes of the document # A [[byte order mark]] (BOM) within the first three bytes of the document # The HTTP Content-Type or other transport layer information # Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>{{cite web| url = http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding| title = HTML5 prescan a byte stream to determine its encoding}}</ref> and other tentative detection mechanisms. Characters outside of the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for [[English language|English]]-speaking users, but other languages regularly—in some cases, always—require characters outside that range. In Chinese, Japanese, and Korean ([[CJK characters|CJK]]) language environments where there are several different multi-byte encodings in use, auto-detection is also often employed. Finally, browsers usually permit the user to override ''incorrect'' charset label manually as well. It is increasingly common for multilingual websites and websites in non-Western languages to use [[UTF-8]], which allows use of the same encoding for all languages. [[UTF-16]] or [[UTF-32]], which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents. Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)