
==HTML document characters==
Web pages are typically [[HTML]] or [[XHTML]] documents. Both types of documents consist, at a fundamental level, of [[character (computing)|character]]s, which are [[grapheme]]s and grapheme-like units, independent of how they manifest in [[computer storage]] systems and [[computer network|network]]s.

An HTML document is a sequence of Unicode characters. More specifically, HTML 4.0 documents are required to consist of characters in the HTML ''document character set'': a character repertoire wherein each character is assigned a unique, non-negative integer ''code point''. This set is defined in the HTML 4.0 [[Document Type Definition|DTD]], which also establishes the syntax (allowable sequences of characters) that can produce a valid HTML document. The HTML document character set for HTML 4.0 consists of most, but not all, of the characters jointly defined by [[Unicode]] and ISO/IEC 10646: the [[Universal Character Set]] (UCS).

Like an HTML document, an XHTML document is a sequence of Unicode characters. However, an XHTML document is an [[XML]] document, which, while not having an explicit "document character" layer of [[abstraction]], nevertheless relies upon a similar definition of permissible characters that covers most, but not all, of the Unicode/UCS character definitions. The sets used by HTML and XHTML/XML are slightly different, but these differences have little effect on the average document author.

Regardless of whether the document is HTML or XHTML, when stored on a [[file system]] or transmitted over a network, the document's characters are ''encoded'' as a sequence of [[bit]] [[octet (computing)|octet]]s (''[[byte]]s'') according to a particular character encoding. This encoding may either be a [[Unicode Transformation Format]], like [[UTF-8]], that can directly encode any Unicode character, or a legacy encoding, like [[Windows-1252]], that cannot. However, even when using encodings that do not support all Unicode characters, the encoded document may make use of [[numeric character references]]. For example, <code>&amp;#x263A;</code> (☺) is used to indicate a smiling face character in the Unicode character set.
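
The fallback described above can be illustrated with a minimal Python sketch. It encodes the same short string once as UTF-8 and once as Windows-1252, letting Python's standard <code>xmlcharrefreplace</code> error handler substitute a decimal numeric character reference for the character the legacy encoding lacks. The helper name <code>encode_for_legacy</code> and the sample string are assumptions made for illustration only.

```python
def encode_for_legacy(text: str, encoding: str = "cp1252") -> bytes:
    """Encode text for a legacy encoding (hypothetical helper).

    Characters the encoding cannot represent are replaced with
    decimal numeric character references such as &#9786;.
    """
    return text.encode(encoding, errors="xmlcharrefreplace")


sample = "caf\u00e9 \u263a"  # "café" followed by U+263A (smiling face)

# UTF-8 can encode every character directly.
utf8_bytes = sample.encode("utf-8")

# Windows-1252 has a byte for U+00E9 but none for U+263A,
# so only the smiling face becomes a character reference.
legacy_bytes = encode_for_legacy(sample)
```

Here the legacy output stays pure Windows-1252 while still identifying U+263A unambiguously to any HTML parser.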

===Character encoding===
In order to support all Unicode characters without resorting to numeric character references, a web page must use an encoding that covers all of Unicode. The most popular is [[UTF-8]], in which the [[ASCII]] characters (English letters, digits, and some other common characters) are encoded as the same single bytes as in ASCII. As a result, HTML markup (such as &lt;br> and &lt;/div>) is byte-for-byte identical to its ASCII form. Characters outside the ASCII range are stored in 2–4 bytes. It is also possible to use [[UTF-16]], in which most characters are stored as two bytes (and the rest as four) with varying [[endianness]]; it is supported by modern browsers but less commonly used.
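
The byte counts mentioned above can be checked with a short Python sketch. The sample characters and the helper name <code>utf8_length</code> are chosen here purely for illustration.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One ASCII letter, one Latin-1 letter, one CJK ideograph, one emoji.
samples = ["A", "\u00e9", "\u5408", "\U0001F600"]
utf8_lengths = [utf8_length(ch) for ch in samples]

# ASCII-only markup yields identical bytes in ASCII and UTF-8.
ascii_preserved = "<br>".encode("utf-8") == "<br>".encode("ascii")

# UTF-16 stores U+5408 in two bytes; the byte order depends on endianness.
utf16_big_endian = "\u5408".encode("utf-16-be")
utf16_little_endian = "\u5408".encode("utf-16-le")

# Characters beyond U+FFFF need four bytes (a surrogate pair) in UTF-16.
utf16_emoji_length = len("\U0001F600".encode("utf-16-le"))
```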

===Numeric character references===
{{Main|Numeric character reference}}

In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a [[numeric character reference]]: a sequence of characters that explicitly spells out the Unicode code point of the character being represented. A character reference takes the form '''<code>&amp;#</code>'''<var>N</var>'''<code>;</code>''', where <var>N</var> is either a [[decimal]] number for the Unicode code point, or a [[hexadecimal]] number, in which case it must be prefixed by <code>x</code>. The characters that compose the numeric character reference are representable in every encoding approved for use on the Internet.{{citation needed|date=June 2022}}

The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers{{snd}} but they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example <code>&amp;#21512;</code> instead of <code>&amp;#x5408;</code>).{{citation needed|date=June 2022}}
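
The two forms of reference, and the hexadecimal-to-decimal conversion mentioned above, can be sketched in Python as follows. The function names are hypothetical; <code>html.unescape</code> is the standard-library routine for resolving references and is used here only to confirm that both forms denote the same character.

```python
import html
import re


def to_decimal_ref(ch: str) -> str:
    """Build the decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def to_hex_ref(ch: str) -> str:
    """Build the hexadecimal form, which carries an 'x' prefix."""
    return f"&#x{ord(ch):X};"


# Matches hexadecimal references such as &#x5408;
_HEX_REF = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def hex_refs_to_decimal(markup: str) -> str:
    """Rewrite every hexadecimal reference in markup as a decimal one."""
    return _HEX_REF.sub(lambda m: f"&#{int(m.group(1), 16)};", markup)


decimal_ref = to_decimal_ref("\u5408")
hex_ref = to_hex_ref("\u5408")
converted = hex_refs_to_decimal("&#x5408; &#x263A;")

# Both forms resolve to the same character, U+5408.
decoded = (html.unescape(decimal_ref), html.unescape(hex_ref))
```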

===Named character entities===
{{Main|character entity reference}}

In HTML 4, there is a standard set of 252 named ''character entities'' for characters (some common, some obscure) that are either not found in certain character encodings or are markup-sensitive in some contexts (for example angle brackets and quotation marks). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers.

Character entities can be included in an HTML document via the use of ''entity references'', which take the form '''<code>&amp;</code>'''<var>EntityName</var>'''<code>;</code>''', where <var>EntityName</var> is the name of the entity. For example, <code>&amp;mdash;</code>, much like <code>&amp;#8212;</code> or <code>&amp;#x2014;</code>, represents {{U+|2014}}, the [[em dash]] character "&mdash;", even if the character encoding used does not contain that character.
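
As an illustration, the equivalence of the three forms in the example above can be checked with Python's standard library. In this sketch, <code>html.entities.name2codepoint</code> supplies the HTML 4 entity names, while <code>html.unescape</code> resolves references using the larger HTML5 table, which includes the HTML 4 names; the variable names are illustrative only.

```python
import html
from html.entities import name2codepoint

# Look up the code point assigned to the HTML 4 entity name "mdash".
code_point = name2codepoint["mdash"]

# The named, decimal, and hexadecimal forms of the same reference.
forms = ["&mdash;", "&#8212;", "&#x2014;"]
resolved = [html.unescape(form) for form in forms]

# True when every form yields the single character U+2014.
all_equivalent = all(ch == "\u2014" for ch in resolved)
```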

For the full list, see: [[List of XML and HTML character entity references]].