== Characters and escaping ==
XML documents consist entirely of characters from the [[Unicode]] repertoire. Except for a small number of specifically excluded [[control characters]], any character defined by Unicode may appear within the content of an XML document.

XML includes facilities for identifying the ''encoding'' of the Unicode characters that make up the document, and for expressing characters that, for one reason or another, cannot be used directly.

=== Valid characters ===
{{Main|Valid characters in XML}}
Unicode code points in the following ranges are valid in XML 1.0 documents:{{sfnp|Bray|Paoli|Sperberg-McQueen|Maler|2008|loc=section 2.2}}
* U+0009 (Horizontal Tab), U+000A (Line Feed), U+000D (Carriage Return): these are the only [[C0 and C1 control codes|C0]] controls accepted in XML 1.0;
* U+0020–U+D7FF, U+E000–U+FFFD: this excludes the surrogates (U+D800–U+DFFF) and the noncharacters U+FFFE and U+FFFF, but not the other noncharacters in the [[Basic Multilingual Plane|BMP]];
* U+10000–U+10FFFF: this includes all code points in supplementary planes, including noncharacters.
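
The ranges above can be checked mechanically. The following is a minimal sketch in Python; the function names <code>is_xml10_char</code> and <code>first_invalid_xml10_char</code> are illustrative and not part of any standard library.
<syntaxhighlight lang="python">
from typing import Optional


def is_xml10_char(code_point: int) -> bool:
    """Return True if the code point lies in one of the XML 1.0 ranges listed above."""
    return (
        code_point in (0x9, 0xA, 0xD)             # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF           # BMP below the surrogates
        or 0xE000 <= code_point <= 0xFFFD         # BMP above the surrogates
        or 0x10000 <= code_point <= 0x10FFFF      # supplementary planes
    )


def first_invalid_xml10_char(text: str) -> Optional[int]:
    """Return the index of the first character not allowed in XML 1.0, or None."""
    for index, char in enumerate(text):
        if not is_xml10_char(ord(char)):
            return index
    return None
</syntaxhighlight>
A check of this kind could be run over text before it is serialized, because a character outside these ranges cannot be made legal in XML 1.0 by escaping it.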

XML 1.1 extends the set of allowed characters to include all the above, plus the remaining characters in the range U+0001–U+001F.{{sfnp|Bray|Paoli|Sperberg-McQueen|Maler|2006|loc=section 2.2}} At the same time, however, it restricts the use of C0 and [[C0 and C1 control codes|C1]] control characters other than U+0009 (Horizontal Tab), U+000A (Line Feed), U+000D (Carriage Return), and U+0085 (Next Line) by requiring them to be written in escaped form (for example U+0001 must be written as <code>&amp;#x01;</code> or its equivalent). In the case of C1 characters, this restriction is a backwards incompatibility; it was introduced to allow common encoding errors to be detected.

The code point [[U+0000]] (Null) is the only character that is not permitted in any XML 1.0 or 1.1 document.
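
The XML 1.1 rules can be sketched as a three-way classification. The example below is illustrative only: the function name and the labels <code>"forbidden"</code>, <code>"escaped-only"</code> and <code>"literal"</code> are invented for this sketch, and the ranges follow the <code>Char</code> and <code>RestrictedChar</code> productions in section 2.2 of the XML 1.1 specification, which also place U+007F among the characters that must be escaped.
<syntaxhighlight lang="python">
def classify_xml11(code_point: int) -> str:
    """Classify a code point under XML 1.1 (labels are specific to this sketch)."""
    is_char = (
        0x1 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )
    if not is_char:
        # U+0000, the surrogates, U+FFFE and U+FFFF cannot appear in any form.
        return "forbidden"
    is_restricted = (
        0x1 <= code_point <= 0x8
        or code_point in (0xB, 0xC)
        or 0xE <= code_point <= 0x1F
        or 0x7F <= code_point <= 0x84
        or 0x86 <= code_point <= 0x9F
    )
    # Restricted characters are legal only as character references such as &#x01;.
    return "escaped-only" if is_restricted else "literal"
</syntaxhighlight>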

=== Encoding detection ===
The Unicode character set can be encoded into [[byte]]s for storage or transmission in a variety of different ways, called "encodings". Unicode itself defines encodings that cover the entire repertoire; well-known ones include [[UTF-8]] (which the XML standard recommends using, without a [[byte order mark|BOM]]) and [[UTF-16]].<ref>{{cite web|last=Bray|first=T.|url=http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF|title=Characters vs. Bytes|website=Tbray.org |date=April 26, 2003 |access-date=16 November 2017}}</ref> There are many other text encodings that predate Unicode, such as [[ASCII]] and the various parts of [[ISO/IEC 8859]]; their character repertoires are in every case subsets of the Unicode character set.

XML allows the use of any of the Unicode-defined encodings and any other encodings whose characters also appear in Unicode. XML also provides a mechanism whereby an XML processor can reliably, without any prior knowledge, determine which encoding is being used.{{sfnp|Bray|Paoli|Sperberg-McQueen|Maler|2008|loc=appendix F}} Encodings other than UTF-8 and UTF-16 are not necessarily recognized by every XML parser, and some parsers do not recognize even UTF-16, although the standard requires every processor to support it.
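
The detection mechanism works because a document normally begins either with a byte order mark or with the characters <code>&lt;?xml</code>, whose byte pattern reveals the encoding family before the encoding declaration is read. The following Python sketch is a simplified illustration of that idea, not a full implementation of appendix F: it omits [[EBCDIC]] and the unusual byte orders, reports UCS-4 under Python's <code>utf-32</code> codec names, and uses the made-up function name <code>sniff_xml_encoding</code>.
<syntaxhighlight lang="python">
import re

# Byte order marks; the four-byte marks come first so that they are not
# mistaken for the two-byte UTF-16 marks they begin with.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Byte patterns of a leading "<" or "<?" when no byte order mark is present.
_PREFIXES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]

# Encoding declaration inside an ASCII-compatible XML declaration.
_DECLARED = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for prefix, name in _PREFIXES:
        if data.startswith(prefix):
            return name
    match = _DECLARED.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # With no mark and no declaration, the document is read as UTF-8.
    return "utf-8"
</syntaxhighlight>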

=== Escaping ===
XML provides ''[[Escape sequence|escape]]'' facilities for including characters that are problematic to include directly. For example:
* The characters "&lt;" and "&" are key syntax markers and may never appear in content outside a [[CDATA]] section. It is allowed, but not recommended, to use "&lt;" in XML entity values.{{sfnp|Bray|Paoli|Sperberg-McQueen|Maler|2008|loc=section 2.3}}
* Some character encodings support only a subset of Unicode. For example, it is legal to encode an XML document in ASCII, but ASCII lacks code points for Unicode characters such as "é".
* It might not be possible to type the character on the author's machine.
* Some characters have [[homoglyph|glyphs]] that cannot be visually distinguished from other characters, such as the [[nonbreaking space]] (<code>&amp;#xa0;</code>) " " and the [[Space (punctuation)|space]] (<code>&amp;#x20;</code>) " ", and the [[А|Cyrillic capital letter A]] (<code>&amp;#x410;</code>) "А" and the [[A|Latin capital letter A]] (<code>&amp;#x41;</code>) "A".

There are five [[List of XML and HTML character entity references#Predefined entities in XML|predefined entities]]:
* <code>&amp;lt;</code> represents "&lt;";
* <code>&amp;gt;</code> represents "&gt;";
* <code>&amp;amp;</code> represents "&";
* <code>&amp;apos;</code> represents "{{mono|'}}";
* <code>&amp;quot;</code> represents '{{mono|"}}'.
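
As an illustration, Python's standard <code>xml.sax.saxutils</code> module escapes "&", "&lt;" and "&gt;" by default and accepts a mapping for further replacements. The sketch below uses that mapping to cover all five predefined entities; the helper names <code>escape_all</code> and <code>unescape_all</code> are invented for the example.
<syntaxhighlight lang="python">
from xml.sax.saxutils import escape, unescape


def escape_all(text: str) -> str:
    """Replace each of the five characters that has a predefined entity."""
    # escape() handles &, < and > itself; the mapping adds the two quote marks.
    return escape(text, {"'": "&apos;", '"': "&quot;"})


def unescape_all(text: str) -> str:
    """Reverse escape_all()."""
    # unescape() handles &amp;, &lt; and &gt; itself, replacing &amp; last.
    return unescape(text, {"&apos;": "'", "&quot;": '"'})
</syntaxhighlight>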

All permitted Unicode characters may be represented with a ''[[numeric character reference]]''. Consider the Chinese character "中", whose numeric code in Unicode is hexadecimal 4E2D, or decimal 20,013. A user whose keyboard offers no method for entering this character could still write it in an XML document as either <code>&amp;#20013;</code> or <code>&amp;#x4e2d;</code>. Similarly, the string "I &lt;3 Jörg" could be encoded for inclusion in an XML document as <code>I &amp;lt;3 J&amp;#xF6;rg</code>.
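
The equivalence of the decimal and hexadecimal forms can be checked with any conforming parser. The Python sketch below is one way to do so; the helper names and the throwaway wrapper element <code>t</code> are assumptions made for the example. It also shows how text could be reduced to pure ASCII with the standard <code>xmlcharrefreplace</code> error handler, which emits decimal references.
<syntaxhighlight lang="python">
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ascii_xml_text(text: str) -> str:
    """Escape markup characters, then replace non-ASCII characters with references."""
    return escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")


def parse_text(fragment: str) -> str:
    """Parse escaped character data by wrapping it in a throwaway element."""
    return ET.fromstring(f"<t>{fragment}</t>").text


print(to_ascii_xml_text("I <3 Jörg"))      # I &lt;3 J&#246;rg
print(parse_text("I &lt;3 J&#xF6;rg"))     # I <3 Jörg
print(parse_text("&#20013;") == parse_text("&#x4e2d;"))  # True
</syntaxhighlight>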

<code>&amp;#0;</code> is not permitted because the [[null character]] is one of the control characters excluded from XML, even when using a numeric character reference.<ref>{{cite web|first1=Tex|last1=Texin|first2=François|last2=Yergeau|date=6 September 2003|url=http://www.w3.org/International/questions/qa-controls|title=W3C I18N FAQ: HTML, XHTML, XML and Control Codes|website=W3C Internationalization|publisher=W3C|access-date=16 November 2017}}</ref> An alternative encoding mechanism such as [[Base64]] is needed to represent such characters.
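
A minimal Python sketch of both points follows, assuming a hypothetical <code>data</code> element as the container: the standard-library parser rejects the reference <code>&amp;#0;</code>, while the same byte survives a round trip when the payload is carried as Base64 text.
<syntaxhighlight lang="python">
import base64
import xml.etree.ElementTree as ET


def nul_reference_is_rejected() -> bool:
    """Return True if the parser refuses a character reference to U+0000."""
    try:
        ET.fromstring("<data>&#0;</data>")
    except ET.ParseError:
        return True
    return False


def wrap_binary(payload: bytes) -> str:
    """Carry arbitrary bytes as Base64 text inside a <data> element."""
    # The Base64 alphabet contains no markup characters, so no escaping is needed.
    encoded = base64.b64encode(payload).decode("ascii")
    return f"<data>{encoded}</data>"


def unwrap_binary(document: str) -> bytes:
    """Recover the original bytes from a document produced by wrap_binary()."""
    return base64.b64decode(ET.fromstring(document).text or "")
</syntaxhighlight>
Both sides of the exchange have to agree on this convention, because XML itself attaches no meaning to the Base64 text.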

=== Comments ===
Comments may appear anywhere in a document outside other markup, but not before the XML declaration. Comments begin with <code>&lt;!--</code> and end with <code>--&gt;</code>. For compatibility with [[SGML]], the string "--" (double-hyphen) is not allowed inside comments;{{sfnp|Bray|Paoli|Sperberg-McQueen|Maler|2008|loc=section 2.5}} this means comments cannot be nested. The ampersand has no special significance within comments, so entity and character references are not recognized as such, and a comment cannot represent characters outside the character set of the document encoding.

An example of a valid comment:
<code>&lt;!--no need to escape &lt;code&gt; &amp; such in comments--&gt;</code>
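
These rules can be observed with Python's standard-library parsers. The sketch below is illustrative and its helper names are invented: it treats a parse error as "not well-formed" and reads the text of a comment back to show that references inside it are left unexpanded.
<syntaxhighlight lang="python">
import xml.etree.ElementTree as ET
from xml.dom import minidom


def is_well_formed(document: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


def first_comment_text(document: str) -> str:
    """Return the text of the first comment directly inside the root element."""
    root = minidom.parseString(document).documentElement
    for node in root.childNodes:
        if node.nodeType == node.COMMENT_NODE:
            return node.data
    raise ValueError("no comment found")


print(is_well_formed("<a><!-- fine --></a>"))                      # True
print(is_well_formed("<a><!-- a -- b --></a>"))                    # False
print(is_well_formed('<!-- early --><?xml version="1.0"?><a/>'))   # False
print(first_comment_text("<a><!--&amp;--></a>"))                   # &amp;
</syntaxhighlight>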

=== International use ===
{{Contains special characters|Armenian|example}}

XML 1.0 (Fifth Edition) and XML 1.1 support the direct use of almost any [[Unicode]] character in element names, attributes, comments, character data, and processing instructions (other than the ones that have special symbolic meaning in XML itself, such as the less-than sign, "<"). The following is a well-formed XML document including [[Chinese character|Chinese]], [[Armenian alphabet|Armenian]] and [[Cyrillic]] characters:
<syntaxhighlight lang="xml">
<?xml version="1.0" encoding="UTF-8"?>
<俄语 լեզու="ռուսերեն">данные</俄语>
</syntaxhighlight>
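
As a rough check, the document above can be parsed with Python's standard <code>xml.etree.ElementTree</code> module, which returns the non-Latin element name, attribute name and content unchanged. The document is passed as UTF-8 bytes so that the parser applies the declared encoding itself; the variable names are illustrative.
<syntaxhighlight lang="python">
import xml.etree.ElementTree as ET

# The example document, encoded as the UTF-8 bytes its declaration promises.
DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<俄语 լեզու="ռուսերեն">данные</俄语>'
).encode("utf-8")

root = ET.fromstring(DOCUMENT)
print(root.tag)     # 俄语
print(root.attrib)  # {'լեզու': 'ռուսերեն'}
print(root.text)    # данные
</syntaxhighlight>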