== Adoption ==
{{See also|UTF-8#Implementations and adoption}}
{{Wikibooks|Unicode/Versions}}

Unicode, in the form of [[UTF-8]], has been the most common encoding for the [[World Wide Web]] since 2008.<ref name="markdavis">{{Cite web
 | last          = Davis
 | first         = Mark
 | author-link   = Mark Davis (Unicode)
 | title         = Moving to Unicode 5.1
 | url           = https://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html
 | date          = 2008-05-05
 | website       = Official Google Blog
 | access-date   = 2025-04-12
 | archive-url   =  https://web.archive.org/web/20250401104941/https://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html
 | archive-date  = 2025-04-01
 | url-status    = live
}}</ref> It has near-universal adoption, and much of the non-UTF-8 content is found in other Unicode encodings, e.g. [[UTF-16]]. {{As of|2024}}, UTF-8 accounts for on average 98.3% of all web pages (and 983 of the top 1,000 highest-ranked web pages).<ref name="W3TechsWebEncoding">{{Cite web
 | title         = Usage Survey of Character Encodings broken down by Ranking
 | url           = https://w3techs.com/technologies/cross/character_encoding/ranking
 | date          = 
 | website       = W3Techs
 | access-date   = 2025-04-12
 | language      = en
}}</ref> Although many pages use only [[ASCII]] characters to display content, UTF-8 was designed with 7-bit ASCII as a subset, and almost no websites now declare their encoding as ASCII only rather than UTF-8.<ref>{{Cite web
 | title         = Usage statistics of US-ASCII for websites
 | url           = https://w3techs.com/technologies/details/en-usascii
 | access-date   = 1 November 2020
 | website       = W3Techs
}}</ref> Over a third of the languages tracked have 100% UTF-8 use.
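
The subset relationship between ASCII and UTF-8 mentioned above can be illustrated with a minimal Python sketch. The helper name <code>is_ascii_only</code> is hypothetical and chosen only for this illustration.

```python
# Pure-ASCII text yields byte-for-byte identical output under both codecs,
# which is why an ASCII-only page is also a valid UTF-8 page.
text = "Hello, world!"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")


def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte lies in the 7-bit ASCII range (0x00-0x7F)."""
    return all(b < 0x80 for b in data)


# A non-ASCII character is encoded as a multi-byte sequence in UTF-8,
# and every byte of that sequence has its high bit set.
e_acute = "é".encode("utf-8")

print(ascii_bytes == utf8_bytes)   # identical encodings for ASCII text
print(is_ascii_only(utf8_bytes))   # no byte above 0x7F
print(is_ascii_only(e_acute))      # multi-byte sequence is not ASCII
```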

All internet protocols maintained by the [[IETF|Internet Engineering Task Force]], e.g. [[File Transfer Protocol|File Transfer Protocol (FTP)]],<ref>{{cite IETF
 | rfc           = 2640
 | author        = B. Curtin
 | title         = Internationalization of the File Transfer Protocol
 | date          = July 1999
 | access-date   = 2025-04-12
}}</ref> have required support for UTF-8 since the publication of {{IETF RFC|2277}} in 1998, which specified that all IETF protocols "MUST be able to use the UTF-8 charset".<ref>{{cite IETF
 | rfc           = 2277
 | bcp           = 18
 | title         = IETF Policy on Character Sets and Languages
 | author        = H. Alvestrand
 | date          = January 1998
 | access-date   = 2025-04-12
 | archive-url   = https://archive.org/details/rfc2277
 | archive-date  = 2023-01-23
 | url-status    = live
}}</ref>

=== Operating systems ===
Unicode has become the dominant scheme for the internal processing and storage of text. Although a great deal of text is still stored in legacy encodings, Unicode is used almost exclusively for building new information processing systems. Early adopters tended to use [[Universal Coded Character Set|UCS-2]] (the fixed-length two-byte obsolete precursor to UTF-16) and later moved to [[UTF-16]] (the variable-length current standard), as this was the least disruptive way to add support for non-BMP characters. The best known such system is [[Windows NT]] (and its descendants, [[Windows 2000|2000]], [[Windows XP|XP]], [[Windows Vista|Vista]], [[Windows 7|7]], [[Windows 8|8]], [[Windows 10|10]], and [[Windows 11|11]]), which uses UTF-16 as the sole internal character encoding. The [[Java virtual machine|Java]] and [[.NET Framework|.NET]] bytecode environments, [[macOS]], and [[KDE]] also use it for internal representation. Partial support for Unicode can be installed on [[Windows 9x]] through the Microsoft Layer for Unicode.
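
The difference between fixed-length UCS-2 and variable-length UTF-16 can be sketched as follows. Characters in the BMP occupy one 16-bit code unit, while non-BMP characters are split into a surrogate pair. The function names are hypothetical and serve only as an illustration.

```python
def utf16_code_units(ch: str) -> list[int]:
    """Return the UTF-16 code units of a character as integers."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only non-BMP code points need surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


print([hex(u) for u in utf16_code_units("A")])            # one code unit (BMP)
print([hex(u) for u in utf16_code_units("\U0001F600")])   # two code units (non-BMP)
```

Software written for UCS-2 can store both code units unchanged, which is one way to read the claim that the move to UTF-16 was the least disruptive path.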

[[UTF-8]] (originally developed for [[Plan 9 from Bell Labs|Plan 9]])<ref>{{Cite web |last=Pike |first=Rob |author-link=Rob Pike |date=30 April 2003 |title=UTF-8 history |url=https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt}}</ref> has become the main storage encoding on most [[Unix-like]] operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional [[extended ASCII]] character sets. UTF-8 is also the most common Unicode encoding used in [[HTML]] documents on the [[World Wide Web]].

Multilingual text-rendering engines which use Unicode include [[Uniscribe]] and [[DirectWrite]] for Microsoft Windows, [[ATSUI]] and [[Core Text]] for macOS, and [[Pango]] for [[GTK+]] and the [[GNOME]] desktop.

=== Input methods ===
{{Main|Unicode input}}

Because keyboard layouts cannot have simple key combinations for all characters, several operating systems provide alternative input methods that allow access to the entire repertoire.

[[ISO/IEC 14755]],<ref>{{Cite web
 | title         = ISO/IEC JTC1/SC 18/WG 9 N
 | url           = https://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf
 | access-date   = 2025-04-12
 | archive-url   = https://web.archive.org/web/20250122223453/https://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf
 | archive-date  = 2025-01-22
 | url-status    = live
}}</ref> which standardises methods for entering Unicode characters from their code points, specifies several of them. In the ''basic method'', a ''beginning sequence'' is followed by the hexadecimal representation of the code point and an ''ending sequence''. A ''screen-selection entry method'' is also specified, in which the characters are listed in a table on a screen, such as with a character map program.
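
The core step of the basic method, turning typed hexadecimal digits into a character, might look like the following sketch. It does not model the beginning and ending sequences, which are platform-specific, and the function name <code>char_from_hex</code> is an assumption made for this example.

```python
def char_from_hex(hex_digits: str) -> str:
    """Convert hexadecimal digits (e.g. '20AC') to the corresponding character."""
    code_point = int(hex_digits, 16)       # raises ValueError on non-hex input
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    return chr(code_point)                 # raises ValueError above U+10FFFF


print(char_from_hex("20AC"))   # EURO SIGN
print(char_from_hex("3b1"))    # GREEK SMALL LETTER ALPHA
```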

Online tools for finding the code point for a known character include Unicode Lookup<ref>{{Cite web
 | surname       = Hedley
 | given         = Jonathan
 | year          = 2009
 | title         = Unicode Lookup
 | url           = https://unicodelookup.com/
 | access-date   = 2025-04-12
 | archive-url   = https://web.archive.org/web/20250330001809/https://unicodelookup.com/
 | archive-date  = 2025-03-30
 | url-status    = live
}}</ref> by Jonathan Hedley and Shapecatcher<ref>{{Cite web
 | surname       = Milde
 | given         = Benjamin
 | title         = Unicode Character Recognition
 | url           = https://shapecatcher.com/
 | year          = 2025
 | archive-url   = https://web.archive.org/web/20250402224851/https://shapecatcher.com/
 | archive-date  = 2025-04-02
 | url-status    = live
}}</ref> by Benjamin Milde. In Unicode Lookup, one enters a search key (e.g. "fractions"), and a list of corresponding characters with their code points is returned. In Shapecatcher, based on [[Shape context]], one draws the character in a box and a list of characters approximating the drawing, with their code points, is returned.
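
A comparable lookup can be done offline with Python's standard <code>unicodedata</code> module. The sketch below is not how either website works; the function names <code>describe</code> and <code>search</code> are hypothetical, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and character name, e.g. 'U+00BD VULGAR FRACTION ONE HALF'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def search(keyword: str, limit: int = 10) -> list[str]:
    """Return up to `limit` characters whose Unicode name contains the keyword."""
    keyword = keyword.upper()
    found = []
    for code_point in range(0x110000):
        ch = chr(code_point)
        if keyword in unicodedata.name(ch, ""):
            found.append(ch)
            if len(found) >= limit:
                break
    return found


print(describe("½"))
for ch in search("vulgar fraction", limit=5):
    print(describe(ch))
```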

=== Email ===
{{Main|Unicode and email}}

[[MIME]] defines two different mechanisms for encoding non-ASCII characters in email, depending on whether the characters are in email headers (such as the "Subject:"), or in the text body of the message; in both cases, the original character set is identified as well as a transfer encoding. For email transmission of Unicode, the [[UTF-8]] character set and the [[Base64]] or the [[Quoted-printable]] transfer encoding are recommended, depending on whether much of the message consists of [[ASCII]] characters. The details of the two different mechanisms are specified in the MIME standards and generally are hidden from users of email software.
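
As an illustration of the two mechanisms, the following sketch uses Python's standard <code>email</code> package to build a message with a non-ASCII subject and body, once with Quoted-printable and once with Base64. The addresses and the helper name <code>build_message</code> are placeholders, and the exact header folding is left to the library.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_message(subject: str, body: str, cte: str) -> bytes:
    """Serialize a UTF-8 message using the given content transfer encoding."""
    msg = EmailMessage(policy=policy.SMTP)
    msg["From"] = "sender@example.com"       # placeholder address
    msg["To"] = "recipient@example.com"      # placeholder address
    msg["Subject"] = subject                 # non-ASCII text becomes an encoded word
    msg.set_content(body, charset="utf-8", cte=cte)
    return msg.as_bytes()


subject = "Grüße aus Köln"
body = "Schöne Grüße!\n"
qp_wire = build_message(subject, body, "quoted-printable")
b64_wire = build_message(subject, body, "base64")

# Parsing either serialization recovers the original Unicode text.
for wire in (qp_wire, b64_wire):
    parsed = message_from_bytes(wire, policy=policy.default)
    print(parsed["Subject"], "|", parsed.get_content().strip())
```

Quoted-printable keeps mostly-ASCII text readable in the raw message, whereas Base64 tends to be more compact when most of the text is non-ASCII.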

The IETF has defined<ref>{{cite IETF
 | rfc         = 4952
 | title       = Overview and Framework for Internationalized Email
 | author1     = J. Klensin
 | author2     = Y. Ko
 | date        = July 2007
 | access-date = 17 August 2022
}}</ref><ref>{{cite IETF
 | rfc         = 6530
 | title       = Overview and Framework for Internationalized Email
 | author1     = J. Klensin
 | author2     = Y. Ko
 | date        = February 2012
 | access-date = 17 August 2022
}}</ref> a framework for internationalized email using UTF-8, and has updated<ref>{{cite IETF
 | rfc         = 6531
 | title       = SMTP Extension for Internationalized Email
 | author1     = J. Yao
 | author2     = W. Mao
 | date        = February 2012
 | access-date = 17 August 2022
}}</ref><ref>{{cite IETF
 | rfc         = 6532
 | title       = Internationalized Email Headers
 | author1     = A. Yang
 | author2     = S. Steele
 | author3     = N. Freed
 | date        = February 2012
 | access-date = 17 August 2022
}}</ref><ref>{{cite IETF
 | rfc         = 5255
 | title       = Internet Message Access Protocol Internationalization
 | author1     = C. Newman
 | author2     = A. Gulbrandsen
 | author3     = A. Melnikov
 | date        = June 2008
 | access-date = 17 August 2022
}}</ref><ref>{{cite IETF
 | rfc         = 5721
 | title       = POP3 Support for UTF-8
 | author1     = R. Gellens
 | author2     = C. Newman
 | date        = February 2010
 | access-date = 17 August 2022
}}</ref> several protocols in accordance with that framework.

The adoption of Unicode in email has been very slow.{{citation needed|date=November 2022}} Some East Asian text is still encoded in encodings such as [[ISO-2022]], and some devices, such as mobile phones,{{citation needed|reason=is this outdated?|date=November 2022}} still cannot correctly handle Unicode data. Support has been improving, however. Many major free mail providers such as [[Yahoo! Mail]], [[Gmail]], and [[Outlook.com]] support it.

=== Web ===
{{Main|Unicode and HTML}}

All [[W3C]] recommendations have used Unicode as their ''document character set'' since HTML 4.0. [[Web browser]]s have supported Unicode, especially UTF-8, for many years. There used to be display problems resulting primarily from [[typeface|font]] related issues; e.g. v6 and older of Microsoft [[Internet Explorer]] did not render many code points unless explicitly told to use a font that contains them.<ref>{{Cite web
 | last          = Wood
 | first         = Alan
 | title         = Setting up Windows Internet Explorer 5, 5.5 and 6 for Multilingual and Unicode Support: ''Options for enabling Unicode in Internet Explorer 5, 5.5 and 6: Fonts (IE 5, 5.5 and 6)''
 | url           = https://www.alanwood.net/unicode/explorer.html#ie5
 | publisher     = Alan Wood
 | date          = 2005-09-13
 | access-date   = 2025-04-12
 | archive-url   = https://web.archive.org/web/20250120141644/https://www.alanwood.net/unicode/explorer.html#ie5
 | archive-date  = 2025-01-20
 | url-status    = live
}}</ref>

Although syntax rules may affect the order in which characters are allowed to appear, [[XML]] (including [[XHTML]]) documents, by definition,<ref>{{Cite web
 | title         = Extensible Markup Language (XML) 1.1 (Second Edition)
 | url           = https://www.w3.org/TR/xml11
 | publisher     = [[World Wide Web Consortium]]
 | date          = 2006-09-29
 | access-date   = 2025-04-12
 | archive-url   = https://web.archive.org/web/20250405204806/https://www.w3.org/TR/xml11/
 | archive-date  = 2025-04-05
 | url-status    = live
}}</ref> comprise characters from most of the Unicode code points, with the exception of:

* most of the [[C0 and C1 control codes|C0 control codes]],
* the permanently unassigned code points D800–DFFF, and
* FFFE or FFFF.
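
The exclusions listed above can be expressed as a simple range check. The sketch below follows the <code>Char</code> production of XML 1.0, which matches this list; XML 1.1 handles control codes somewhat differently, so the choice of XML 1.0 is an assumption of this example, as is the function name.

```python
def is_xml10_char(code_point: int) -> bool:
    """Return True if the code point may appear in an XML 1.0 document."""
    return (
        code_point in (0x09, 0x0A, 0x0D)         # the only permitted C0 controls
        or 0x20 <= code_point <= 0xD7FF          # stops before the surrogates
        or 0xE000 <= code_point <= 0xFFFD        # excludes FFFE and FFFF
        or 0x10000 <= code_point <= 0x10FFFF     # supplementary planes
    )


for cp in (0x00, 0x09, 0x41, 0xD800, 0xFFFE, 0x1F600):
    print(f"U+{cp:04X}", is_xml10_char(cp))
```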

HTML characters appear either directly as [[byte]]s according to the document's encoding, if the encoding supports them, or as numeric character references based on the character's Unicode code point. For example, the references <code>&amp;#916;</code>, <code>&amp;#1049;</code>, <code>&amp;#1511;</code>, <code>&amp;#1605;</code>, <code>&amp;#3671;</code>, <code>&amp;#12354;</code>, <code>&amp;#21494;</code>, <code>&amp;#33865;</code>, and <code>&amp;#47568;</code> (or the same numeric values expressed in hexadecimal, with <code>&amp;#x</code> as the prefix) should display in all browsers as Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.
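
The mapping between characters and numeric character references can be sketched with Python's standard <code>html</code> module. The helper names are hypothetical; <code>html.unescape</code> is the library function that resolves a reference back to its character.

```python
import html


def to_decimal_ref(ch: str) -> str:
    """Build the decimal numeric character reference for a character."""
    return f"&#{ord(ch)};"


def to_hex_ref(ch: str) -> str:
    """Build the hexadecimal numeric character reference for a character."""
    return f"&#x{ord(ch):X};"


for ch in "ΔЙあ葉말":
    decimal, hexadecimal = to_decimal_ref(ch), to_hex_ref(ch)
    # Both forms resolve to the same character.
    print(decimal, hexadecimal, html.unescape(decimal) == html.unescape(hexadecimal) == ch)
```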

When specifying [[Uniform Resource Identifier|URIs]], for example as [[URL]]s in [[HTTP]] requests, non-ASCII characters must be [[percent encoding|percent-encoded]].
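
Percent-encoding of a non-ASCII path can be sketched with Python's standard <code>urllib.parse</code> module, which by default encodes the text as UTF-8 and then escapes each resulting byte. The example path is made up for illustration.

```python
from urllib.parse import quote, unquote

path = "/wiki/café"
encoded = quote(path)        # "/" is left unescaped by default
decoded = unquote(encoded)   # reverses the escaping, assuming UTF-8

print(encoded)
print(decoded == path)
```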

=== Fonts ===
{{Main|Unicode font}}

Unicode is not in principle concerned with fonts ''per se'', seeing them as implementation choices.<ref>{{Cite journal
 | last1         = Bigelow
 | first1        = Charles
 | last2         = Holmes
 | first2        = Kris
 | date          = September 1993
 | title         = The design of a Unicode font
 | url           = http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf
 | journal       = Electronic Publishing
 | issn          = 0894-3982
 | volume        = 6
 | issue         = 3
 | page          = 292
 | access-date   = 2025-04-12
 | archive-url   = https://web.archive.org/web/20250216000657/http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf
 | archive-date  = 2025-02-16
 | url-status    = live
}}</ref> Any given character may have many [[allograph]]s, from the more common bold, italic and base letterforms to complex decorative styles. A font is "Unicode compliant" if the glyphs in the font can be accessed using code points defined in ''The Unicode Standard''.<ref>{{Cite web
 | title         = FAQs: Fonts and keyboards: ''Fonts and Unicode''
 | url           = https://www.unicode.org/faq/font_keyboard.html
 | date          = 
 | access-date   = 2025-04-12
 | publisher     = [[Unicode Consortium]]
 | archive-url   = https://web.archive.org/web/20250306103512/https://www.unicode.org/faq/font_keyboard.html
 | archive-date  = 2025-03-06
 | url-status    = live
}}</ref> The standard does not specify a minimum number of characters that must be included in the font; some fonts have quite a small repertoire.

Free and retail [[font]]s based on Unicode are widely available, since [[TrueType]] and [[OpenType]] support Unicode (and [[Web Open Font Format]] (WOFF and [[WOFF2]]) is based on those). These font formats map Unicode code points to glyphs, but OpenType and TrueType font files are restricted to 65,535 glyphs. Collection files provide a "gap mode" mechanism for overcoming this limit in a single font file. (Each font within the collection still has the 65,535 limit, however.) A TrueType Collection file would typically have a file extension of ".ttc".

[[List of typefaces|Thousands of fonts]] exist on the market, but fewer than a dozen fonts—sometimes described as "pan-Unicode" fonts—attempt to support the majority of Unicode's character repertoire. Instead, Unicode-based [[List of Unicode fonts|fonts]] typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. Several reasons justify this approach: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to demand resources in computing environments; and operating systems and applications show increasing intelligence in regard to obtaining glyph information from separate font files as needed, i.e., [[font substitution]]. Furthermore, designing a consistent set of rendering instructions for tens of thousands of glyphs constitutes a monumental task; such a venture passes the point of [[diminishing returns]] for most typefaces.

=== Newlines ===
Unicode partially addresses the [[newline]] problem that occurs when trying to read a text file on different platforms. Unicode defines a large number of [[Newline#Unicode|characters]] that conforming applications should recognize as line terminators.

Unicode introduced {{unichar|2028|LINE SEPARATOR}} and {{unichar|2029|PARAGRAPH SEPARATOR}} in an attempt to encode lines and paragraphs semantically, potentially replacing all of the various platform-specific solutions. Although these characters provide a way around the historical platform-dependent conventions, few if any systems have adopted them as the sole canonical line-ending characters. A more common approach is newline normalization, as used by the [[Cocoa text system]] in [[Mac OS X|macOS]] and by the W3C XML and HTML recommendations. In this approach, every possible newline character is converted internally to a common newline (which one does not really matter, since it is an internal operation just for rendering). In other words, the text system can correctly treat the character as a newline regardless of which newline convention the input used.
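
Newline normalization can be sketched as a single substitution. The set of characters treated as line terminators below (CR, LF, CRLF, VT, FF, NEL, LINE SEPARATOR and PARAGRAPH SEPARATOR) and the choice of LF as the internal form are assumptions of this example, not a description of any particular text system.

```python
import re

# CRLF is listed first so that it is replaced as a unit rather than as two newlines.
_NEWLINES = re.compile("\r\n|[\n\v\f\r\x85\u2028\u2029]")


def normalize_newlines(text: str) -> str:
    """Replace every recognised line terminator with a single LF."""
    return _NEWLINES.sub("\n", text)


mixed = "one\r\ntwo\rthree\u2028four\u2029five\x85six"
print(normalize_newlines(mixed).split("\n"))
```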