Editing
Unicode
(section)
== Adoption ==
{{See also|UTF-8#Implementations and adoption}}
{{Wikibooks|Unicode/Versions}}

Unicode, in the form of [[UTF-8]], has been the most common encoding for the [[World Wide Web]] since 2008.<ref name="markdavis">{{Cite web | last = Davis | first = Mark | author-link = Mark Davis (Unicode) | title = Moving to Unicode 5.1 | url = https://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html | date = 2008-05-05 | website = Official Google Blog | access-date = 2025-04-12 | archive-url = https://web.archive.org/web/20250401104941/https://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html | archive-date = 2025-04-01 | url-status = live }}</ref> It has near-universal adoption, and much of the non-UTF-8 content is found in other Unicode encodings, e.g. [[UTF-16]]. {{As of|2024}}, UTF-8 accounts for on average 98.3% of all web pages (and 983 of the top 1,000 highest-ranked web pages).<ref name="W3TechsWebEncoding">{{Cite web | title = Usage Survey of Character Encodings broken down by Ranking | url = https://w3techs.com/technologies/cross/character_encoding/ranking | date = | website = W3Techs | access-date = 2025-04-12 | language = en }}</ref> Although many pages only use [[ASCII]] characters to display content, UTF-8 was designed with 7-bit ASCII as a subset, and almost no websites now declare their encoding to be only ASCII instead of UTF-8.<ref>{{Cite web | title = Usage statistics of US-ASCII for websites | url = https://w3techs.com/technologies/details/en-usascii | access-date = 1 November 2020 | website = W3Techs }}</ref> Over a third of the languages tracked have 100% UTF-8 use.

All internet protocols maintained by the [[IETF|Internet Engineering Task Force]], e.g. [[File Transfer Protocol|File Transfer Protocol (FTP)]],<ref>{{cite IETF | rfc = 2640 | author = B. Curtin | title = Internationalization of the File Transfer Protocol | date = July 1999 | access-date = 2025-04-12 }}</ref> have required support for UTF-8 since the publication of {{IETF RFC|2277}} in 1998, which specified that all IETF protocols "MUST be able to use the UTF-8 charset".<ref>{{cite IETF | rfc = 2277 | bcp = 18 | title = IETF Policy on Character Sets and Languages | author = H. Alvestrand | date = January 1998 | access-date = 2025-04-12 | archive-url = https://archive.org/details/rfc2277 | archive-date = 2023-01-23 | url-status = live }}</ref>

=== Operating systems ===
Unicode has become the dominant scheme for the internal processing and storage of text. Although a great deal of text is still stored in legacy encodings, Unicode is used almost exclusively for building new information processing systems. Early adopters tended to use [[Universal Coded Character Set|UCS-2]] (the fixed-length two-byte obsolete precursor to UTF-16) and later moved to [[UTF-16]] (the variable-length current standard), as this was the least disruptive way to add support for non-BMP characters. The best known such system is [[Windows NT]] (and its descendants, [[Windows 2000|2000]], [[Windows XP|XP]], [[Windows Vista|Vista]], [[Windows 7|7]], [[Windows 8|8]], [[Windows 10|10]], and [[Windows 11|11]]), which uses UTF-16 as the sole internal character encoding. The [[Java virtual machine|Java]] and [[.NET Framework|.NET]] bytecode environments, [[macOS]], and [[KDE]] also use it for internal representation. Partial support for Unicode can be installed on [[Windows 9x]] through the Microsoft Layer for Unicode.
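
The difference between fixed-length UCS-2 and variable-length UTF-16 described above can be illustrated with a minimal Python sketch. The helper name <code>utf16_code_units</code> is hypothetical and chosen only for this illustration; the arithmetic follows the standard surrogate-pair scheme, and the loop cross-checks the result against Python's built-in <code>utf-16-be</code> codec.

```python
def utf16_code_units(ch: str) -> list[int]:
    """Return the UTF-16 code units for a single character (illustrative helper)."""
    cp = ord(ch)
    if cp < 0x10000:
        # BMP character: one 16-bit code unit, exactly as UCS-2 would store it.
        return [cp]
    # Non-BMP character: split the 20-bit offset into a high and a low surrogate.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


for ch in ("A", "é", "€", "😀"):
    units = utf16_code_units(ch)
    # Cross-check against Python's own encoder (big-endian, no byte order mark).
    assert ch.encode("utf-16-be") == b"".join(u.to_bytes(2, "big") for u in units)
    units_hex = " ".join(f"{u:04X}" for u in units)
    print(f"U+{ord(ch):04X} -> {units_hex}")
```

Only the last character in the loop lies outside the BMP, so it is the only one that needs two code units; a UCS-2-only system has no way to represent it.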
[[UTF-8]] (originally developed for [[Plan 9 from Bell Labs|Plan 9]])<ref>{{Cite web |last=Pike |first=Rob |author-link=Rob Pike |date=30 April 2003 |title=UTF-8 history |url=https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt}}</ref> has become the main storage encoding on most [[Unix-like]] operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional [[extended ASCII]] character sets. UTF-8 is also the most common Unicode encoding used in [[HTML]] documents on the [[World Wide Web]].

Multilingual text-rendering engines which use Unicode include [[Uniscribe]] and [[DirectWrite]] for Microsoft Windows, [[ATSUI]] and [[Core Text]] for macOS, and [[Pango]] for [[GTK+]] and the [[GNOME]] desktop.

=== Input methods ===
{{Main|Unicode input}}
Because keyboard layouts cannot have simple key combinations for all characters, several operating systems provide alternative input methods that allow access to the entire repertoire.

[[ISO/IEC 14755]],<ref>{{Cite web | title = ISO/IEC JTC1/SC 18/WG 9 N | url = https://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf | access-date = 2025-04-12 | archive-url = https://web.archive.org/web/20250122223453/https://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14755.pdf | archive-date = 2025-01-22 | url-status = live }}</ref> which standardises methods for entering Unicode characters from their code points, specifies several methods. There is the ''Basic method'', where a ''beginning sequence'' is followed by the hexadecimal representation of the code point and the ''ending sequence''. There is also a ''screen-selection entry method'' specified, where the characters are listed in a table on a screen, such as with a character map program.
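
The core step of the ''Basic method'', turning typed hexadecimal digits into a character, can be sketched in Python as follows. This is only an illustration of the code point lookup: the function name <code>char_from_hex</code> is hypothetical, and the beginning and ending sequences are platform-specific and not modelled here.

```python
import unicodedata


def char_from_hex(hex_digits: str) -> str:
    """Map hexadecimal digits to the character at that code point (illustrative helper)."""
    cp = int(hex_digits, 16)
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"U+{cp:X} is outside the Unicode code space")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate code points are reserved for UTF-16 and are not characters.
        raise ValueError(f"U+{cp:04X} is a surrogate code point")
    return chr(cp)


ch = char_from_hex("00E9")
print(ch, unicodedata.name(ch))  # é LATIN SMALL LETTER E WITH ACUTE
```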
Online tools for finding the code point for a known character include Unicode Lookup<ref>{{Cite web | surname = Hedley | given = Jonathan | year = 2009 | title = Unicode Lookup | url = https://unicodelookup.com/ | access-date = 2025-04-12 | archive-url = https://web.archive.org/web/20250330001809/https://unicodelookup.com/ | archive-date = 2025-03-30 | url-status = live }}</ref> by Jonathan Hedley and Shapecatcher<ref>{{Cite web | surname = Milde | given = Benjamin | title = Unicode Character Recognition | url = https://shapecatcher.com/ | year = 2025 | archive-url = https://web.archive.org/web/20250402224851/https://shapecatcher.com/ | archive-date = 2025-04-02 | url-status = live }}</ref> by Benjamin Milde. In Unicode Lookup, one enters a search key (e.g. "fractions"), and a list of corresponding characters with their code points is returned. In Shapecatcher, based on [[Shape context]], one draws the character in a box and a list of characters approximating the drawing, with their code points, is returned.

=== Email ===
{{Main|Unicode and email}}
[[MIME]] defines two different mechanisms for encoding non-ASCII characters in email, depending on whether the characters are in email headers (such as the "Subject:"), or in the text body of the message; in both cases, the original character set is identified as well as a transfer encoding. For email transmission of Unicode, the [[UTF-8]] character set and the [[Base64]] or the [[Quoted-printable]] transfer encoding are recommended, depending on whether much of the message consists of [[ASCII]] characters. The details of the two different mechanisms are specified in the MIME standards and generally are hidden from users of email software.

The IETF has defined<ref>{{cite IETF | rfc = 4952 | title = Overview and Framework for Internationalized Email | author1 = J. Klensin | author2 = Y. Ko | date = July 2007 | access-date = 17 August 2022 }}</ref><ref>{{cite IETF | rfc = 6530 | title = Overview and Framework for Internationalized Email | author1 = J. Klensin | author2 = Y. Ko | date = February 2012 | access-date = 17 August 2022 }}</ref> a framework for internationalized email using UTF-8, and has updated<ref>{{cite IETF | rfc = 6531 | title = SMTP Extension for Internationalized Email | author1 = J. Yao | author2 = W. Mao | date = February 2012 | access-date = 17 August 2022 }}</ref><ref>{{cite IETF | rfc = 6532 | title = Internationalized Email Headers | author1 = A. Yang | author2 = S. Steele | author3 = N. Freed | date = February 2012 | access-date = 17 August 2022 }}</ref><ref>{{cite IETF | rfc = 5255 | title = Internet Message Access Protocol Internationalization | author1 = C. Newman | author2 = A. Gulbrandsen | author3 = A. Melnikov | date = June 2008 | access-date = 17 August 2022 }}</ref><ref>{{cite IETF | rfc = 5721 | title = POP3 Support for UTF-8 | author1 = R. Gellens | author2 = C. Newman | date = February 2010 | access-date = 17 August 2022 }}</ref> several protocols in accordance with that framework.

The adoption of Unicode in email has been very slow.{{citation needed|date=November 2022}} Some East Asian text is still encoded in encodings such as [[ISO-2022]], and some devices, such as mobile phones,{{citation needed|reason=is this outdated?|date=November 2022}} still cannot correctly handle Unicode data. Support has been improving, however. Many major free mail providers such as [[Yahoo! Mail]], [[Gmail]], and [[Outlook.com]] support it.

=== Web ===
{{Main|Unicode and HTML}}
All [[W3C]] recommendations have used Unicode as their ''document character set'' since HTML 4.0. [[Web browser]]s have supported Unicode, especially UTF-8, for many years. There used to be display problems resulting primarily from [[typeface|font]]-related issues; for example, version 6 and older of Microsoft [[Internet Explorer]] did not render many code points unless explicitly told to use a font that contains them.<ref>{{Cite web | last = Wood | first = Alan | title = Setting up Windows Internet Explorer 5, 5.5 and 6 for Multilingual and Unicode Support: ''Options for enabling Unicode in Internet Explorer 5, 5.5 and 6: Fonts (IE 5, 5.5 and 6)'' | url = https://www.alanwood.net/unicode/explorer.html#ie5 | publisher = Alan Wood | date = 2005-09-13 | access-date = 2025-04-12 | archive-url = https://web.archive.org/web/20250120141644/https://www.alanwood.net/unicode/explorer.html#ie5 | archive-date = 2025-01-20 | url-status = live }}</ref>

Although syntax rules may affect the order in which characters are allowed to appear, [[XML]] (including [[XHTML]]) documents, by definition,<ref>{{Cite web | title = Extensible Markup Language (XML) 1.1 (Second Edition) | url = https://www.w3.org/TR/xml11 | publisher = [[World Wide Web Consortium]] | date = 2006-09-29 | access-date = 2025-04-12 | archive-url = https://web.archive.org/web/20250405204806/https://www.w3.org/TR/xml11/ | archive-date = 2025-04-05 | url-status = live }}</ref> comprise characters from most of the Unicode code points, with the exception of:
* most of the [[C0 and C1 control codes|C0 control codes]],
* the permanently unassigned code points D800–DFFF, and
* FFFE or FFFF.

HTML characters appear either directly as [[byte]]s according to the document's encoding, if the encoding supports them, or as numeric character references based on the character's Unicode code point. For example, the references <code>&#916;</code>, <code>&#1049;</code>, <code>&#1511;</code>, <code>&#1605;</code>, <code>&#3671;</code>, <code>&#12354;</code>, <code>&#21494;</code>, <code>&#33865;</code>, and <code>&#47568;</code> (or the same numeric values expressed in hexadecimal, with <code>&#x</code> as the prefix) should display on all browsers as Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.
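
The equivalence between decimal and hexadecimal numeric character references can be checked with a short Python sketch using the standard library's <code>html.unescape</code>. The helper <code>to_numeric_reference</code> is a hypothetical name introduced only for this example, and the four sample characters are a subset of those listed above.

```python
import html

# Decimal references and the same code points written in hexadecimal.
decimal_refs = "&#916; &#1049; &#12354; &#47568;"
hex_refs = "&#x394; &#x419; &#x3042; &#xB9D0;"

# Both spellings resolve to the same characters.
assert html.unescape(decimal_refs) == html.unescape(hex_refs)
print(html.unescape(decimal_refs))


def to_numeric_reference(ch: str, hexadecimal: bool = False) -> str:
    """Build a numeric character reference for one character (illustrative helper)."""
    cp = ord(ch)
    return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"


print(to_numeric_reference("Δ"), to_numeric_reference("Δ", hexadecimal=True))
```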
When specifying [[Uniform Resource Identifier|URIs]], for example as [[URL]]s in [[HTTP]] requests, non-ASCII characters must be [[percent encoding|percent-encoded]].

=== Fonts ===
{{Main|Unicode font}}
Unicode is not in principle concerned with fonts ''per se'', seeing them as implementation choices.<ref>{{Cite journal | last1 = Bigelow | first1 = Charles | last2 = Holmes | first2 = Kris | date = September 1993 | title = The design of a Unicode font | url = http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf | journal = Electronic Publishing | issn = 0894-3982 | volume = 6 | issue = 3 | page = 292 | access-date = 2025-04-12 | archive-url = https://web.archive.org/web/20250216000657/http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf | archive-date = 2025-02-16 | url-status = live }}</ref> Any given character may have many [[allograph]]s, from the more common bold, italic and base letterforms to complex decorative styles. A font is "Unicode compliant" if the glyphs in the font can be accessed using code points defined in ''The Unicode Standard''.<ref>{{Cite web | title = FAQs: Fonts and keyboards: ''Fonts and Unicode'' | url = https://www.unicode.org/faq/font_keyboard.html | date = | access-date = 2025-04-12 | publisher = [[Unicode Consortium]] | archive-url = https://web.archive.org/web/20250306103512/https://www.unicode.org/faq/font_keyboard.html | archive-date = 2025-03-06 | url-status = live }}</ref> The standard does not specify a minimum number of characters that must be included in the font; some fonts have quite a small repertoire.

Free and retail [[font]]s based on Unicode are widely available, since [[TrueType]] and [[OpenType]] support Unicode (and [[Web Open Font Format]] (WOFF and [[WOFF2]]) is based on those). These font formats map Unicode code points to glyphs, but OpenType and TrueType font files are restricted to 65,535 glyphs. Collection files provide a "gap mode" mechanism for overcoming this limit in a single font file. (Each font within the collection still has the 65,535 limit, however.) A TrueType Collection file would typically have a file extension of ".ttc".

[[List of typefaces|Thousands of fonts]] exist on the market, but fewer than a dozen fonts (sometimes described as "pan-Unicode" fonts) attempt to support the majority of Unicode's character repertoire. Instead, Unicode-based [[List of Unicode fonts|fonts]] typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. Several reasons justify this approach: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to demand resources in computing environments; and operating systems and applications show increasing intelligence in regard to obtaining glyph information from separate font files as needed, i.e., [[font substitution]]. Furthermore, designing a consistent set of rendering instructions for tens of thousands of glyphs constitutes a monumental task; such a venture passes the point of [[diminishing returns]] for most typefaces.

=== Newlines ===
Unicode partially addresses the [[newline]] problem that occurs when trying to read a text file on different platforms. Unicode defines a large number of [[Newline#Unicode|characters]] that conforming applications should recognize as line terminators.

In terms of the newline, Unicode introduced {{unichar|2028|LINE SEPARATOR}} and {{unichar|2029|PARAGRAPH SEPARATOR}}. This was an attempt to provide a Unicode solution to encoding paragraphs and lines semantically, potentially replacing all of the various platform solutions. In doing so, Unicode does provide a way around the historical platform-dependent solutions. Nonetheless, few if any Unicode solutions have adopted these Unicode line and paragraph separators as the sole canonical line ending characters.
However, a common approach to solving this issue is newline normalization. This is done by the [[Cocoa text system]] in [[Mac OS X|macOS]] and is also specified by the W3C XML and HTML recommendations. In this approach, every possible newline character is converted internally to a common newline (which one does not really matter, since it is an internal operation just for rendering). In other words, the text system can correctly treat the character as a newline, regardless of which newline convention the input uses.
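
A minimal Python sketch of newline normalization is shown below. It is not how any particular text system implements it; the name <code>normalize_newlines</code> and the choice of LF as the internal newline are assumptions made for illustration. The set of terminators handled (CR, LF, CRLF, VT, FF, NEL, LINE SEPARATOR and PARAGRAPH SEPARATOR) follows the characters commonly treated as Unicode newline functions.

```python
import re

# CRLF is listed first so that it is replaced as one unit, not as CR followed by LF.
_NEWLINES = re.compile("\r\n|[\n\v\f\r\x85\u2028\u2029]")


def normalize_newlines(text: str, newline: str = "\n") -> str:
    """Replace every recognised line terminator with a single internal newline."""
    return _NEWLINES.sub(newline, text)


mixed = "unix\nwindows\r\nclassic mac\rnel\x85line sep\u2028para sep\u2029end"
print(normalize_newlines(mixed).split("\n"))
```

Because the replacement is purely internal, the choice of target character is arbitrary; passing a different <code>newline</code> argument gives the same line structure.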