{{Short description|Variable-width encoding of Unicode, using one or two 16-bit code units}} {{Infobox character encoding | name = UTF-16 | mime = | alias = | image = UTF-16 encoding.svg | caption = Example of Unicode character encoding through UTF-16 | standard = Unicode Standard | classification = [[Unicode Transformation Format]], [[variable-width encoding]] | lang = International | status = | encodes = [[ISO/IEC 10646]] ([[Unicode]]) | extends = UCS-2 | prev = | next = }} '''UTF-16''' ([[16-bit computing|16-bit]] [[Unicode]] Transformation Format) is a [[character encoding]] that supports all 1,112,064 valid [[code point]]s of Unicode.<ref>{{cite book |title=The Unicode Standard |publisher=[[The Unicode Consortium]] |isbn=978-1-936213-01-6 |edition=6.0 |location=Mountain View, California, US |at=3.9 Unicode Encoding Forms |chapter=Conformance |quote=Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF |chapter-url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G7404}}</ref>{{efn|This number is in fact a consequence of the design of UTF-16}} The encoding is [[variable-width encoding|variable-length]] as code points are encoded with one or two {{nobreak|16-bit}} ''code units''. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding now known as '''UCS-2''' (for 2-byte Universal Character Set),<ref name="unicode-6_0">{{Cite book |title=The Unicode Standard, version 6.0 |date=February 2011 |publisher=[[Unicode Consortium]] |isbn=978-1-936213-01-6 |location=Mountain View, CA |pages=573 |chapter=C.2 Encoding Forms in ISO/IEC 10646 |quote=[...] the term UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard. |chapter-url=https://www.unicode.org/versions/Unicode6.0.0/appC.pdf}}</ref><ref name="ucs-2-utf-16-differences">{{Cite web |title=FAQ: What is the difference between UCS-2 and UTF-16? 
|url=https://www.unicode.org/faq/utf_bom.html#utf16-11 |archive-url=https://web.archive.org/web/20030818043641/http://www.unicode.org/faq/basic_q.html#23 |archive-date=2003-08-18 |access-date=2024-03-19 |website=unicode.org |quote=UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1 [...]}}</ref> once it became clear that more than 2<sup>16</sup> (65,536) code points were needed,<ref name="Unicode.org/faq">{{cite web|title=What is UTF-16?|url=https://www.unicode.org/faq/utf_bom.html#utf16-1|website=The Unicode Consortium|publisher=Unicode, Inc.|quote=UTF-16 uses a single 16-bit code unit to encode over 60,000 of the most common characters in Unicode <!-- struck out: "most common 63K characters", and a pair of 16-bit code units, called surrogates, to encode the remainder of about 1 million struck out less commonly used characters in Unicode. Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.-->|access-date=7 January 2023}}</ref> including most emoji and important [[CJK characters]] such as those used in personal and place names.<ref name="problems_of_only_BMP">{{Cite web |last=Lunde |first=Ken |date=2022-01-09 |title=2022 Top Ten List: Why Support Beyond-BMP Code Points? |url=https://ken-lunde.medium.com/2022-top-ten-list-why-support-beyond-bmp-code-points-6a946d7735f9 |website=Medium |language=en |quote=I first came up with the idea for this Top Ten List over 10 years ago, which was prompted by some environments that still supported only BMP code points. 
The idea, of course, was to motivate the developers of such environments to support code points beyond the BMP by providing an enumerated list of reasons to do so. And yes, there are still some environments that support only BMP code points, such as the VivaDesigner app.|access-date=2024-01-07}}</ref> UTF-16 is used by the [[Windows API]] and by many programming environments such as [[Java programming language|Java]] and [[Qt (software)|Qt]]. Because UTF-16 is variable-length but most characters in common use fit in a single code unit, the two-code-unit case is rarely tested, which has led to many bugs in software, including in Windows itself.<ref name=dialog_bug>{{Cite web |title=Should UTF-16 be considered harmful? |url=https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful |access-date=2024-11-20 |website=Software Engineering Stack Exchange |language=en |quote=File names editing in Window dialogs in broken (delete required 2 presses on backspace) }}</ref> UTF-16 is the only encoding still allowed on the web that is incompatible with 8-bit [[ASCII]].<ref>{{Cite web|date=2020-06-10|title=HTML Living Standard|url=https://html.spec.whatwg.org/multipage/infrastructure.html#encoding-terminology|access-date=2020-06-15|website=w3.org|quote=<!--Since support for encodings that are not defined in Encoding is prohibited,--> UTF-16 encodings are the only encodings that this specification needs to treat as not being ASCII-compatible encodings.|archive-url=https://web.archive.org/web/20200908111027/https://html.spec.whatwg.org/multipage/infrastructure.html|archive-date=2020-09-08|url-status=deviated}}</ref>{{efn|UTF-32 is also incompatible with ASCII, but is not listed as a web-encoding.<ref>{{Cite web|url=https://encoding.spec.whatwg.org/|title=Encoding Standard|website=encoding.spec.whatwg.org|access-date=2023-04-22}}</ref>}} However, it has never gained popularity on the web, where it is declared by under 0.004% of 
public web pages (and even then, the web pages are most likely also using [[UTF-8]]<!-- In all cases checked, so likely a config problem using UTF-16: e.g. https://w3techs.com/sites/info/progress.com "used on inner pages" https://w3techs.com/sites/info/upseller.com "used on a subdomain" -->).<ref>{{Cite web|url=https://w3techs.com/technologies/details/en-utf16/all/all|title=Usage Statistics of UTF-16 for Websites, September 2024|website=w3techs.com|language=en|access-date=2024-09-03}}</ref> UTF-8, by comparison, gained dominance years ago and accounted for 99% of all web pages by 2025.<ref>{{Cite web|title=Usage Statistics of UTF-8 for Websites, January 2025|url=https://w3techs.com/technologies/details/en-utf8/all/all|access-date=2025-01-07|website=w3techs.com|language=en}}</ref> The [[WHATWG|Web Hypertext Application Technology Working Group (WHATWG)]] considers UTF-8 "the mandatory encoding for all [text]" and states that, for security reasons, browser applications should not use UTF-16.<ref name="mandatory">{{Cite web|url=https://encoding.spec.whatwg.org/#security-background|title=Encoding Standard|website=encoding.spec.whatwg.org|quote=The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding. [..] The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that UTF-8 is now the mandatory encoding for all text things on the Web.|language=en|access-date=2018-10-22}}</ref> [[File:Unifont Full Map.png|thumb|310x310px|[[GNU Unifont]] 16.0.01 [[Plane 0]] map]]
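The "one or two 16-bit code units" scheme and the figure of 1,112,064 valid code points can be illustrated with a short Python sketch. The function names <code>valid_scalar_count</code> and <code>utf16_code_units</code> are hypothetical, chosen only for this illustration; the arithmetic follows the standard surrogate-pair construction (high surrogates U+D800..U+DBFF, low surrogates U+DC00..U+DFFF) and is not taken from any particular library.

```python
def valid_scalar_count():
    """Count the code points UTF-16 can encode.

    U+0000..U+10FFFF is 0x110000 values; the 2,048 surrogate code points
    U+D800..U+DFFF are reserved for pairing and are not valid on their own.
    """
    return 0x110000 - (0xE000 - 0xD800)


def utf16_code_units(code_point):
    """Return the UTF-16 code units (one or two integers) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single code unit equal to the code point.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


print(valid_scalar_count())                          # 1112064
print([hex(u) for u in utf16_code_units(0x41)])      # ['0x41']
print([hex(u) for u in utf16_code_units(0x1F600)])   # ['0xd83d', '0xde00']
```

The last line shows why code that counts code units instead of code points can misbehave: the single emoji U+1F600 occupies two code units, so an editor that deletes one code unit per backspace needs two presses to remove it. The result can be cross-checked against Python's built-in codec, where <code>"\U0001F600".encode("utf-16-be")</code> yields the bytes D8 3D DE 00.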