== Usage ==
UTF-16 is used for text in the OS [[API]] of all currently supported versions of [[Microsoft Windows]]<ref>{{cite web |url=https://learn.microsoft.com/en-us/windows/win32/intl/unicode |title=Unicode |website=[[Microsoft Learn]] |access-date=2011-03-08 |quote=These functions use UTF-16 (wide character) encoding (…) used for native Unicode encoding on Windows operating systems.}}</ref> (including at least [[Windows CE]] since [[Windows CE 5.0]]<ref>{{cite web |url=https://learn.microsoft.com/en-us/previous-versions/windows/embedded/ms904394(v=msdn.10) |title=Working With Unicode Surrogates (Windows CE 5.0) |date=2012-09-14 |website=[[Microsoft Learn]]}}</ref> and [[Windows NT]] since [[Windows 2000]]<ref>{{cite web |url=https://learn.microsoft.com/en-us/windows/win32/intl/surrogates-and-supplementary-characters |title=Surrogates and Supplementary Characters |date=2022-05-24 |website=[[Microsoft Learn]] |quote=Windows 2000 introduces support for basic input, output, and simple sorting of supplementary characters. However, not all system components are compatible with supplementary characters.}}</ref>). 
Windows 9x and NT prior to Windows 2000 only supported UCS-2.<ref>{{cite web |title=Unicode|publisher=microsoft.com |url=https://msdn.microsoft.com/en-us/library/dd374081.aspx |access-date=2009-07-20}}</ref><ref>{{cite web |title=Surrogates and Supplementary Characters |publisher=microsoft.com |url=https://msdn.microsoft.com/en-us/library/dd374069.aspx |access-date=2009-07-20}}</ref> Since [[Windows 10 version 1903]] (or [[Windows Insider|insider build]] 17035) it has been possible to use UTF-8 in the API,<ref name="Microsoft-UTF-8">{{cite web|title=Use UTF-8 code pages in Windows apps|url=https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page |access-date=2020-06-06 |quote=As of Windows version 1903 (May 2019 update), you can use the ActiveCodePage property in the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, to force a process to use UTF-8 as the process code page. [...] <code>CP_ACP</code> equates to <code>CP_UTF8</code> only if running on Windows version 1903 (May 2019 update) or above and the ActiveCodePage property described above is set to UTF-8. Otherwise, it honors the legacy system code page. We recommend using <code>CP_UTF8</code> explicitly. |website=learn.microsoft.com |language=en-us}}</ref> though most software, such as [[Windows File Explorer]], still uses the UTF-16 API. Microsoft has stated that "UTF-16 [..] is a unique burden that Windows places on code that targets multiple platforms"<ref name="Microsoft GDK">{{Cite web |title=UTF-8 support in the Microsoft Game Development Kit (GDK) - Microsoft Game Development Kit |url=https://learn.microsoft.com/en-us/gaming/gdk/_content/gc/system/overviews/utf-8 |access-date=2023-03-05 |website=learn.microsoft.com |language=en-us |quote=By operating in UTF-8, you can ensure maximum compatibility [..] Windows operates natively in UTF-16 (or WCHAR), which requires code page conversions by using MultiByteToWideChar and WideCharToMultiByte. 
This is a unique burden that Windows places on code that targets multiple platforms. [..] The Microsoft Game Development Kit (GDK) and Windows in general are moving forward to support UTF-8 to remove this unique burden of Windows on code targeting or interchanging with multiple platforms and the web. Also, this results in fewer internationalization issues in apps and games and reduces the test matrix that's required to get it right.}}</ref> Files and network data tend to be a mix of UTF-16, UTF-8, and legacy byte encodings.

[[SMS]] text messaging effectively uses UTF-16. The documentation specifies UCS-2 but UTF-16 is necessary for Emoji to work.<ref>{{cite web|url=https://www.twilio.com/engineering/2012/11/08/adventures-in-unicode-sms|title=Adventures in Unicode SMS|date=2012-11-08|publisher=Twilio|author=Chad Selph|access-date=2015-08-28|archive-url=https://web.archive.org/web/20150908104520/https://www.twilio.com/engineering/2012/11/08/adventures-in-unicode-sms|archive-date=2015-09-08|url-status=dead}}</ref>

The [[IBM i]] operating system designates [[CCSID]] ([[code page]]) 13488 for UCS-2 encoding and CCSID 1200 for UTF-16 encoding, though the system treats them both as UTF-16.<ref>{{cite web |url=https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_74/nls/rbagsucs2.htm |title=UCS-2 and its relationship to Unicode (UTF-16) |publisher=[[IBM]] |access-date=2019-04-26}}</ref>

UTF-16 is used by the [[Binary Runtime Environment for Wireless|Qualcomm BREW]] operating systems; the [[.NET Framework|.NET]] environments; and the [[Qt (toolkit)|Qt]] cross-platform graphical [[widget toolkit]].

[[Symbian|Symbian OS]] used in Nokia S60 handsets and Sony Ericsson [[UIQ]] handsets uses UCS-2. 
[[iPhone]] handsets use UTF-16 for [[Short Message Service]] instead of the UCS-2 described in the [[GSM 03.38|3GPP TS 23.038]] ([[GSM]]) and IS-637 ([[CDMA2000|CDMA]]) standards.<ref>{{cite web|url=https://www.twilio.com/engineering/2012/11/08/adventures-in-unicode-sms|title=Adventures in Unicode SMS|author-last=Selph|author-first=Chad|date=2012-11-08|publisher=Twilio|archive-url=https://web.archive.org/web/20121109052626/https://www.twilio.com/engineering/2012/11/08/adventures-in-unicode-sms|access-date=2015-08-28|archive-date=2012-11-09}}</ref>

The [[Joliet (file system)|Joliet file system]], used in [[CD-ROM]] media, encodes file names using UCS-2BE (up to sixty-four Unicode characters per file name).

[[Python (programming language)|Python]] version 2.0 officially only used UCS-2 internally, but the UTF-8 decoder to "Unicode" produced correct UTF-16. There was also the ability to compile Python so that it used UTF-32 internally; this was sometimes done on Unix. Python 3.3 switched internal storage to use one of [[ISO-8859-1]], UCS-2, or UTF-32 depending on the largest code point in the string.<ref>{{cite web |url=https://www.python.org/dev/peps/pep-0393/ |title=PEP 0393 – Flexible String Representation |work=Python.org |access-date=2015-05-29}}</ref> Python 3.12 drops some functionality (for CPython extensions) to make it easier to migrate to [[UTF-8]] for all strings.<ref>{{Cite web |title=PEP 623 – Remove wstr from Unicode {{!}} peps.python.org |url=https://peps.python.org/pep-0623/ |access-date=2023-02-24 |website=peps.python.org}}</ref>

[[Java (programming language)|Java]] originally used UCS-2, and added UTF-16 supplementary character support in [[Java Platform, Standard Edition|J2SE 5.0]]. 
Despite awareness of UTF-8<ref name="Java">{{Cite web |title=JEP 400: UTF-8 by Default |url=https://openjdk.org/jeps/400 |access-date=2023-03-12 |website=openjdk.org}}</ref> all strings are still UTF-16 (since Java 9, strings containing only ISO-8859-1 characters can be "compressed" to bytes<ref>{{Cite web |url=https://www.oracle.com/java/technologies/javase/9-new-features.html|title=JDK 9 Release Notes - New Features}}</ref>).

[[JavaScript]] may use UCS-2 or UTF-16.<ref name="mathiasbynens.be">{{Cite web|url=https://mathiasbynens.be/notes/javascript-encoding|title=JavaScript's internal character encoding: UCS-2 or UTF-16? · Mathias Bynens|website=mathiasbynens.be}}</ref> As of ES2015, string methods and regular expression flags have been added to the language that permit handling strings from an encoding-agnostic perspective.

[[UEFI]] uses UTF-16 to encode strings by default.

[[Swift (programming language)|Swift]], Apple's preferred application language, used UTF-16 to store strings until version 5 which switched to UTF-8.<ref>{{Cite web|date=2019-03-20|title=UTF-8 String|url=https://swift.org/blog/utf8-string/|access-date=2020-08-20|website=Swift.org|language=en}}</ref>

Quite a few languages make the encoding part of the string object, and thus store and support a large set of encodings including UTF-16. Most consider UTF-16 and UCS-2 to be different encodings. Examples are the [[PHP]] language<ref>{{cite web|url=https://php.net/manual/en/mbstring.supported-encodings.php|title=PHP: Supported Character Encodings - Manual|website=php.net}}</ref> and [[MySQL]].<ref>{{Cite web |title=MySQL :: MySQL 8.0 Reference Manual :: 10.9.2 The utf8mb3 Character Set (3-Byte UTF-8 Unicode Encoding) |url=https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html |access-date=2023-02-24 |website=dev.mysql.com}}</ref>

A method to determine what encoding a system is using internally is to ask for the "length" of a string containing a single non-BMP character. 
If the length is 2 then UTF-16 is being used. 4 indicates UTF-8. 3 or 6 may indicate [[CESU-8]]. 1 ''may'' indicate UTF-32, but more likely indicates the language decodes the string to code points before measuring the "length".

In many languages, quoted strings need a new syntax for quoting non-BMP characters, as the C-style <code>"\uXXXX"</code> syntax explicitly limits itself to 4 hex digits. The following examples illustrate the syntax for the non-BMP character {{unichar|1D11E|MUSICAL SYMBOL G CLEF}}:
* The most common ([[C++]], [[C Sharp (programming language)|C#]], [[D (programming language)|D]], and several other languages) is an upper-case 'U' with ''8'' hex digits such as <code>"\U0001D11E"</code>.<ref>{{cite web |url=http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences |title=ECMA-334: 9.4.1 Unicode escape sequences|website=en.csharp-online.net |archive-url=https://web.archive.org/web/20130215065218/http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences |archive-date=2013-02-15}}</ref>
* Java 7 regular expressions, [[International Components for Unicode|ICU]], and Perl use <code>"\x{1D11E}"</code>.
* [[ECMAScript]] 2015 (JavaScript) uses <code>"\u{1D11E}"</code>.
* In many other cases (such as Java outside of regular expressions),<ref>''Lexical Structure: Unicode Escapes'' in {{cite web |url=https://docs.oracle.com/javase/specs/jls/se6/html/lexical.html#3.3 |title=The Java Language Specification, Third Edition|website = Sun Microsystems, Inc. |year = 2005 | access-date=2019-10-11}}</ref> the only way to get non-BMP characters is to enter the surrogate halves individually: <code>"\uD834\uDD1E"</code>.
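
The "length" test described above can be illustrated with a minimal Python 3 sketch. It assumes a Python 3 interpreter, where <code>len()</code> on a string counts code points, so the code-unit counts are obtained by encoding explicitly. The function name <code>lengths</code> and its dictionary keys are chosen here for illustration only.

```python
# U+1D11E MUSICAL SYMBOL G CLEF, a single non-BMP character.
CLEF = "\U0001D11E"


def lengths(text):
    """Return the "length" of text as different systems might report it."""
    return {
        # Python 3 measures strings in code points.
        "code_points": len(text),
        # UTF-16 uses 2-byte code units; a non-BMP character needs two.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # UTF-8 uses 1-byte code units; a non-BMP character needs four.
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-32 uses one 4-byte code unit per code point.
        "utf32_units": len(text.encode("utf-32-le")) // 4,
    }


if __name__ == "__main__":
    for name, value in lengths(CLEF).items():
        print(name, value)
```

The explicit little-endian codecs (<code>utf-16-le</code>, <code>utf-32-le</code>) are used so that no byte order mark is added to the encoded output, which would otherwise inflate the counts.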
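
The surrogate halves used in the <code>"\uD834\uDD1E"</code> form can be derived from the code point with the standard UTF-16 arithmetic. The following Python sketch is illustrative only; the helper names <code>to_surrogate_pair</code> and <code>surrogate_escape</code> are assumptions made for this example and do not belong to any library.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def surrogate_escape(code_point):
    """Format the surrogate halves as two C-style 4-digit escapes."""
    high, low = to_surrogate_pair(code_point)
    return "\\u{:04X}\\u{:04X}".format(high, low)


if __name__ == "__main__":
    # U+1D11E MUSICAL SYMBOL G CLEF
    print(surrogate_escape(0x1D11E))  # prints \uD834\uDD1E
```

The result can be cross-checked against Python's own encoder: <code>"\U0001D11E".encode("utf-16-be")</code> yields the bytes <code>D8 34 DD 1E</code>.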