UTF-16
== Byte-order encoding schemes ==
UTF-16 and UCS-2 produce a sequence of 16-bit code units. Since most communication and storage protocols are defined for bytes, and each unit thus takes two 8-bit bytes, the order of the bytes may depend on the [[endianness]] (byte order) of the computer architecture.

To assist in recognizing the byte order of code units, '''UTF-16''' allows a [[byte order mark]] (BOM), a code point with the value U+FEFF, to precede the first actual coded value.{{efn|UTF-8 encoding produces byte values strictly less than 0xFE, so either byte in the BOM sequence also identifies the encoding as UTF-16 (assuming that UTF-32 is not expected).}} (U+FEFF is the invisible [[zero-width non-breaking space]]/ZWNBSP character.){{efn|Use of U+FEFF as the character ZWNBSP instead of as a BOM has been deprecated in favor of U+2060 (WORD JOINER); see [https://www.unicode.org/faq/utf_bom.html#BOM Byte Order Mark (BOM) FAQ] at Unicode.org. But if an application interprets an initial BOM as a character, the ZWNBSP character is invisible, so the impact is minimal.}}

If the endian architecture of the decoder matches that of the encoder, the decoder detects the 0xFEFF value, but an opposite-endian decoder interprets the BOM as the [[{{Proper name|noncharacter}}]] value U+FFFE reserved for this purpose. This incorrect result provides a hint to perform byte-swapping for the remaining values.

If the BOM is missing, RFC 2781 recommends{{efn|{{IETF RFC|2781}} section 4.3 says that if there is no BOM, "the text SHOULD be interpreted as being big-endian." According to section 1.2, the meaning of the term "SHOULD" is governed by {{IETF RFC|2119}}. In that document, section 3 says "... there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course".}} that big-endian (BE) encoding be assumed.
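The BOM check and the RFC 2781 fallback described above can be sketched in a few lines of Python. This is only an illustration, not a reference decoder: the function names <code>sniff_utf16_bom</code> and <code>decode_utf16</code> are hypothetical, and the sketch assumes the input is already known to be UTF-16 (it does not try to rule out UTF-32, whose little-endian BOM also begins with the bytes FF FE).

```python
import codecs


def sniff_utf16_bom(data: bytes) -> tuple:
    """Return (codec_name, bom_length) for a byte string assumed to be UTF-16.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le", 2
    # No BOM: follow the RFC 2781 recommendation and assume big-endian.
    return "utf-16-be", 0


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, dropping a leading BOM if one is present."""
    codec, bom_length = sniff_utf16_bom(data)
    return data[bom_length:].decode(codec)
```

An application that prefers the little-endian convention common on Windows would change only the final fallback line.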
In practice, due to Windows using little-endian (LE) order by default, many applications assume little-endian encoding. A common heuristic is to detect endianness by looking for null bytes, on the assumption that characters less than U+0100 are very common: if more of the bytes at even offsets (counting from 0) are null than of those at odd offsets, the text is probably big-endian; if more odd-offset bytes are null, it is probably little-endian.

The standard also allows the byte order to be stated explicitly by specifying '''UTF-16BE''' or '''UTF-16LE''' as the encoding type. When the byte order is specified explicitly this way, a BOM is specifically ''not'' supposed to be prepended to the text, and a U+FEFF at the beginning should be handled as a ZWNBSP character. Most applications ignore a BOM in all cases despite this rule.

For [[Internet]] protocols, [[Internet Assigned Numbers Authority|IANA]] has approved "UTF-16", "UTF-16BE", and "UTF-16LE" as the names for these encodings (the names are case-insensitive). The aliases '''UTF_16''' and '''UTF16''' may be meaningful in some programming languages or software applications, but they are not standard names in Internet protocols. Similar designations, '''UCS-2BE''' and '''UCS-2LE''', are used to indicate the byte order of '''UCS-2'''.
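The null-byte heuristic for BOM-less text mentioned in this section can be sketched as follows. The function name <code>guess_utf16_byte_order</code> is hypothetical, and the tie-breaking choice (falling back to big-endian, as RFC 2781 recommends) is an assumption of this sketch rather than part of the heuristic itself.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of BOM-less UTF-16 by counting null bytes.

    Hypothetical helper: assumes most characters are below U+0100, so the
    high byte of most code units is 0x00.
    """
    even_nulls = data[0::2].count(0)  # bytes at offsets 0, 2, 4, ...
    odd_nulls = data[1::2].count(0)   # bytes at offsets 1, 3, 5, ...
    if even_nulls > odd_nulls:
        return "utf-16-be"  # null high byte comes first in each unit
    if odd_nulls > even_nulls:
        return "utf-16-le"  # null high byte comes second in each unit
    # Tie or no null bytes at all: assumed fallback to the RFC 2781 default.
    return "utf-16-be"
```

The guess is only as good as its assumption: text consisting mostly of characters at or above U+0100 (for example, CJK text) contains few null bytes and may defeat it. As an example of the explicit labels in practice, Python's <code>utf-16-le</code> and <code>utf-16-be</code> codecs keep a leading U+FEFF as a character when decoding, whereas its <code>utf-16</code> codec consumes it as a BOM.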