{{Short description|Encoding Unicode characters as 4 bytes per code point}}
'''UTF-32''' (32-[[bit]] [[Unicode transformation format|Unicode Transformation Format]]), sometimes called UCS-4, is a fixed-length [[Character encoding|encoding]] of Unicode [[code point]]s that uses exactly 32 bits (four [[byte]]s) per code point. Because there are far fewer than 2<sup>32</sup> Unicode code points, only 21 bits are actually needed, and the leading bits of every value must be zero.<ref name="4_or_3_bytes" /> In contrast, all other Unicode transformation formats are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.
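
The equality between a UTF-32 value and its code point can be illustrated with a minimal Python sketch. The helper name <code>utf32_units</code> and the sample string are hypothetical, chosen only for this illustration; the sketch relies on Python's built-in <code>utf-32-be</code> codec, which writes big-endian values without a byte order mark.

```python
import struct


def utf32_units(text: str) -> list[int]:
    """Return the 32-bit values of `text` encoded as big-endian UTF-32."""
    # An explicit byte order ("-be") means no byte order mark is prepended.
    data = text.encode("utf-32-be")
    # Every code point occupies exactly four bytes, so the unit count is len // 4.
    return list(struct.unpack(f">{len(data) // 4}I", data))


sample = "a\u00e9\u20ac\U0001F600"  # 'a', 'é', '€', and an emoji outside the BMP
units = utf32_units(sample)

# Each 32-bit value is numerically identical to its code point ...
assert units == [ord(ch) for ch in sample]
# ... and fits in 21 bits, so the upper 11 bits are always zero.
assert all(unit < (1 << 21) for unit in units)
```

The second assertion reflects the 21-bit limit: the largest code point, U+10FFFF, is below 2<sup>21</sup>.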

The main advantage of UTF-32 is that Unicode code points can be indexed directly: finding the ''N''th code point in a sequence of code points is a [[constant time|constant-time]] operation. In contrast, a [[variable-length code]] requires [[linear time]] to count ''N'' code points from the start of the string. This makes UTF-32 a simple replacement in code that uses [[Integer|integers]] incremented by one to examine each location in a [[String (computer science)|string]], as was commonly done for [[ASCII]]. However, Unicode code points are rarely processed in complete isolation; [[combining character]] sequences and emoji, for example, often span several code points.<ref name=":0">{{Cite web |title=FAQ - UTF-8, UTF-16, UTF-32 & BOM |url=http://unicode.org/faq/utf_bom.html#utf32-2 |access-date=2022-09-04 |website=Unicode }}</ref>
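
The difference in lookup cost can be sketched in Python as follows. Both function names are hypothetical, and the sketch assumes well-formed input, a big-endian UTF-32 buffer without a byte order mark, and an index that is within range.

```python
def nth_code_point_utf32(data: bytes, n: int) -> int:
    """Constant time: the Nth code point starts at byte offset 4 * n."""
    offset = 4 * n
    return int.from_bytes(data[offset:offset + 4], "big")


def nth_code_point_utf8(data: bytes, n: int) -> int:
    """Linear time: walk from the start, skipping one whole sequence per step."""
    i = 0
    for _ in range(n):
        i += 1  # step over the lead byte
        # Continuation bytes have the bit pattern 10xxxxxx.
        while i < len(data) and (data[i] & 0xC0) == 0x80:
            i += 1
    # Find the end of the sequence that starts at i, then decode it.
    j = i + 1
    while j < len(data) and (data[j] & 0xC0) == 0x80:
        j += 1
    return ord(data[i:j].decode("utf-8"))


text = "na\u00efve \U0001F600 caf\u00e9"
as_utf32 = text.encode("utf-32-be")
as_utf8 = text.encode("utf-8")

for n, ch in enumerate(text):
    assert nth_code_point_utf32(as_utf32, n) == ord(ch)
    assert nth_code_point_utf8(as_utf8, n) == ord(ch)
```

Direct indexing yields code points, not user-perceived characters: a string such as <code>"e\u0301"</code> (the letter e followed by a combining acute accent) still occupies two UTF-32 values.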

The main disadvantage of UTF-32 is that it is space-inefficient: it uses four [[byte]]s per code point, 11 bits of which are always zero. Characters beyond the [[Basic Multilingual Plane|BMP]] are relatively rare in most texts (texts that use popular emoji being one exception) and can typically be ignored for sizing estimates. UTF-32 is therefore close to twice the size of [[UTF-16]], and up to four times the size of [[UTF-8]], depending on how many of the characters are in the [[ASCII]] subset.<ref name=":0" />
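
These size ratios can be checked with a short Python sketch. The helper name <code>encoded_sizes</code> and the sample strings are hypothetical; explicit little-endian codecs are used so that no byte order mark is counted in the totals.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded length in bytes of `text` under each encoding form."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in codecs}


# ASCII only: UTF-32 is four times the size of UTF-8 and twice that of UTF-16.
print(encoded_sizes("hello"))
# {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}

# Greek letters in the BMP: UTF-32 is twice the size of both other forms.
print(encoded_sizes("\u03b1\u03b2\u03b3"))
# {'utf-8': 6, 'utf-16-le': 6, 'utf-32-le': 12}

# A single emoji outside the BMP: all three forms use four bytes.
print(encoded_sizes("\U0001F600"))
# {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```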