== IEEE 754: floating point in modern computers {{anchor|IEEE 754}} ==
{{Main|IEEE 754}}
{{Floating-point}}
The [[Institute of Electrical and Electronics Engineers|IEEE]] standardized the computer representation for binary floating-point numbers in [[IEEE 754]] (a.k.a. IEC 60559) in 1985. This first standard is followed by almost all modern machines. It was [[IEEE 754-2008 revision|revised in 2008]]. IBM mainframes support [[IBM hexadecimal floating point|IBM's own hexadecimal floating point format]] and IEEE 754-2008 [[decimal floating point]] in addition to the IEEE 754 binary format. The [[Cray T90]] series had an IEEE version, but the [[Cray SV1|SV1]] still uses Cray floating-point format.{{Citation needed|date=July 2020}}

The standard provides for many closely related formats, differing in only a few details. Five of these formats are called ''basic formats'', and others are termed ''extended precision formats'' and ''extendable precision format''. Three formats are especially widely used in computer hardware and languages:{{Citation needed|reason=Possibly wrong for double extended: OK for hardware, but for languages? Note that in C, long double may not correspond to double extended (see 32-bit ARM and PowerPC).|date=July 2020}}
* [[Single-precision floating-point format|Single precision]] (binary32), usually used to represent the "float" [[C data types#Basic types|type in the C language]] family. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
* [[Double-precision floating-point format|Double precision]] (binary64), usually used to represent the "double" [[C data types#Basic types|type in the C language]] family. This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).
* [[Extended precision|Double extended]], also ambiguously called "extended precision" format. This is a binary format that occupies at least 79 bits (80 if the hidden/implicit bit rule is not used) and its significand has a precision of at least 64 bits (about 19 decimal digits). The [[C99]] and [[C11 (C standard revision)|C11]] standards of the C language family, in their annex F ("IEC 60559 floating-point arithmetic"), recommend such an extended format to be provided as "[[long double]]".<ref name="C99"/> A format satisfying the minimal requirements (64-bit significand precision, 15-bit exponent, thus fitting on 80 bits) is provided by the [[x86]] architecture. Often on such processors, this format can be used with "long double", though extended precision is not available with MSVC.<ref name="MSVC"/> For [[Data structure alignment|alignment]] purposes, many tools store this 80-bit value in a 96-bit or 128-bit space.<ref name="GCC"/><ref name="float_128"/> On other processors, "long double" may stand for a larger format, such as quadruple precision,<ref name="ARM_2013_AArch64"/> or just double precision, if any form of extended precision is not available.<ref name="ARM_2013_Compiler"/> Increasing the precision of the floating-point representation generally reduces the amount of accumulated [[round-off error]] caused by intermediate calculations.<ref name="Kahan_2004"/>
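The storage sizes and significand precisions of the three formats listed above can be inspected from C, since <code>float.h</code> exposes them as macros. The following is a minimal sketch, assuming a hosted C99 implementation; as noted above, the size and precision reported for <code>long double</code> vary by platform and compiler.

<syntaxhighlight lang="c">
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Storage size (bytes) and significand precision (bits), as reported by <float.h>. */
    printf("float      : %zu bytes, %d-bit significand\n", sizeof(float),       FLT_MANT_DIG);  /* commonly 4, 24 */
    printf("double     : %zu bytes, %d-bit significand\n", sizeof(double),      DBL_MANT_DIG);  /* commonly 8, 53 */
    printf("long double: %zu bytes, %d-bit significand\n", sizeof(long double), LDBL_MANT_DIG); /* platform-dependent:
                                                  64-bit significand on x86 (80-bit format, padded to 12 or 16 bytes),
                                                  53 with MSVC, 113 where long double is binary128 */
    return 0;
}
</syntaxhighlight>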
Other IEEE formats include:
* [[Decimal64 floating-point format|Decimal64]] and [[decimal128 floating-point format|decimal128]] floating-point formats. These formats (especially decimal128) are pervasive in financial transactions because, along with the [[Decimal32 floating-point format|decimal32]] format, they allow correct decimal rounding.
* [[Quadruple-precision floating-point format#IEEE 754 quadruple-precision binary floating-point format: binary128|Quadruple precision]] (binary128). This is a binary format that occupies 128 bits (16 bytes) and its significand has a precision of 113 bits (about 34 decimal digits).
* [[Half-precision floating-point format|Half precision]], also called binary16, a 16-bit floating-point value. It is used in the NVIDIA [[Cg (programming language)|Cg]] graphics language and in the OpenEXR standard (where it actually predates its introduction in the IEEE 754 standard).<ref name="OpenEXR"/><ref name="OpenEXR-half"/>
<!--In addition, some platforms use the non-IEEE "double-double" format, where the number is represented as unevaluated sum of two double-precision numbers. It can have some strange properties, unlike other formats. http://aggregate.org/NPAR/iccs2006.pdf-->

Any integer with absolute value less than 2<sup>24</sup> can be exactly represented in the single-precision format, and any integer with absolute value less than 2<sup>53</sup> can be exactly represented in the double-precision format. Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double-precision floats but only 32-bit integers.

The standard specifies some special values, and their representation: positive [[infinity]] ({{math|+∞}}), negative infinity ({{math|−∞}}), a [[negative zero]] (−0) distinct from ordinary ("positive") zero, and "not a number" values ([[NaN]]s).

Comparison of floating-point numbers, as defined by the IEEE standard, is a bit different from usual integer comparison. Negative and positive zero compare equal, and every NaN compares unequal to every value, including itself. All finite floating-point numbers are strictly smaller than {{math|+∞}} and strictly greater than {{math|−∞}}, and they are ordered in the same way as their values (in the set of real numbers).
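Both of these points, the exact representation of integers up to 2<sup>53</sup> and the comparison rules for the special values, can be observed directly in C. This sketch assumes an IEEE 754 binary64 <code>double</code> (the case on virtually all current hardware) and the C99 <code>NAN</code> and <code>INFINITY</code> macros:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    /* 2^53 is the limit up to which every integer is exact in binary64. */
    double big = 9007199254740992.0;      /* 2^53 */
    printf("%d\n", big - 1.0 == big);     /* 0: 2^53 - 1 is still represented exactly */
    printf("%d\n", big + 1.0 == big);     /* 1: 2^53 + 1 rounds back to 2^53 */

    /* Comparison behaviour of the special values. */
    double x = NAN;
    printf("%d\n", x == x);               /* 0: a NaN compares unequal even to itself */
    printf("%d\n", 0.0 == -0.0);          /* 1: positive and negative zero compare equal */
    printf("%d\n", -INFINITY < -DBL_MAX); /* 1: every finite value is greater than -infinity */
    return 0;
}
</syntaxhighlight>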
=== Internal representation ===
Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and a field for the significand, from left to right. For the [[IEEE 754]] binary formats (basic and extended) that have extant hardware implementations, they are apportioned as follows:

{| class="wikitable" style="text-align:right; border:0"
|-
!rowspan="2" |Format
!colspan="4" |Bits for the encoding<!-- Since this is about the encoding, it should be clear that the number given for the significand below excludes the implicit bit, when this is used. -->
| rowspan="8" style="background:white; border:0"|
!rowspan="2" |Exponent<br>bias
!rowspan="2" |Bits<br>precision
!rowspan="2" |Number of<br>decimal digits
|-
!Sign
!Exponent
!Significand
!Total
|-
|[[Half-precision floating-point format|Half]] (binary16)
|1
|5
|10
|16
|15
|11
|~3.3
|-
|[[Single-precision floating-point format|Single]] (binary32)
|1
|8
|23
|32
|127
|24
|~7.2
|-
|[[Double-precision floating-point format|Double]] (binary64)
|1
|11
|52
|64
|1023
|53
|~15.9
|-
|[[Extended precision#x86 extended-precision format|x86 extended]]
|1
|15
|64
|80
|16383
|64
|~19.2
|-
|[[Quadruple-precision floating-point format|Quadruple]] (binary128)
|1
|15
|112
|128
|16383
|113
|~34.0
|-
|[[Octuple-precision floating-point format|Octuple]] (binary256)
|1
|19
|236
|256
|262143
|237
|~71.3
|}

While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it. Values of all 0s in this field are reserved for the zeros and [[subnormal number]]s; values of all 1s are reserved for the infinities and NaNs. The exponent range for normal numbers is [−126, 127] for single precision, [−1022, 1023] for double, or [−16382, 16383] for quad. Normal numbers exclude subnormal values, zeros, infinities, and NaNs.

In the IEEE binary interchange formats the leading bit of a normalized significand is not actually stored in the computer datum, since it is always 1. It is called the "hidden" or "implicit" bit. Because of this, the single-precision format actually has a significand with 24 bits of precision, the double-precision format has 53, quad has 113, and octuple has 237.

For example, it was shown above that π, rounded to 24 bits of precision, has:
* sign = 0 ; ''e'' = 1 ; ''s'' = 110010010000111111011011 (including the hidden bit)

The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in the single-precision format as
* 0 10000000 10010010000111111011011 (excluding the hidden bit) = 40490FDB<ref name="IEEE-754_Analysis"/> as a [[hexadecimal]] number.

An example of a layout for [[Single-precision floating-point format|32-bit floating point]] is
[[File:Float example.svg|none]]
and the [[Double-precision floating-point format|64-bit ("double")]] layout is similar.
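The encoding of π worked out above can be reproduced with a few lines of C by copying the bits of a <code>float</code> into a 32-bit integer and splitting out the three fields. This is a sketch that assumes <code>float</code> is IEEE 754 binary32, as on essentially all current platforms:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float pi = 3.14159265358979323846f;   /* pi rounded to 24 bits of precision */
    uint32_t bits;
    memcpy(&bits, &pi, sizeof bits);      /* reinterpret the float's bit pattern */

    unsigned sign        = bits >> 31;            /* 1 bit */
    unsigned exponent    = (bits >> 23) & 0xFFu;  /* 8 bits, biased by 127 */
    unsigned significand = bits & 0x7FFFFFu;      /* 23 bits, hidden bit excluded */

    printf("hex         = %08X\n", (unsigned) bits);                              /* 40490FDB */
    printf("sign        = %u\n", sign);                                           /* 0 */
    printf("exponent    = %u (unbiased: %d)\n", exponent, (int) exponent - 127);  /* 128 (1) */
    printf("significand = %06X\n", significand);                                  /* 490FDB */
    return 0;
}
</syntaxhighlight>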