==Representable numbers, conversion and rounding {{anchor|Representable numbers}}==

By their nature, all numbers expressed in floating-point format are [[rational number]]s with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). Irrational numbers, such as [[Pi|π]] or <math display=inline>\sqrt{2}</math>, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly. For example, the decimal number 123456789 cannot be exactly represented if only eight decimal digits of precision are available (it would be rounded to one of the two straddling representable values, 12345678 × 10<sup>1</sup> or 12345679 × 10<sup>1</sup>); the same applies to [[Repeating decimal|non-terminating digits]] (.{{overline|5}} must be rounded to either .55555555 or .55555556).

When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, it will require a conversion before it can be used in that implementation. If the number can be represented exactly in the floating-point format then the conversion is exact. If there is no exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value from the original, and the value thus adjusted is called the ''rounded value''.

Whether or not a rational number has a terminating expansion depends on the base. For example, in base-10 the number 1/2 has a terminating expansion (0.5) while the number 1/3 does not (0.333...). In base-2 only rationals with denominators that are powers of 2 (such as 1/2 or 3/16) are terminating. Any rational with a denominator that has a prime factor other than 2 will have an infinite binary expansion. This means that numbers that appear to be short and exact when written in decimal format may need to be approximated when converted to binary floating-point. For example, the decimal number 0.1 is not representable in binary floating-point of any finite precision; the exact binary representation would have a "1100" sequence continuing endlessly:

: ''e'' = −4; ''s'' = 1100110011001100110011001100110011...,

where, as previously, ''s'' is the significand and ''e'' is the exponent. When rounded to 24 bits this becomes

: ''e'' = −4; ''s'' = 110011001100110011001101,

which is actually 0.100000001490116119384765625 in decimal.

<!-- Edit/rearrange this if you want, but please leave the 0.1 example in. (My previous reference to pi being more "sophisticated" than 0.1 was admittedly not artful.) I have known professional software engineers (who should have known better!) who believed that numbers with short decimal representations could always be represented exactly. Putting the many f.p. fallacies/superstitions to rest is important. -->
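This rounding is visible in any language with IEEE 754 binary types. The following C sketch (illustrative only; it assumes <code>float</code> is IEEE 754 binary32, <code>double</code> is binary64, and that <code>printf</code> converts correctly rounded) prints the values actually stored for the literal 0.1:

<syntaxhighlight lang="c">
#include <stdio.h>

int main(void) {
    /* 0.1 has no finite binary expansion, so each type stores
       the nearest representable value instead. */
    float  tenth_single = 0.1f;  /* rounded to 24 significand bits */
    double tenth_double = 0.1;   /* rounded to 53 significand bits */

    printf("%.30f\n", tenth_single);  /* 0.100000001490116119384765625000 */
    printf("%.30f\n", tenth_double);  /* 0.100000000000000005551115123126 */
    return 0;
}
</syntaxhighlight>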
As a further example, the real number [[Pi|π]], represented in binary as an infinite sequence of bits is

: 11.0010010000111111011010101000100010000101101000110000100011010011...

but is

: 11.0010010000111111011011

when approximated by [[rounding]] to a precision of 24 bits. In binary single-precision floating-point, this is represented as ''s'' = 1.10010010000111111011011 with ''e'' = 1. This has a decimal value of

: '''3.141592'''7410125732421875,

whereas a more accurate approximation of the true value of π is

: '''3.14159265358979323846264338327950'''...

<!-- Before changing the above numbers, please discuss on the talk page. Giving the actual value 10 more digits than the single-precision floating-point value is plenty; more digits do not help the reader. -->

The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the [[discretization error]] and is limited by the [[machine epsilon]].

The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called a [[unit in the last place]] (ULP). For example, if there is no representable number lying between the representable numbers 1.45A70C22<sub>16</sub> and 1.45A70C24<sub>16</sub>, the ULP is 2×16<sup>−8</sup>, or 2<sup>−31</sup>. For numbers with a base-2 exponent part of 0, i.e. numbers with an absolute value greater than or equal to 1 but less than 2, a ULP is exactly 2<sup>−23</sup> or about 10<sup>−7</sup> in single precision, and exactly 2<sup>−52</sup> or about 10<sup>−16</sup> in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of a ULP.

=== Rounding modes ===

Rounding is used when the exact result of a floating-point operation (or a conversion to floating-point format) would need more digits than there are digits in the significand. IEEE 754 requires ''correct rounding'': that is, the rounded result is as if infinitely precise arithmetic was used to compute the value and then rounded (although in implementation only three extra bits are needed to ensure this). There are several different [[rounding]] schemes (or ''rounding modes''). Historically, [[truncation]] was the typical approach. Since the introduction of IEEE 754, the default method (''[[rounding|round to nearest, ties to even]]'', sometimes called Banker's Rounding) is more commonly used. This method rounds the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and gives that representation as the result.<ref group="nb" name="NB_1"/> In the case of a tie, the value that would make the significand end in an even digit is chosen. The IEEE 754 standard requires the same rounding to be applied to all fundamental algebraic operations, including square root and conversions, when there is a numeric (non-NaN) result. This means that the results of IEEE 754 operations are completely determined in all bits of the result, except for the representation of NaNs. ("Library" functions such as cosine and log are not mandated.)

Alternative rounding options are also available. IEEE 754 specifies the following rounding modes:

* round to nearest, where ties round to the nearest even digit in the required position (the default and by far the most common mode)
* round to nearest, where ties round away from zero (optional for binary floating-point and commonly used in decimal)
* round up (toward +∞; negative results thus round toward zero)
* round down (toward −∞; negative results thus round away from zero)
* round toward zero (truncation; it is similar to the common behavior of float-to-integer conversions, which convert −3.9 to −3 and 3.9 to 3)

Alternative modes are useful when the amount of error being introduced must be bounded. Applications that require a bounded error include multi-precision floating-point and [[interval arithmetic]]. The alternative rounding modes are also useful in diagnosing numerical instability: if the results of a subroutine vary substantially between rounding toward +∞ and toward −∞ then it is likely numerically unstable and affected by round-off error.<ref name="Kahan_2006_Mindless"/>
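In C99, these modes can be selected at run time through <code>&lt;fenv.h&gt;</code>. The sketch below is illustrative only: the helper name <code>sum_tenths</code> is invented for this example, the <code>FENV_ACCESS</code> pragma is not honored by all compilers, and optimizers may need flags such as <code>-frounding-math</code> to respect the mode at all. It repeats a naive summation under three rounding modes; a wide spread between the upward and downward results is the instability symptom described above:

<syntaxhighlight lang="c">
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON  /* the rounding mode changes at run time */

/* Hypothetical helper: sum 0.1 ten times under the given rounding mode.
   The literal 0.1 is itself rounded once at translation time;
   only the ten additions are affected by the mode. */
static double sum_tenths(int mode) {
    fesetround(mode);
    double s = 0.0;
    for (int i = 0; i < 10; i++)
        s += 0.1;
    fesetround(FE_TONEAREST);  /* restore the IEEE 754 default */
    return s;
}

int main(void) {
    printf("to nearest: %.17g\n", sum_tenths(FE_TONEAREST)); /* 0.99999999999999989 */
    printf("downward:   %.17g\n", sum_tenths(FE_DOWNWARD));  /* slightly below that */
    printf("upward:     %.17g\n", sum_tenths(FE_UPWARD));    /* slightly above 1 */
    return 0;
}
</syntaxhighlight>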
=== Binary-to-decimal conversion with minimal number of digits ===

Converting a double-precision binary floating-point number to a decimal string is a common operation, but an algorithm producing results that are both accurate and minimal did not appear in print until 1990, with Steele and White's Dragon4. Some of the improvements since then include:

* David M. Gay's ''dtoa.c'', a practical open-source implementation of many ideas in Dragon4.<ref name="Gay_1990"/>
* Grisu3, with a 4× speedup as it removes the use of [[bignum]]s. It must be used with a fallback, as it fails for ~0.5% of cases.<ref name="Loitsch_2010"/>
* Errol3, an always-succeeding algorithm similar to, but slower than, Grisu3. It is apparently not as good as an early-terminating Grisu with fallback.<ref name="mazong"/>
* Ryū, an always-succeeding algorithm that is faster and simpler than Grisu3.<ref name="Adams_2018"/>
* Schubfach, an always-succeeding algorithm that is based on a similar idea to Ryū, developed almost simultaneously and independently.<ref name="Giulietti"/> It performs better than Ryū and Grisu3 in certain benchmarks.<ref name="abolz"/>

Many modern language runtimes use Grisu3 with a Dragon4 fallback.<ref name="double_conversion_2020"/>

=== Decimal-to-binary conversion ===

The problem of parsing a decimal string into a binary floating-point representation is complex, with an accurate parser not appearing until Clinger's 1990 work (implemented in dtoa.c).<ref name="Gay_1990"/> Further work has likewise progressed in the direction of faster parsing.<ref name="Lemire_2021"/>
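Both directions can be combined to show what "minimal" means here. The following C sketch is an illustration only: the helper <code>shortest_digits</code> is invented for this example, it is a brute-force stand-in for algorithms such as Dragon4 or Ryū, and it assumes a correctly rounding <code>snprintf</code> and <code>strtod</code>. It searches for the fewest significant decimal digits whose parse recovers the original binary64 value exactly:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical brute-force stand-in for a shortest-digits algorithm:
   try increasing precisions until the decimal string parses back
   (decimal-to-binary conversion) to exactly the original value.
   17 significant digits always suffice for binary64. */
static int shortest_digits(double x, char *buf, size_t n) {
    for (int prec = 1; prec <= 17; prec++) {
        snprintf(buf, n, "%.*g", prec, x);
        if (strtod(buf, NULL) == x)
            return prec;
    }
    return 17;  /* not reached for finite x */
}

int main(void) {
    double samples[] = { 0.1, 1.0 / 3.0, 3.14159265358979323846 };
    char buf[64];
    for (int i = 0; i < 3; i++) {
        int d = shortest_digits(samples[i], buf, sizeof buf);
        printf("%2d digits: %s\n", d, buf);
    }
    return 0;
}
</syntaxhighlight>

A production algorithm such as Ryū computes the same shortest strings directly, without this trial-and-error loop.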