== Accuracy problems == <!-- internally linked -->
The fact that floating-point numbers cannot accurately represent all real numbers, and that floating-point operations cannot accurately represent true arithmetic operations, leads to many surprising situations. This is related to the finite [[Precision (computer science)|precision]] with which computers generally represent numbers.

For example, the decimal numbers 0.1 and 0.01 cannot be represented exactly as binary floating-point numbers. In the IEEE 754 binary32 format with its 24-bit significand, the result of attempting to square the approximation to 0.1 is neither 0.01 nor the representable number closest to it. The decimal number 0.1 is represented in binary as {{math|1={{var|e}} = −4}}; {{math|1={{var|s}} = 110011001100110011001101}}, which is
{{block indent|1=0.100000001490116119384765625 exactly.}}
Squaring this number gives
{{block indent|1=0.010000000298023226097399174250313080847263336181640625 exactly.}}
Squaring it with rounding to the 24-bit precision gives
{{block indent|0.010000000707805156707763671875 exactly.}}
But the representable number closest to 0.01 is
{{block indent|0.009999999776482582092285156250 exactly.}}

Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow in the usual floating-point formats (assuming an accurate implementation of tan). It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:

<syntaxhighlight lang="c">
/* Enough digits to be sure we get the correct approximation. */
double pi = 3.1415926535897932384626433832795;
double z = tan(pi/2.0);
</syntaxhighlight>

will give a result of 16331239353195370.0. In single precision (using the <code>tanf</code> function), the result will be −22877332.0.

By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) 0.1225{{e|-15}} in double precision, or −0.8742{{e|-7}} in single precision.<ref group="nb" name="NB_3"/>

While floating-point addition and multiplication are both [[commutative]] ({{math|{{var|a}} + {{var|b}} {{=}} {{var|b}} + {{var|a}}}} and {{math|{{var|a}} × {{var|b}} {{=}} {{var|b}} × {{var|a}}}}), they are not necessarily [[Associative property|associative]]. That is, {{math|({{var|a}} + {{var|b}}) + {{var|c}}}} is not necessarily equal to {{math|{{var|a}} + ({{var|b}} + {{var|c}})}}. Using 7-digit significand decimal arithmetic:

 a = 1234.567, b = 45.67834, c = 0.0004
 
 (a + b) + c:
     1234.567   (a)
   +   45.67834 (b)
   ____________
     1280.24534   rounds to   1280.245
 
     1280.245  (a + b)
   +    0.0004 (c)
   ____________
     1280.2454   rounds to   '''1280.245'''  ← (a + b) + c
 
 a + (b + c):
       45.67834 (b)
   +    0.0004  (c)
   ____________
       45.67874
 
     1234.567   (a)
   +   45.67874 (b + c)
   ____________
     1280.24574   rounds to   '''1280.246'''  ← a + (b + c)
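The same non-associativity appears in binary floating-point arithmetic. The following minimal C sketch (an illustration added here, not a canonical example; it assumes the compiler preserves IEEE 754 evaluation order, i.e. no "fast math" reassociation) adds the same three doubles in two groupings and obtains two different results:

<syntaxhighlight lang="c">
#include <stdio.h>

int main(void)
{
    /* Adding a small value to a much larger one can lose it entirely,
       so the grouping of the additions changes the result. */
    double a = 1.0e16, b = -1.0e16, c = 1.0;

    double left  = (a + b) + c;   /* (1e16 - 1e16) + 1.0  ->  1.0 */
    double right = a + (b + c);   /* 1e16 + (-1e16 + 1.0) ->  0.0; the 1.0 is absorbed */

    printf("(a + b) + c = %g\n", left);
    printf("a + (b + c) = %g\n", right);
    return 0;
}
</syntaxhighlight>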
They are also not necessarily [[distributive property|distributive]]. That is, {{math|({{var|a}} + {{var|b}}) × {{var|c}}}} may not be the same as {{math|{{var|a}} × {{var|c}} + {{var|b}} × {{var|c}}}}:

 1234.567 × 3.333333 = 4115.223
 1.234567 × 3.333333 = 4.115223
                       4115.223 + 4.115223 = 4119.338
 but
 1234.567 + 1.234567 = 1235.802
                       1235.802 × 3.333333 = 4119.340

In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:
{{ulist
|1= [[Catastrophic cancellation|Cancellation]]: subtraction of nearly equal operands may cause extreme loss of accuracy.<ref name="Harris"/><ref name="Sierra_1962"/> When we subtract two almost equal numbers we set the most significant digits to zero, leaving ourselves with just the insignificant, and most erroneous, digits.<ref name="Muller_2010"/>{{rp|124}} For example, when determining a [[derivative]] of a function the following formula is used:
<math display=block>Q(h) = \frac{f(a + h) - f(a)}{h}.</math>
Intuitively one would want an {{math|{{var|h}}}} very close to zero; however, when using floating-point operations, the smallest number will not give the best approximation of a derivative. As {{math|{{var|h}}}} grows smaller, the difference between {{math|{{var|f}}({{var|a}} + {{var|h}})}} and {{math|{{var|f}}({{var|a}})}} grows smaller, cancelling out the most significant and least erroneous digits and making the most erroneous digits more important. As a result, the smallest possible value of {{math|{{var|h}}}} will give a more erroneous approximation of the derivative than a somewhat larger value. This is perhaps the most common and serious accuracy problem, and is illustrated in the sketch following this list.
|2=Conversions to integer are not intuitive: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than round. [[Floor and ceiling functions]] may produce answers which are off by one from the intuitively expected value.
|3=Limited exponent range: results might overflow yielding infinity, or underflow yielding a [[subnormal number]] or zero. In these cases precision will be lost.
|4=Testing for [[Division by zero#Computer arithmetic|safe division]] is problematic: checking that the divisor is not zero does not guarantee that a division will not overflow.
|5=Testing for equality is problematic. Two computational sequences that are mathematically equal may well produce different floating-point values.<ref name="Barker"/>
}}
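The cancellation in the difference quotient {{math|{{var|Q}}({{var|h}})}} above can be observed directly. The following C sketch (illustrative only; the choice of {{math|1={{var|f}} = sin}} and {{math|1={{var|a}} = 1}} is arbitrary) compares the computed quotient against the exact derivative cos(1) as {{math|{{var|h}}}} shrinks; the error first decreases and then grows again once cancellation dominates:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double a = 1.0;
    const double exact = cos(a);   /* exact derivative of sin at a */

    /* As h shrinks, the truncation error of the difference quotient falls,
       but the cancellation in sin(a + h) - sin(a) grows. */
    for (int k = 1; k <= 15; k++) {
        double h = pow(10.0, -k);
        double q = (sin(a + h) - sin(a)) / h;
        printf("h = 1e-%02d   |Q(h) - cos(1)| = %.3e\n", k, fabs(q - exact));
    }
    return 0;
}
</syntaxhighlight>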
=== Incidents ===
* On 25 February 1991, a [[loss of significance]] in a [[MIM-104 Patriot]] missile battery [[MIM-104 Patriot#Failure at Dhahran|prevented it from intercepting]] an incoming [[Al Hussein (missile)|Scud]] missile in [[Dhahran]], [[Saudi Arabia]], contributing to the death of 28 soldiers from the U.S. Army's [[14th Quartermaster Detachment]].<ref name="GAO report IMTEC 92-26"/> The error was actually introduced by a [[Fixed-point arithmetic|fixed-point]] computation,<ref name="Skeel"/> but the underlying issue would have been the same with floating-point arithmetic.
* {{Clarify|date=November 2024|reason=It is not clear how this is an incident (the section title may have to be modified to cover more than incidents) and how this is due to floating-point arithmetic (rather than number approximations in general). The term 'invisible' may also be misleading without following explanations.|text=[[Salami slicing tactics#Financial schemes|Salami slicing]] is the practice of removing the 'invisible' part of a transaction into a separate account.}}

=== Machine precision and backward error analysis ===
''Machine precision'' is a quantity that characterizes the accuracy of a floating-point system, and is used in [[error analysis (mathematics)#Error analysis in numerical modeling|backward error analysis]] of floating-point algorithms. It is also known as unit roundoff or ''[[machine epsilon]]''. Usually denoted {{math|{{var|Ε}}<sub>mach</sub>}}, its value depends on the particular rounding being used.

With rounding to zero,
<math display=block>\Epsilon_\text{mach} = B^{1-P},\,</math>
whereas with rounding to nearest,
<math display=block>\Epsilon_\text{mach} = \tfrac{1}{2} B^{1-P},</math>
where {{var|B}} is the base of the system and {{var|P}} is the precision of the significand (in base {{var|B}}).

This is important since it bounds the ''[[relative error]]'' in representing any non-zero real number {{math|{{var|x}}}} within the normalized range of a floating-point system:
<math display=block>\left| \frac{\operatorname{fl}(x) - x}{x} \right| \le \Epsilon_\text{mach}.</math>

Backward error analysis, the theory of which was developed and popularized by [[James H. Wilkinson]], can be used to establish that an algorithm implementing a numerical function is numerically stable.<ref name="RalstonReilly2003"/> The basic approach is to show that although the calculated result, due to roundoff errors, will not be exactly correct, it is the exact solution to a nearby problem with slightly perturbed input data. If the perturbation required is small, on the order of the uncertainty in the input data, then the results are in some sense as accurate as the data "deserves". The algorithm is then defined as ''[[numerical stability#Forward, backward, and mixed stability|backward stable]]''. Stability is a measure of the sensitivity to rounding errors of a given numerical procedure; by contrast, the [[condition number]] of a function for a given problem indicates the inherent sensitivity of the function to small perturbations in its input and is independent of the implementation used to solve the problem.<ref name="Einarsson_2005"/>

As a trivial example, consider a simple expression giving the inner product of (length two) vectors <math>x</math> and <math>y</math>:
<math display=block>\begin{align}
\operatorname{fl}(x \cdot y) &= \operatorname{fl}\big(\operatorname{fl}(x_1 \cdot y_1) + \operatorname{fl}(x_2 \cdot y_2)\big), && \text{ where } \operatorname{fl}() \text{ indicates correctly rounded floating-point arithmetic} \\
&= \operatorname{fl}\big((x_1 \cdot y_1)(1 + \delta_1) + (x_2 \cdot y_2)(1 + \delta_2)\big), && \text{ where } \delta_n \leq \Epsilon_\text{mach}, \text{ from above} \\
&= \big((x_1 \cdot y_1)(1 + \delta_1) + (x_2 \cdot y_2)(1 + \delta_2)\big)(1 + \delta_3) \\
&= (x_1 \cdot y_1)(1 + \delta_1)(1 + \delta_3) + (x_2 \cdot y_2)(1 + \delta_2)(1 + \delta_3),
\end{align}</math>
and so
<math display=block>\operatorname{fl}(x \cdot y) = \hat{x} \cdot \hat{y},</math>
where
<math display=block>\begin{align}
\hat{x}_1 &= x_1(1 + \delta_1); & \hat{x}_2 &= x_2(1 + \delta_2);\\
\hat{y}_1 &= y_1(1 + \delta_3); & \hat{y}_2 &= y_2(1 + \delta_3),\\
\end{align}</math>
with
<math display=block>\delta_n \leq \Epsilon_\text{mach}</math>
by definition. The computed result is therefore the exact inner product of input data perturbed on the order of Ε<sub>mach</sub>, and so the computation is backward stable.
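As a rough check of these definitions, the following C sketch (an added illustration; it assumes IEEE 754 binary64 <code>double</code>, for which {{math|1={{var|B}} = 2}} and {{math|1={{var|P}} = 53}}) searches for the smallest power of two that still changes 1.0 when added to it, and compares the result with <code>DBL_EPSILON</code> from <code>float.h</code>, which equals {{math|{{var|B}}<sup>1−{{var|P}}</sup>}}:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* volatile keeps intermediates in double, guarding against
       platforms that would otherwise use extra register precision. */
    volatile double eps = 1.0, one_plus_eps;

    do {
        eps /= 2.0;
        one_plus_eps = 1.0 + eps;
    } while (one_plus_eps > 1.0);
    eps *= 2.0;                    /* last value that still changed 1.0 */

    printf("computed    = %g\n", eps);          /* 2^-52, about 2.22045e-16 */
    printf("DBL_EPSILON = %g\n", DBL_EPSILON);  /* B^(1-P) for binary64 */
    return 0;
}
</syntaxhighlight>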
For more realistic examples in [[numerical linear algebra]], see Higham 2002<ref name="Higham_2002"/> and other references below.

=== Minimizing the effect of accuracy problems ===
Although individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a [[unit in the last place|ULP]], more complicated formulae can suffer from larger errors for a variety of reasons. The loss of accuracy can be substantial if a problem or its data are [[condition number|ill-conditioned]], meaning that the correct result is hypersensitive to tiny perturbations in its data. However, even functions that are well-conditioned can suffer from large loss of accuracy if an algorithm [[numerical stability|numerically unstable]] for that data is used: apparently equivalent formulations of expressions in a programming language can differ markedly in their numerical stability. One approach to remove the risk of such loss of accuracy is the design and analysis of numerically stable algorithms, which is an aim of the branch of mathematics known as [[numerical analysis]]. Another approach that can protect against the risk of numerical instabilities is the computation of intermediate (scratch) values in an algorithm at a higher precision than the final result requires,<ref name="OliveiraStewart_2006"/> which can remove, or reduce by orders of magnitude,<ref name="Kahan_2005_ARITH17"/> such risk: [[Quadruple-precision floating-point format|IEEE 754 quadruple precision]] and [[extended precision]] are designed for this purpose when computing at double precision.<ref name="Kahan_2011_Debug"/><ref group="nb" name="NB_4"/>

For example, the following algorithm is a direct implementation to compute the function {{math|{{var|A}}({{var|x}}) {{=}} ({{var|x}} − 1) / (exp({{var|x}} − 1) − 1)}}, which is well-conditioned at 1.0;<ref group="nb" name="NB_5"/> however, it can be shown to be numerically unstable and to lose up to half the significant digits carried by the arithmetic when computed near 1.0.<ref name="Kahan_2001_JavaHurt"/>

<syntaxhighlight lang="c" line highlight="3,7">
double A(double X)
{
   double Y, Z;  // [1]
   Y = X - 1.0;
   Z = exp(Y);
   if (Z != 1.0)
      Z = Y / (Z - 1.0); // [2]
   return Z;
}
</syntaxhighlight>

If, however, intermediate computations are all performed in extended precision (e.g. by setting line [1] to [[C99]] {{code|long double}}), then up to full precision in the final double result can be maintained.<ref group="nb" name="NB_6"/> Alternatively, a numerical analysis of the algorithm reveals that if the following non-obvious change to line [2] is made:

<syntaxhighlight lang="c">
Z = log(Z) / (Z - 1.0);
</syntaxhighlight>

then the algorithm becomes numerically stable and can compute to full double precision.
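A sketch of the extended-precision variant described above (hypothetical code added for illustration; it helps only on targets where {{code|long double}} actually carries more precision than {{code|double}}, which the C standard does not guarantee) simply widens the intermediates of line [1]:

<syntaxhighlight lang="c">
#include <math.h>

/* Same algorithm as A above, but with the intermediates of line [1]
   declared long double; the final result is rounded back to double. */
double A_ext(double X)
{
   long double Y, Z;          // [1] intermediates in extended precision
   Y = (long double)X - 1.0L;
   Z = expl(Y);
   if (Z != 1.0L)
      Z = Y / (Z - 1.0L);     // [2]
   return (double)Z;
}
</syntaxhighlight>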
To maintain the properties of such carefully constructed numerically stable programs, careful handling by the [[compiler]] is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area: C99 is an example of a language where such optimizations are carefully specified to maintain numerical precision. See the external references at the bottom of this article.

A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article, and the reader is referred to the references cited here<ref name="Higham_2002"/><ref name="Kahan_2000_Marketing"/> and at the bottom of this article. Kahan suggests several rules of thumb that can substantially decrease by orders of magnitude<ref name="Kahan_2000_Marketing"/> the risk of numerical anomalies, in addition to, or in lieu of, a more careful numerical analysis. These include: as noted above, computing all expressions and intermediate results in the highest precision supported in hardware (a common rule of thumb is to carry twice the precision of the desired result, i.e. compute in double precision for a final single-precision result, or in double extended or quad precision for up to double-precision results<ref name="Kahan_1981_WhyIEEE"/>); and rounding input data and results to only the precision required and supported by the input data (carrying excess precision in the final result beyond that required and supported by the input data can be misleading, increases storage cost and decreases speed, and the excess bits can affect convergence of numerical procedures:<ref name="Kahan_2001_LN"/> notably, the first form of the iterative example given below converges correctly when using this rule of thumb). Brief descriptions of several additional issues and techniques follow.

As decimal fractions can often not be exactly represented in binary floating-point, such arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of a moon around Saturn or the mass of a [[proton]]), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact.<ref name="Kahan_2005_ARITH17"/><ref name="Kahan_2000_Marketing"/> An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation.<ref name="Speleotrove_2012"/> The "decimal" data type of the [[C Sharp (programming language)|C#]] and [[Python (programming language)|Python]] programming languages, and the decimal formats of the [[IEEE 754-2008]] standard, are designed to avoid the problems of binary floating-point representations when applied to human-entered exact decimal values, and to make the arithmetic always behave as expected when numbers are printed in decimal.

Expectations from mathematics may not be realized in the field of floating-point computation. For example, it is known that <math>(x+y)(x-y) = x^2-y^2\,</math> and that <math>\sin^2{\theta}+\cos^2{\theta} = 1\,</math>; however, these facts cannot be relied on when the quantities involved are the result of floating-point computation.

The use of the equality test (<code>if (x==y) ...</code>) requires care when dealing with floating-point numbers. Even simple expressions like <code>0.6/0.2-3==0</code> will, on most computers, fail to be true<ref name="Christiansen_Perl"/> (in IEEE 754 double precision, for example, <code>0.6/0.2 - 3</code> is approximately equal to {{val|-4.44089209850063e-16}}). Consequently, such tests are sometimes replaced with "fuzzy" comparisons (<code>if (abs(x-y) < epsilon) ...</code>, where epsilon is sufficiently small and tailored to the application, such as 1.0E−13). The wisdom of doing this varies greatly, and can require numerical analysis to bound epsilon.<ref name="Higham_2002"/>
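The following C sketch (added for illustration; the tolerance 1.0E−13 is the arbitrary, application-dependent value mentioned above) shows both the exact test failing and a "fuzzy" comparison succeeding for <code>0.6/0.2 - 3</code>:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 0.6 / 0.2 - 3.0;    /* mathematically 0, but not in binary floating point */

    if (x == 0.0)
        printf("exact test:  equal\n");         /* not taken on most systems */
    else
        printf("exact test:  x = %.17g\n", x);  /* about -4.4408920985006262e-16 */

    const double epsilon = 1.0e-13;             /* application-chosen tolerance */
    if (fabs(x - 0.0) < epsilon)
        printf("fuzzy test:  equal to within %g\n", epsilon);
    return 0;
}
</syntaxhighlight>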
Values derived from the primary data representation and their comparisons should be performed in a wider, extended, precision to minimize the risk of such inconsistencies due to round-off errors.<ref name="Kahan_2000_Marketing"/> It is often better to organize the code in such a way that such tests are unnecessary. For example, in [[computational geometry]], exact tests of whether a point lies off or on a line or plane defined by other points can be performed using adaptive precision or exact arithmetic methods.<ref name="Shewchuk"/>

Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are [[matrix inversion]], [[eigenvector]] computation, and differential equation solving. These algorithms must be very carefully designed, using numerical approaches such as [[iterative refinement]], if they are to work well.<ref name="Kahan_1997_Cantilever"/>

Summation of a vector of floating-point values is a basic algorithm in [[Computational science|scientific computing]], and so an awareness of when loss of significance can occur is essential. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. A typical addition would then be something like

  3253.671
 +    3.141276
 -----------
  3256.812

The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000; the lost digits are not regained. The [[Kahan summation algorithm]] may be used to reduce the errors.<ref name="Higham_2002"/>
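A compensated-summation sketch in C follows (added for illustration; the addend and count are arbitrary, and the compensation is preserved only if the compiler does not reassociate, i.e. no "fast math"). The correction variable <code>c</code> recovers, at each step, the low-order part of the addend that the running sum could not absorb:

<syntaxhighlight lang="c">
#include <stdio.h>

int main(void)
{
    const double term = 3.141592653589793;
    const int n = 10000000;

    /* Naive summation: once the running sum dwarfs each addend,
       the low-order digits of every addition are lost. */
    double naive = 0.0;
    for (int i = 0; i < n; i++)
        naive += term;

    /* Kahan (compensated) summation. */
    double sum = 0.0, c = 0.0;
    for (int i = 0; i < n; i++) {
        double y = term - c;       /* corrected addend */
        double t = sum + y;        /* low digits of y may be lost here */
        c = (t - sum) - y;         /* recover what was lost */
        sum = t;
    }

    printf("naive       = %.9f\n", naive);
    printf("compensated = %.9f\n", sum);
    printf("n * term    = %.9f\n", (double)n * term);
    return 0;
}
</syntaxhighlight>

On typical IEEE 754 systems the naive sum drifts away from {{math|{{var|n}} × term}} in the trailing digits, while the compensated sum matches it far more closely; the exact printed digits are platform-dependent.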
Round-off error can affect the convergence and accuracy of iterative numerical procedures. As an example, [[Archimedes]] approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. As noted above, computations may be rearranged in a way that is mathematically equivalent but less prone to error ([[numerical analysis]]). Two forms of the recurrence formula for the circumscribed polygon are:{{citation needed|reason=Not obvious formulas|date=June 2016}}
* <math display="inline">t_0 = \frac{1}{\sqrt{3}}</math>
* First form: <math display=inline>t_{i+1} = \frac{\sqrt{t_i^2+1}-1}{t_i}</math>
* Second form: <math display=inline>t_{i+1} = \frac{t_i}{\sqrt{t_i^2+1}+1}</math>
* <math>\pi \sim 6 \times 2^i \times t_i</math>, converging as <math>i \rightarrow \infty</math>

Here is a computation using IEEE "double" (a significand with 53 bits of precision) arithmetic:

  i   6 × 2<sup>i</sup> × t<sub>i</sub>, first form          6 × 2<sup>i</sup> × t<sub>i</sub>, second form
 ---------------------------------------------------------
  0   '''{{Fontcolor|purple|3}}'''.4641016151377543863   '''{{Fontcolor|purple|3}}'''.4641016151377543863
  1   '''{{Fontcolor|purple|3}}'''.2153903091734710173   '''{{Fontcolor|purple|3}}'''.2153903091734723496
  2   '''{{Fontcolor|purple|3.1}}'''596599420974940120   '''{{Fontcolor|purple|3.1}}'''596599420975006733
  3   '''{{Fontcolor|purple|3.14}}'''60862151314012979   '''{{Fontcolor|purple|3.14}}'''60862151314352708
  4   '''{{Fontcolor|purple|3.14}}'''27145996453136334   '''{{Fontcolor|purple|3.14}}'''27145996453689225
  5   '''{{Fontcolor|purple|3.141}}'''8730499801259536   '''{{Fontcolor|purple|3.141}}'''8730499798241950
  6   '''{{Fontcolor|purple|3.141}}'''6627470548084133   '''{{Fontcolor|purple|3.141}}'''6627470568494473
  7   '''{{Fontcolor|purple|3.141}}'''6101765997805905   '''{{Fontcolor|purple|3.141}}'''6101766046906629
  8   '''{{Fontcolor|purple|3.14159}}'''70343230776862   '''{{Fontcolor|purple|3.14159}}'''70343215275928
  9   '''{{Fontcolor|purple|3.14159}}'''37488171150615   '''{{Fontcolor|purple|3.14159}}'''37487713536668
 10   '''{{Fontcolor|purple|3.141592}}'''9278733740748   '''{{Fontcolor|purple|3.141592}}'''9273850979885
 11   '''{{Fontcolor|purple|3.141592}}'''7256228504127   '''{{Fontcolor|purple|3.141592}}'''7220386148377
 12   '''{{Fontcolor|purple|3.1415926}}'''717412858693   '''{{Fontcolor|purple|3.1415926}}'''707019992125
 13   '''{{Fontcolor|purple|3.1415926}}'''189011456060   '''{{Fontcolor|purple|3.14159265}}'''78678454728
 14   '''{{Fontcolor|purple|3.1415926}}'''717412858693   '''{{Fontcolor|purple|3.14159265}}'''46593073709
 15   '''{{Fontcolor|purple|3.14159}}'''19358822321783   '''{{Fontcolor|purple|3.141592653}}'''8571730119
 16   '''{{Fontcolor|purple|3.1415926}}'''717412858693   '''{{Fontcolor|purple|3.141592653}}'''6566394222
 17   '''{{Fontcolor|purple|3.1415}}'''810075796233302   '''{{Fontcolor|purple|3.141592653}}'''6065061913
 18   '''{{Fontcolor|purple|3.1415926}}'''717412858693   '''{{Fontcolor|purple|3.1415926535}}'''939728836
 19   '''{{Fontcolor|purple|3.141}}'''4061547378810956   '''{{Fontcolor|purple|3.1415926535}}'''908393901
 20   '''{{Fontcolor|purple|3.14}}'''05434924008406305   '''{{Fontcolor|purple|3.1415926535}}'''900560168
 21   '''{{Fontcolor|purple|3.14}}'''00068646912273617   '''{{Fontcolor|purple|3.141592653589}}'''8608396
 22   '''{{Fontcolor|purple|3.1}}'''349453756585929919   '''{{Fontcolor|purple|3.141592653589}}'''8122118
 23   '''{{Fontcolor|purple|3.14}}'''00068646912273617   '''{{Fontcolor|purple|3.14159265358979}}'''95552
 24   '''{{Fontcolor|purple|3}}'''.2245152435345525443   '''{{Fontcolor|purple|3.14159265358979}}'''68907
 25                                                      '''{{Fontcolor|purple|3.14159265358979}}'''62246
 26                                                      '''{{Fontcolor|purple|3.14159265358979}}'''62246
 27                                                      '''{{Fontcolor|purple|3.14159265358979}}'''62246
 28                                                      '''{{Fontcolor|purple|3.14159265358979}}'''62246

The true value is '''{{Fontcolor|purple|3.14159265358979323846264338327...}}'''
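The table above can be reproduced with a short program such as the following C sketch (added for illustration; only the first 25 rows are generated here, the digits printed beyond about 17 are artifacts of the binary representation, and the exact output may vary slightly with the platform's C library):

<syntaxhighlight lang="c">
#include <stdio.h>
#include <math.h>

int main(void)
{
    double t1 = 1.0 / sqrt(3.0);   /* t0 for the first form  */
    double t2 = t1;                /* t0 for the second form */
    double scale = 6.0;            /* 6 * 2^i, doubled each iteration */

    for (int i = 0; i <= 24; i++) {
        printf("%2d  %.19f  %.19f\n", i, scale * t1, scale * t2);
        t1 = (sqrt(t1 * t1 + 1.0) - 1.0) / t1;   /* subtracts 1 from a value near 1 */
        t2 = t2 / (sqrt(t2 * t2 + 1.0) + 1.0);   /* algebraically equal, no cancellation */
        scale *= 2.0;
    }
    return 0;
}
</syntaxhighlight>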
While the two forms of the recurrence formula are clearly mathematically equivalent,<ref group="nb" name="NB_7"/> the first subtracts 1 from a number extremely close to 1, leading to an increasingly problematic loss of [[significant digit]]s. As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.

=== "Fast math" optimization ===
The aforementioned lack of [[Associative property|associativity]] of floating-point operations in general means that [[compilers]] cannot as effectively reorder arithmetic expressions as they could with integer and fixed-point arithmetic, presenting a roadblock in optimizations such as [[common subexpression elimination]] and auto-[[Single instruction, multiple data|vectorization]].<ref name="Vectorizers"/> The "fast math" option on many compilers (ICC, GCC, Clang, MSVC...) turns on reassociation along with unsafe assumptions such as a lack of NaN and infinite numbers in IEEE 754. Some compilers also offer more granular options to only turn on reassociation. In either case, the programmer is exposed to many of the precision pitfalls mentioned above for the portion of the program using "fast" math.<ref name="FPM"/>

In some compilers (GCC and Clang), turning on "fast" math may cause the program to [[Subnormal_number#Disabling_subnormal_floats_at_the_code_level|disable subnormal floats]] at startup, affecting the floating-point behavior of not only the generated code, but also any program using such code as a [[Library (computing)|library]].<ref name="harmful"/>

In most [[Fortran]] compilers, as allowed by the ISO/IEC 1539-1:2004 Fortran standard, reassociation is the default, with breakage largely prevented by the "protect parens" setting (also on by default). This setting stops the compiler from reassociating beyond the boundaries of parentheses.<ref name="Gen"/> [[Intel Fortran Compiler]] is a notable outlier.<ref name="zheevd"/>

A common problem in "fast" math is that subexpressions may not be optimized identically from place to place, leading to unexpected differences. One interpretation of the issue is that "fast" math as implemented currently has poorly defined semantics. One attempt at formalizing "fast" math optimizations is seen in ''Icing'', a verified compiler.<ref name="Becker-Darulova-Myreen-Tatlock_2019"/>
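As a small illustration of why reassociation is value-changing (an added sketch; {{code|-ffast-math}} is the GCC and Clang spelling of the option, other compilers use different names), the two expressions below are mathematically equal but evaluate differently under IEEE 754 semantics, so a reassociating compiler may silently change which answer a program produces:

<syntaxhighlight lang="c">
#include <stdio.h>

int main(void)
{
    double big = 1.0e16, small = 1.0;

    /* Under IEEE 754 semantics these differ: in r1 the small addend is
       absorbed by the large one; in r2 it survives.  With "fast math"
       reassociation the compiler may rewrite one into the other. */
    double r1 = (big + small) - big;   /* 0.0 */
    double r2 = small + (big - big);   /* 1.0 */

    printf("r1 = %g, r2 = %g\n", r1, r2);
    return 0;
}
</syntaxhighlight>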