=== Minimizing the effect of accuracy problems ===
Although individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a [[unit in the last place|ULP]], more complicated formulae can suffer from larger errors for a variety of reasons. The loss of accuracy can be substantial if a problem or its data are [[condition number|ill-conditioned]], meaning that the correct result is hypersensitive to tiny perturbations in its data. However, even functions that are well-conditioned can suffer from large loss of accuracy if an algorithm that is [[numerical stability|numerically unstable]] for that data is used: apparently equivalent formulations of expressions in a programming language can differ markedly in their numerical stability. One approach to remove the risk of such loss of accuracy is the design and analysis of numerically stable algorithms, which is an aim of the branch of mathematics known as [[numerical analysis]]. Another approach that can protect against the risk of numerical instabilities is the computation of intermediate (scratch) values in an algorithm at a higher precision than the final result requires,<ref name="OliveiraStewart_2006"/> which can remove, or reduce by orders of magnitude,<ref name="Kahan_2005_ARITH17"/> such risk: [[Quadruple-precision floating-point format|IEEE 754 quadruple precision]] and [[extended precision]] are designed for this purpose when computing at double precision.<ref name="Kahan_2011_Debug"/><ref group="nb" name="NB_4"/>

For example, the following algorithm is a direct implementation to compute the function {{math|{{var|A}}({{var|x}}) {{=}} ({{var|x}} − 1) / (exp({{var|x}} − 1) − 1)}}, which is well-conditioned at 1.0;<ref group="nb" name="NB_5"/> however, it can be shown to be numerically unstable and lose up to half the significant digits carried by the arithmetic when computed near 1.0.<ref name="Kahan_2001_JavaHurt"/>

<syntaxhighlight lang="c" line highlight="3,7">
double A(double X)
{
   double Y, Z;  // [1]
   Y = X - 1.0;
   Z = exp(Y);
   if (Z != 1.0)
      Z = Y / (Z - 1.0);  // [2]
   return Z;
}
</syntaxhighlight>

If, however, intermediate computations are all performed in extended precision (e.g. by setting line [1] to [[C99]] {{code|long double}}), then up to full precision in the final double result can be maintained.<ref group="nb" name="NB_6"/> Alternatively, a numerical analysis of the algorithm reveals that if the following non-obvious change to line [2] is made:

<syntaxhighlight lang="c">
Z = log(Z) / (Z - 1.0);
</syntaxhighlight>

then the algorithm becomes numerically stable and can compute to full double precision.

To maintain the properties of such carefully constructed numerically stable programs, careful handling by the [[compiler]] is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area: C99 is an example of a language where such optimizations are carefully specified to maintain numerical precision. See the external references at the bottom of this article.

A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article; the reader is referred to<ref name="Higham_2002"/><ref name="Kahan_2000_Marketing"/> and the other references at the bottom of this article.
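As a rough sketch of the two remedies just described (the function names {{code|A_extended}} and {{code|A_stable}} are illustrative, and the first variant assumes that C99 {{code|long double}} is wider than {{code|double}} on the target platform):

<syntaxhighlight lang="c">
#include <math.h>

// Remedy 1: carry the intermediate values in extended precision and round
// only the final result back to double.
double A_extended(double X)
{
   long double Y, Z;
   Y = (long double)X - 1.0L;
   Z = expl(Y);
   if (Z != 1.0L)
      Z = Y / (Z - 1.0L);
   return (double)Z;
}

// Remedy 2: the rearranged, numerically stable formulation; the error
// committed in computing Z affects log(Z) and Z - 1.0 in a correlated way,
// so it largely cancels in the quotient.
double A_stable(double X)
{
   double Y, Z;
   Y = X - 1.0;
   Z = exp(Y);
   if (Z != 1.0)
      Z = log(Z) / (Z - 1.0);
   return Z;
}
</syntaxhighlight>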
Kahan suggests several rules of thumb that can substantially decrease by orders of magnitude<ref name="Kahan_2000_Marketing"/> the risk of numerical anomalies, in addition to, or in lieu of, a more careful numerical analysis. These include: as noted above, computing all expressions and intermediate results in the highest precision supported in hardware (a common rule of thumb is to carry twice the precision of the desired result, i.e. compute in double precision for a final single-precision result, or in double extended or quad precision for up to double-precision results<ref name="Kahan_1981_WhyIEEE"/>); and rounding input data and results to only the precision required and supported by the input data (carrying excess precision in the final result beyond that required and supported by the input data can be misleading, increases storage cost and decreases speed, and the excess bits can affect convergence of numerical procedures:<ref name="Kahan_2001_LN"/> notably, the first form of the iterative example given below converges correctly when using this rule of thumb). Brief descriptions of several additional issues and techniques follow.

As decimal fractions often cannot be represented exactly in binary floating-point, such arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of a moon around Saturn or the mass of a [[proton]]), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact.<ref name="Kahan_2005_ARITH17"/><ref name="Kahan_2000_Marketing"/> An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation.<ref name="Speleotrove_2012"/> The "decimal" data type of the [[C Sharp (programming language)|C#]] and [[Python (programming language)|Python]] programming languages, and the decimal formats of the [[IEEE 754-2008]] standard, are designed to avoid the problems of binary floating-point representations when applied to human-entered exact decimal values, and to make the arithmetic always behave as expected when numbers are printed in decimal.

Expectations from mathematics may not be realized in the field of floating-point computation. For example, it is known that <math>(x+y)(x-y) = x^2-y^2\,</math> and that <math>\sin^2{\theta}+\cos^2{\theta} = 1\,</math>; however, these facts cannot be relied on when the quantities involved are the result of floating-point computation.

The use of the equality test (<code>if (x==y) ...</code>) requires care when dealing with floating-point numbers. Even simple expressions like <code>0.6/0.2-3==0</code> will, on most computers, fail to be true<ref name="Christiansen_Perl"/> (in IEEE 754 double precision, for example, <code>0.6/0.2 - 3</code> is approximately equal to {{val|-4.44089209850063e-16}}). Consequently, such tests are sometimes replaced with "fuzzy" comparisons (<code>if (abs(x-y) < epsilon) ...</code>, where epsilon is sufficiently small and tailored to the application, such as 1.0E−13).
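A minimal, self-contained C illustration of the difference (the tolerance is the example value mentioned above; the variable names and printed messages are only for demonstration):

<syntaxhighlight lang="c">
#include <math.h>
#include <stdio.h>

int main(void)
{
   double x = 0.6 / 0.2;      // mathematically exactly 3
   double epsilon = 1.0E-13;  // illustrative tolerance from the text

   // Exact test: fails on most machines, because neither 0.6 nor 0.2 is
   // exactly representable in binary and the quotient is rounded.
   if (x - 3 == 0)
      printf("exactly equal\n");

   // "Fuzzy" comparison: treats values within epsilon as equal.
   if (fabs(x - 3) < epsilon)
      printf("equal to within %g\n", epsilon);

   return 0;
}
</syntaxhighlight>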
The wisdom of doing this varies greatly, and can require numerical analysis to bound epsilon.<ref name="Higham_2002"/> Values derived from the primary data representation, and their comparisons, should be performed in a wider, extended precision to minimize the risk of such inconsistencies due to round-off errors.<ref name="Kahan_2000_Marketing"/> It is often better to organize the code in such a way that such tests are unnecessary. For example, in [[computational geometry]], exact tests of whether a point lies off or on a line or plane defined by other points can be performed using adaptive precision or exact arithmetic methods.<ref name="Shewchuk"/>

Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are [[matrix inversion]], [[eigenvector]] computation, and differential equation solving. These algorithms must be very carefully designed, using numerical approaches such as [[iterative refinement]], if they are to work well.<ref name="Kahan_1997_Cantilever"/>

Summation of a vector of floating-point values is a basic algorithm in [[Computational science|scientific computing]], and so an awareness of when loss of significance can occur is essential. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. A typical addition would then be something like

   3253.671
 +    3.141276
 -------------
   3256.812

The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000; the lost digits are not regained. The [[Kahan summation algorithm]] may be used to reduce the errors.<ref name="Higham_2002"/>

Round-off error can affect the convergence and accuracy of iterative numerical procedures. As an example, [[Archimedes]] approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. As noted above, computations may be rearranged in a way that is mathematically equivalent but less prone to error ([[numerical analysis]]).
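A minimal sketch of the compensated summation mentioned above (the function name {{code|kahan_sum}} and its parameters are illustrative, not taken from the cited sources):

<syntaxhighlight lang="c">
// Compensated (Kahan) summation: c carries the low-order digits that an
// ordinary running sum would discard when a small term is added to a
// large accumulated sum.
double kahan_sum(const double *a, int n)
{
   double sum = 0.0;
   double c = 0.0;              // running compensation for lost low-order bits
   for (int i = 0; i < n; i++) {
      double y = a[i] - c;      // apply the previous correction to the addend
      double t = sum + y;       // low-order digits of y are lost here ...
      c = (t - sum) - y;        // ... and recovered here (negated)
      sum = t;
   }
   return sum;
}
</syntaxhighlight>

Summing the same data in a single ordinary accumulator can lose several digits; the compensated version typically recovers them at the cost of a few extra operations per term.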
Two forms of the recurrence formula for Archimedes' circumscribed polygon are:{{citation needed|reason=Not obvious formulas|date=June 2016}}
* <math display="inline">t_0 = \frac{1}{\sqrt{3}}</math>
* First form: <math display="inline">t_{i+1} = \frac{\sqrt{t_i^2+1}-1}{t_i}</math>
* Second form: <math display="inline">t_{i+1} = \frac{t_i}{\sqrt{t_i^2+1}+1}</math>
* <math>\pi \sim 6 \times 2^i \times t_i</math>, converging as <math>i \rightarrow \infty</math>

Here is a computation using IEEE "double" (a significand with 53 bits of precision) arithmetic:

  i   6 × 2<sup>i</sup> × t<sub>i</sub>, first form        6 × 2<sup>i</sup> × t<sub>i</sub>, second form
 ---------------------------------------------------------
  0   '''{{Fontcolor|purple|3}}'''.4641016151377543863   '''{{Fontcolor|purple|3}}'''.4641016151377543863
  1   '''{{Fontcolor|purple|3}}'''.2153903091734710173   '''{{Fontcolor|purple|3}}'''.2153903091734723496
  2   '''{{Fontcolor|purple|3.1}}'''596599420974940120   '''{{Fontcolor|purple|3.1}}'''596599420975006733
  3   '''{{Fontcolor|purple|3.14}}'''60862151314012979   '''{{Fontcolor|purple|3.14}}'''60862151314352708
  4   '''{{Fontcolor|purple|3.14}}'''27145996453136334   '''{{Fontcolor|purple|3.14}}'''27145996453689225
  5   '''{{Fontcolor|purple|3.141}}'''8730499801259536   '''{{Fontcolor|purple|3.141}}'''8730499798241950
  6   '''{{Fontcolor|purple|3.141}}'''6627470548084133   '''{{Fontcolor|purple|3.141}}'''6627470568494473
  7   '''{{Fontcolor|purple|3.141}}'''6101765997805905   '''{{Fontcolor|purple|3.141}}'''6101766046906629
  8   '''{{Fontcolor|purple|3.14159}}'''70343230776862   '''{{Fontcolor|purple|3.14159}}'''70343215275928
  9   '''{{Fontcolor|purple|3.14159}}'''37488171150615   '''{{Fontcolor|purple|3.14159}}'''37487713536668
 10   '''{{Fontcolor|purple|3.141592}}'''9278733740748   '''{{Fontcolor|purple|3.141592}}'''9273850979885
 11   '''{{Fontcolor|purple|3.141592}}'''7256228504127   '''{{Fontcolor|purple|3.141592}}'''7220386148377
 12   '''{{Fontcolor|purple|3.1415926}}'''717412858693   '''{{Fontcolor|purple|3.1415926}}'''707019992125
 13   '''{{Fontcolor|purple|3.1415926}}'''189011456060   '''{{Fontcolor|purple|3.14159265}}'''78678454728
 14   '''{{Fontcolor|purple|3.1415926}}'''717412858693   '''{{Fontcolor|purple|3.14159265}}'''46593073709
 15   '''{{Fontcolor|purple|3.14159}}'''19358822321783   '''{{Fontcolor|purple|3.141592653}}'''8571730119
 16   '''{{Fontcolor|purple|3.1415926}}'''717412858693   '''{{Fontcolor|purple|3.141592653}}'''6566394222
 17   '''{{Fontcolor|purple|3.1415}}'''810075796233302   '''{{Fontcolor|purple|3.141592653}}'''6065061913
 18   '''{{Fontcolor|purple|3.1415926}}'''717412858693   '''{{Fontcolor|purple|3.1415926535}}'''939728836
 19   '''{{Fontcolor|purple|3.141}}'''4061547378810956   '''{{Fontcolor|purple|3.1415926535}}'''908393901
 20   '''{{Fontcolor|purple|3.14}}'''05434924008406305   '''{{Fontcolor|purple|3.1415926535}}'''900560168
 21   '''{{Fontcolor|purple|3.14}}'''00068646912273617   '''{{Fontcolor|purple|3.141592653589}}'''8608396
 22   '''{{Fontcolor|purple|3.1}}'''349453756585929919   '''{{Fontcolor|purple|3.141592653589}}'''8122118
 23   '''{{Fontcolor|purple|3.14}}'''00068646912273617   '''{{Fontcolor|purple|3.14159265358979}}'''95552
 24   '''{{Fontcolor|purple|3}}'''.2245152435345525443   '''{{Fontcolor|purple|3.14159265358979}}'''68907
 25                                                      '''{{Fontcolor|purple|3.14159265358979}}'''62246
 26                                                      '''{{Fontcolor|purple|3.14159265358979}}'''62246
 27                                                      '''{{Fontcolor|purple|3.14159265358979}}'''62246
 28                                                      '''{{Fontcolor|purple|3.14159265358979}}'''62246

The true value is '''{{Fontcolor|purple|3.14159265358979323846264338327...}}'''

While the two forms of the recurrence formula are
clearly mathematically equivalent,<ref group="nb" name="NB_7"/> the first subtracts 1 from a number extremely close to 1, leading to an increasingly problematic loss of [[significant digit]]s. As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.
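A short C sketch of this comparison (the loop bound, the guard against the first form's eventual division by zero, and the output format are only illustrative):

<syntaxhighlight lang="c">
#include <math.h>
#include <stdio.h>

int main(void)
{
   double t1 = 1.0 / sqrt(3.0);   // first form:  t <- (sqrt(t*t + 1) - 1) / t
   double t2 = t1;                // second form: t <- t / (sqrt(t*t + 1) + 1)
   double scale = 6.0;            // 6 * 2^i sides after i doublings

   for (int i = 0; i <= 28; i++) {
      printf("%2d  %.19f  %.19f\n", i, scale * t1, scale * t2);
      // The first form subtracts two nearly equal quantities and loses digits;
      // once its numerator rounds to zero, keep t1 at zero to avoid 0/0.
      if (t1 != 0.0)
         t1 = (sqrt(t1 * t1 + 1.0) - 1.0) / t1;
      t2 = t2 / (sqrt(t2 * t2 + 1.0) + 1.0);
      scale *= 2.0;
   }
   return 0;
}
</syntaxhighlight>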