== Roundoff error caused by floating-point arithmetic ==

Even though some numbers can be represented exactly by floating-point numbers (such numbers are called '''machine numbers'''), performing floating-point arithmetic may lead to roundoff error in the final result.

=== Addition ===

Machine addition consists of lining up the radix points of the two numbers to be added, adding them, and then storing the result again as a floating-point number. The addition itself can be done in higher precision, but the result must be rounded back to the specified precision, which may lead to roundoff error.<ref name="Forrester_2018"/>

* For example, adding <math>1</math> to <math>2^{-53}</math> in IEEE double precision proceeds as follows:{{Break}}<math>\begin{align} 1.00\ldots 0 \times 2^{0} + 1.00\ldots 0 \times 2^{-53} &= 1.\underbrace{00\ldots 0}_\text{52 bits} \times 2^{0} + 0.\underbrace{00\ldots 0}_\text{52 bits}1 \times 2^{0}\\ &= 1.\underbrace{00\ldots 0}_\text{52 bits}1\times 2^{0}. \end{align}</math>{{Break}}This is saved as <math>1.\underbrace{00\ldots 0}_\text{52 bits}\times 2^{0}</math> since round-to-nearest (with ties to even) is the default rounding mode in the IEEE standard. Therefore, <math>1+2^{-53}</math> is equal to <math>1</math> in IEEE double precision, and the roundoff error is <math>2^{-53}</math>.

This example shows that roundoff error can be introduced when adding a large number and a small number: shifting the radix points of the significands to make the exponents match causes the loss of some of the less significant digits. This loss of precision may be described as '''absorption'''.<ref>{{cite book |last1=Biran |first1=Adrian B. |last2=Breiner |first2=Moshe |title=What Every Engineer Should Know About MATLAB and Simulink |date=2010 |publisher=[[CRC Press]] |publication-place=[[Boca Raton]], [[Florida]] |isbn=978-1-4398-1023-1 |pages=193–194 |chapter=5}}</ref>

Note that the addition of two floating-point numbers can also produce roundoff error when their sum is an order of magnitude greater than that of the larger of the two.

* For example, consider a normalized floating-point number system with base <math>10</math> and precision <math>2</math>. Then <math>fl(62)=6.2 \times 10^{1}</math> and <math>fl(41) = 4.1 \times 10^{1}</math>. Note that <math>62+41=103</math> but <math>fl(103)=1.0 \times 10^{2}</math>, so there is a roundoff error of <math>103-fl(103)=3</math>. This kind of error can occur alongside an absorption error in a single operation.

=== Multiplication ===

In general, the product of two ''p''-digit significands contains up to <math>2p</math> digits, so the result might not fit in the significand.<ref name="Forrester_2018"/> Thus roundoff error will be involved in the result.

* For example, consider a normalized floating-point number system with base <math>\beta=10</math> and at most <math>2</math> significand digits. Then <math>fl(77) = 7.7 \times 10^{1}</math> and <math>fl(88) = 8.8 \times 10^{1}</math>. Note that <math>77 \times 88=6776</math>, but since there are at most <math>2</math> significand digits, <math>fl(6776) = 6.7 \times 10^{3}</math> under round-by-chop. The roundoff error would be <math>6776 - fl(6776) = 6776 - 6.7 \times 10^{3}=76</math>.

=== Division ===

In general, the quotient of two ''p''-digit significands may contain more than ''p'' digits, so roundoff error will be involved in the result.

* For example, if the normalized floating-point number system above is still being used, then <math>1/3=0.333 \ldots</math> but <math>fl(1/3)=fl(0.333 \ldots)=3.3 \times 10^{-1}</math>, so the tail <math>0.333 \ldots - 3.3 \times 10^{-1}=0.00333 \ldots</math> is cut off.
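The worked examples above can be reproduced directly. The following Python sketch (illustrative only, not part of any standard) relies on Python floats being IEEE doubles and uses the built-in <code>decimal</code> module, with its <code>ROUND_DOWN</code> mode standing in for round-by-chop, to emulate the hypothetical base-10, two-digit system:

<syntaxhighlight lang="python">
from decimal import Decimal, getcontext, ROUND_DOWN

# Absorption in IEEE double precision: the 2**-53 term is lost,
# but 2**-52 (machine epsilon) is large enough to survive.
print(1.0 + 2.0**-53 == 1.0)        # True:  fl(1 + 2**-53) = 1
print(1.0 + 2.0**-52 == 1.0)        # False: 1 + 2**-52 is a machine number

# Emulate a normalized base-10 system with a 2-digit significand.
getcontext().prec = 2
getcontext().rounding = ROUND_DOWN  # round-by-chop

# Addition: the 3-digit sum 103 is rounded to 1.0e2 (error 3).
print(Decimal(62) + Decimal(41))    # 1.0E+2

# Multiplication: the 4-digit product 6776 is chopped to 6.7e3 (error 76).
print(Decimal(77) * Decimal(88))    # 6.7E+3

# Division: the infinite expansion 0.333... is cut off after 2 digits.
print(Decimal(1) / Decimal(3))      # 0.33
</syntaxhighlight>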
=== Subtraction ===

Absorption also applies to subtraction.

* For example, subtracting <math>2^{-60}</math> from <math>1</math> in IEEE double precision proceeds as follows: <math display="block">\begin{align} 1.00\ldots 0 \times 2^{0} - 1.00\ldots 0 \times 2^{-60} &= \underbrace{1.00\ldots 0}_\text{60 bits} \times 2^{0} - \underbrace{0.00\ldots 01}_\text{60 bits} \times 2^{0}\\ &= \underbrace{0.11\ldots 1}_\text{60 bits}\times 2^{0}. \end{align}</math> This is saved as <math>\underbrace{1.00\ldots 0}_\text{53 bits}\times 2^{0}</math> since round-to-nearest is used in the IEEE standard. Therefore, <math>1-2^{-60}</math> is equal to <math>1</math> in IEEE double precision, and the roundoff error is <math>-2^{-60}</math>.

The subtraction of two nearly equal numbers is called '''subtractive cancellation'''.<ref name="Forrester_2018"/> When the leading digits are cancelled, the result may be too small to be represented exactly and will just be represented as <math>0</math>.

* For example, let <math>|\epsilon| < \epsilon_\text{mach}</math>, where the second definition of machine epsilon is used here. What is the solution to <math>(1+\epsilon) - (1-\epsilon)</math>?{{Break}}It is known that <math>1+\epsilon</math> and <math>1-\epsilon</math> are nearly equal numbers, and <math>(1+\epsilon) - (1-\epsilon)=1+\epsilon-1+\epsilon=2\epsilon</math>. However, in the floating-point number system, <math>fl((1+\epsilon) - (1-\epsilon))=fl(1+\epsilon)-fl(1-\epsilon)=1-1=0</math>. Although <math>2\epsilon</math> is easily big enough to be represented, both instances of <math>\epsilon</math> have been rounded away, giving <math>0</math>.

Even with a somewhat larger <math>\epsilon</math>, the result is still significantly unreliable in typical cases. There is not much faith in the accuracy of the value because the greatest uncertainty in any floating-point number lies in its rightmost digits.

* For example, <math>1.99999 \times 10 ^{2}- 1.99998 \times 10^{2} = 0.00001\times10^{2} =1 \times 10^{-5}\times 10^{2}=1\times10^{-3}</math>. The result <math>1\times10^{-3}</math> is clearly representable, but there is not much faith in it.

This is closely related to the phenomenon of [[catastrophic cancellation]], in which the two numbers are ''known'' to be approximations.
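As above, these cancellation phenomena can be observed in any language with IEEE double-precision arithmetic. A minimal Python sketch, with <math>\epsilon = 2^{-54}</math> chosen just below the unit roundoff <math>2^{-53}</math>:

<syntaxhighlight lang="python">
# Absorption in subtraction: 1 - 2**-60 rounds back to 1.
print(1.0 - 2.0**-60 == 1.0)      # True: the roundoff error is -2**-60

# Subtractive cancellation: both operands round to 1, so the computed
# difference is 0 even though the true value 2*eps is representable.
eps = 2.0**-54
print((1.0 + eps) - (1.0 - eps))  # 0.0
print(2.0 * eps)                  # 1.1102230246251565e-16

# Cancelling the leading digits leaves only the uncertain trailing
# digits: the stored operands are already approximations of 199.999
# and 199.998, so the result is only roughly 1e-3.
print(1.99999e2 - 1.99998e2)      # close to 0.001, trailing digits are noise
</syntaxhighlight>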