== Roundoff error under different rounding rules ==

There are two common rounding rules, round-by-chop and round-to-nearest. The IEEE standard uses round-to-nearest.

* '''Round-by-chop''': The base-<math>\beta</math> expansion of <math>x</math> is truncated after the <math>(p-1)</math>-th digit.
** This rounding rule is biased because it always moves the result toward zero.
* '''Round-to-nearest''': <math>fl(x)</math> is set to the floating-point number nearest to <math>x</math>. When there is a tie, the floating-point number whose last stored digit is even is used.
** For the IEEE standard, where the base <math>\beta</math> is <math>2</math>, this means that when there is a tie the result is rounded so that the last digit is equal to <math>0</math>.
** This rounding rule is more accurate but more computationally expensive.
** Rounding so that the last stored digit is even when there is a tie ensures that ties are not systematically rounded up or down. This avoids an unwanted slow drift in long calculations that would otherwise result from a biased rounding.

The following example illustrates the level of roundoff error under the two rounding rules.<ref name="Forrester_2018"/> The round-to-nearest rule generally leads to less roundoff error.

{| class="wikitable" style="margin:1em auto"
! x
! Round-by-chop
! Roundoff Error
! Round-to-nearest
! Roundoff Error
|-
|1.649 || 1.6 || 0.049 || 1.6 || 0.049
|-
|1.650 || 1.6 || 0.050 || 1.6 || 0.050
|-
|1.651 || 1.6 || 0.051 || 1.7 || −0.049
|-
|1.699 || 1.6 || 0.099 || 1.7 || −0.001
|-
|1.749 || 1.7 || 0.049 || 1.7 || 0.049
|-
|1.750 || 1.7 || 0.050 || 1.8 || −0.050
|}

=== Calculating roundoff error in IEEE standard ===

Suppose round-to-nearest and IEEE double precision are used.
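The two rounding rules in the table above can be reproduced with Python's <code>decimal</code> module, whose <code>ROUND_DOWN</code> and <code>ROUND_HALF_EVEN</code> modes correspond to round-by-chop and round-to-nearest, respectively. A minimal sketch (the helper names are illustrative, not from the cited source):

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def round_by_chop(x: str) -> Decimal:
    # Truncate the decimal expansion after one fractional digit: biased toward zero.
    return Decimal(x).quantize(Decimal("0.1"), rounding=ROUND_DOWN)

def round_to_nearest(x: str) -> Decimal:
    # Nearest value; a tie goes to the result whose last digit is even.
    return Decimal(x).quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)

for x in ["1.649", "1.650", "1.651", "1.699", "1.749", "1.750"]:
    chop, nearest = round_by_chop(x), round_to_nearest(x)
    # Roundoff error is x minus its rounded representation.
    print(x, chop, Decimal(x) - chop, nearest, Decimal(x) - nearest)
```

Running this reproduces the table row by row; for instance, 1.750 ties between 1.7 and 1.8, and round-to-nearest-even picks 1.8 because its last digit is even.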
* Example: the decimal number <math>(9.4)_{10}=(1001.{\overline{0110}})_{2}</math> can be rearranged into <math display="block">+1.\underbrace{0010110011001100110011001100110011001100110011001100}_\text{52 bits}110 \ldots \times 2^{3}</math> Since the 53rd bit to the right of the binary point is a 1 and is followed by other nonzero bits, the round-to-nearest rule requires rounding up, that is, adding 1 to the 52nd bit. Thus, the normalized floating-point representation in the IEEE standard of 9.4 is <math display="block">fl(9.4)=1.0010110011001100110011001100110011001100110011001101 \times 2^{3}.</math>
* Now the roundoff error incurred when representing <math>9.4</math> by <math>fl(9.4)</math> can be calculated. This representation is derived by discarding the infinite tail <math display="block">0.{\overline{1100}} \times 2^{-52}\times 2^{3} = 0.{\overline{0110}} \times 2^{-51} \times 2^{3}=0.4 \times 2^{-48}</math> from the right and then adding <math>1 \times 2^{-52} \times 2^{3}=2^{-49}</math> in the rounding step.
:Then <math>fl(9.4) = 9.4-0.4 \times 2^{-48} + 2^{-49} = 9.4+(0.2)_{10} \times 2^{-49}</math>.
:Thus, the roundoff error is <math>(0.2 \times 2^{-49})_{10}</math>.

=== Measuring roundoff error by using machine epsilon ===

The machine epsilon <math>\epsilon_\text{mach}</math> can be used to measure the level of roundoff error under the two rounding rules above. Below are the formulas and a corresponding proof.<ref name="Forrester_2018"/> The first definition of machine epsilon is used here.

==== Theorem ====

# Round-by-chop: <math>\epsilon_\text{mach} = \beta^{1-p}</math>
# Round-to-nearest: <math>\epsilon_\text{mach} = \frac{1}{2}\beta^{1-p}</math>

==== Proof ====

Let <math>x=d_{0}.d_{1}d_{2} \ldots d_{p-1}d_{p} \ldots \times \beta^{n} \in \mathbb{R}</math>, where <math>n \in [L, U]</math>, and let <math>fl(x)</math> be the floating-point representation of <math>x</math>.
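The roundoff error derived above for <math>fl(9.4)</math> can be checked in Python, since a 9.4 literal is stored as an IEEE round-to-nearest double and <code>fractions.Fraction</code> converts that stored value to an exact rational:

```python
from fractions import Fraction

exact = Fraction(94, 10)   # the real number 9.4
stored = Fraction(9.4)     # the IEEE double fl(9.4), converted exactly
error = stored - exact     # roundoff error fl(9.4) - 9.4

# The derivation gives fl(9.4) - 9.4 = 0.2 * 2**-49 = 1 / (5 * 2**49).
print(error == Fraction(1, 5 * 2**49))  # True
```

The comparison is exact: no floating-point arithmetic occurs after the initial conversion, so the result confirms the error is precisely <math>0.2 \times 2^{-49}</math>.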
Consider round-by-chop first. Then <math display="block"> \begin{align} \frac{|x-fl(x)|}{|x|} &= \frac{|d_{0}.d_{1}d_{2}\ldots d_{p-1}d_{p}d_{p+1}\ldots \times \beta^{n} - d_{0}.d_{1}d_{2}\ldots d_{p-1} \times \beta^{n}|}{|d_{0}.d_{1}d_{2}\ldots \times \beta^{n}|}\\ &= \frac{|d_{p}.d_{p+1} \ldots \times \beta^{n-p}|}{|d_{0}.d_{1}d_{2}\ldots \times \beta^{n}|}\\ &= \frac{|d_{p}.d_{p+1}d_{p+2}\ldots|}{|d_{0}.d_{1}d_{2}\ldots|} \times \beta^{-p}. \end{align}</math>

To determine the maximum of this quantity, find the maximum of the numerator and the minimum of the denominator. Since <math>d_{0}\neq 0</math> (normalized system), the minimum value of the denominator is <math>1</math>. The numerator is bounded above by <math>(\beta-1).(\beta-1){\overline{(\beta-1)}}=\beta</math>. Thus, <math>\frac{|x-fl(x)|}{|x|} \leq \frac{\beta}{1} \times \beta^{-p} = \beta^{1-p}</math>. Therefore, <math>\epsilon_\text{mach}=\beta^{1-p}</math> for round-by-chop. The proof for round-to-nearest is similar: rounding moves <math>x</math> by at most half the distance between adjacent floating-point numbers, which contributes the extra factor of <math>\frac{1}{2}</math>.

* Note that the first definition of machine epsilon is not quite equivalent to the second definition when using the round-to-nearest rule, but it is equivalent for round-by-chop.
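For IEEE doubles (<math>\beta = 2</math>, <math>p = 53</math>), the theorem's round-to-nearest bound <math>\frac{1}{2}\beta^{1-p} = 2^{-53}</math> can be observed directly in Python; note that <code>sys.float_info.epsilon</code> reports the gap from 1.0 to the next double (<math>2^{-52}</math>, the first definition for round-by-chop), which is twice the round-to-nearest unit roundoff:

```python
import sys

# Gap between 1.0 and the next representable double: beta**(1-p) = 2**-52.
print(sys.float_info.epsilon == 2.0**-52)  # True

# A perturbation of 2**-53 ties exactly between 1.0 and 1.0 + 2**-52;
# round-to-nearest-even resolves the tie to 1.0 (last stored bit is 0).
print(1.0 + 2.0**-53 == 1.0)               # True

# Anything strictly larger than the unit roundoff is distinguishable from 1.0.
print(1.0 + 2.0**-52 > 1.0)                # True
```

This illustrates the note above: under round-to-nearest, relative errors are bounded by <math>2^{-53}</math> even though the gap at 1.0 is <math>2^{-52}</math>.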