== Floating-point number system ==

Compared with the [[fixed-point arithmetic|fixed-point number system]], the [[floating-point arithmetic|floating-point number system]] can represent a much wider range of magnitudes with the same number of digits, so it is widely used in modern computers. While the real numbers <math>\mathbb{R}</math> are infinite and continuous, a floating-point number system <math>F</math> is finite and discrete. Most real numbers therefore cannot be represented exactly in <math>F</math>, and the resulting representation error is a source of roundoff error.
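
As an illustration of representation error, the following minimal Python sketch measures the exact difference between a rational number and the floating-point value stored for it. It assumes that Python's <code>float</code> is an IEEE double-precision number, which holds on common platforms but is not guaranteed by the language; the function name <code>representation_error</code> is chosen here for illustration.

```python
from fractions import Fraction


def representation_error(x_exact: Fraction) -> Fraction:
    """Return fl(x) - x as an exact rational number.

    Assumption: float() rounds to the nearest IEEE double-precision value.
    """
    stored = float(x_exact)             # nearest representable value
    return Fraction(stored) - x_exact   # exact difference, no further rounding


# 1/10 has no finite binary expansion, so the stored value differs from it.
err_tenth = representation_error(Fraction(1, 10))
print(err_tenth)      # small positive rational: the stored value is slightly above 1/10

# 1/2 is a power of two and is stored exactly.
err_half = representation_error(Fraction(1, 2))
print(err_half == 0)  # True
```

The difference is computed with <code>fractions.Fraction</code> so that the measurement itself introduces no additional rounding.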

=== Notation of floating-point number system ===
A floating-point number system <math>F</math> is characterized by <math>4</math> integers:
*<math> \beta </math>: base or radix
*<math>p</math>: precision
*<math> [L, U] </math>: exponent range, where <math>L</math> is the lower bound and <math>U</math> is the upper bound

Any <math>x \in F</math> has the following form: 
<math display="block"> x = \pm (\underbrace{d_{0}.d_{1}d_{2}\ldots d_{p-1}}_\text{significand})_{\beta}  \times \beta ^{\overbrace{E}^\text{exponent}} = \pm \left( d_{0}\times \beta ^{E}+d_{1}\times \beta ^{E-1}+\ldots+ d_{p-1}\times \beta ^{E-(p-1)} \right)</math>
where <math>d_{i}</math> is an integer such that <math>0 \leq d_{i} \leq \beta-1</math> for <math>i = 0, 1, \ldots, p-1</math>, and <math>E</math> is an integer such that <math>L \leq E \leq U</math>.
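
The expansion above can be evaluated directly. The following Python sketch is only an illustration: the function <code>fp_value</code> and its parameter names are hypothetical, and exact rational arithmetic is used so that the evaluation itself does not round.

```python
from fractions import Fraction


def fp_value(sign, digits, exponent, beta):
    """Exact value of sign * (d0.d1...d_{p-1})_beta * beta**exponent.

    `digits` is the sequence d0, d1, ..., d_{p-1}; `sign` is +1 or -1.
    """
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    if any(not 0 <= d <= beta - 1 for d in digits):
        raise ValueError("each digit must satisfy 0 <= d <= beta - 1")
    # Sum of d_i * beta**(E - i) for i = 0, ..., p - 1.
    magnitude = sum(
        Fraction(d) * Fraction(beta) ** (exponent - i)
        for i, d in enumerate(digits)
    )
    return sign * magnitude


print(fp_value(+1, [1, 0, 1], 2, 2))   # (1.01)_2 * 2**2  -> 5
print(fp_value(-1, [1, 1], -1, 2))     # -(1.1)_2 * 2**-1 -> -3/4
```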

=== Normalized floating-point number system ===

* A floating-point number system is normalized if the leading digit <math>d_{0}</math> is always nonzero unless the number is zero.<ref name="Forrester_2018"/> Since the [[significand]] is <math>d_{0}.d_{1}d_{2}\ldots d_{p-1}</math>, the significand of a nonzero number in a normalized system satisfies <math>1 \leq \text{significand} < \beta</math>. Thus, the normalized form of a nonzero [[Institute of Electrical and Electronics Engineers|IEEE]] floating-point number is <math>\pm 1.bb \ldots b \times 2^{E}</math> where <math>b \in \{0, 1\}</math>. In binary, the leading digit of a normalized number is always <math>1</math>, so it does not need to be stored and is called the implicit bit. This gives an extra bit of precision, which reduces the roundoff error caused by representation error.
* Since a floating-point number system <math>F</math> is finite and discrete, it cannot represent all real numbers: the infinitely many real numbers can only be approximated by finitely many floating-point numbers, chosen according to a [[rounding|rounding rule]]. The floating-point approximation of a given real number <math>x</math> can be denoted by <math>fl(x)</math>.
** The total number of normalized floating-point numbers is <math display="block">2(\beta -1)\beta^{p-1} (U-L+1)+1,</math> where
*** <math>2</math> counts choice of sign, being positive or negative
*** <math>(\beta -1)</math> counts choice of the leading digit
*** <math>\beta^{p-1}</math> counts remaining significand digits
*** <math>U-L+1</math> counts choice of exponents
*** <math>1</math> counts the case when the number is <math>0</math>.
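
The count can be checked by brute force for small systems. The following Python sketch enumerates every normalized number and compares the result with the formula; the function names and the toy parameters <math>\beta = 2, p = 3, L = -1, U = 1</math> are chosen here for illustration only.

```python
from fractions import Fraction
from itertools import product


def normalized_numbers(beta, p, L, U):
    """Set of all values in a normalized system, including zero."""
    values = {Fraction(0)}
    for E in range(L, U + 1):                    # choice of exponent
        for d0 in range(1, beta):                # nonzero leading digit
            for rest in product(range(beta), repeat=p - 1):
                digits = (d0, *rest)
                magnitude = sum(
                    Fraction(d) * Fraction(beta) ** (E - i)
                    for i, d in enumerate(digits)
                )
                values.add(magnitude)            # positive sign
                values.add(-magnitude)           # negative sign
    return values


def count_formula(beta, p, L, U):
    """2 (beta - 1) beta**(p - 1) (U - L + 1) + 1."""
    return 2 * (beta - 1) * beta ** (p - 1) * (U - L + 1) + 1


toy = normalized_numbers(2, 3, -1, 1)
print(len(toy), count_formula(2, 3, -1, 1))   # 25 25
```

The two counts agree because each nonzero value has exactly one normalized representation, so the set contains no duplicates.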

=== IEEE standard ===

In the [[Institute of Electrical and Electronics Engineers|IEEE]] standard, the base is binary, i.e. <math>\beta = 2</math>, and normalization is used. The IEEE standard stores the sign, exponent, and significand in separate fields of a floating-point word, each of which has a fixed width (number of bits). The two most commonly used levels of precision for floating-point numbers are single precision and double precision.
{| class="wikitable" style="margin:1em auto"
! Precision
! Sign (bits)
! Exponent (bits)
! Trailing significand field (bits)
|-
|Single || 1 || 8 || 23 
|-
|Double || 1 || 11 || 52
|}
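
The field widths in the table can be used to split a stored number into its three fields. The following Python sketch uses the standard <code>struct</code> module to obtain the bit pattern; the function name <code>ieee_fields</code> is hypothetical. The exponent field is returned as stored, which in the IEEE formats is a biased value (bias 127 for single precision and 1023 for double precision) rather than <math>E</math> itself.

```python
import struct


def ieee_fields(x, precision="double"):
    """Return (sign, stored exponent, trailing significand) as integers."""
    if precision == "single":
        float_fmt, int_fmt, exp_bits, frac_bits = ">f", ">I", 8, 23
    elif precision == "double":
        float_fmt, int_fmt, exp_bits, frac_bits = ">d", ">Q", 11, 52
    else:
        raise ValueError("precision must be 'single' or 'double'")
    # Reinterpret the bytes of the floating-point word as an unsigned integer.
    (bits,) = struct.unpack(int_fmt, struct.pack(float_fmt, x))
    sign = bits >> (exp_bits + frac_bits)                  # 1 bit
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)  # biased exponent
    fraction = bits & ((1 << frac_bits) - 1)               # trailing significand
    return sign, exponent, fraction


print(ieee_fields(1.0))              # (0, 1023, 0)
print(ieee_fields(-2.0, "single"))   # (1, 128, 0)
```

For <code>1.0</code> the trailing significand field is zero because the leading <math>1</math> is the implicit bit and is not stored.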