Floating-point arithmetic
== Overview ==

=== Floating-point numbers ===

A [[number representation]] specifies some way of encoding a number, usually as a string of digits.

There are several mechanisms by which strings of digits can represent numbers. In standard mathematical notation, the digit string can be of any length, and the location of the [[radix point]] is indicated by placing an explicit [[Decimal separator|"point" character]] (dot or comma) there. If the radix point is not specified, then the string implicitly represents an [[integer]] and the unstated radix point would be off the right-hand end of the string, next to the least significant digit. In [[fixed-point arithmetic|fixed-point]] systems, a position in the string is specified for the radix point. So a fixed-point scheme might use a string of 8 decimal digits with the decimal point in the middle, whereby "00012345" would represent 0001.2345.

In [[scientific notation]], the given number is scaled by a [[power of 10]], so that it lies within a specific range—typically between 1 and 10, with the radix point appearing immediately after the first digit. As a power of ten, the scaling factor is then indicated separately at the end of the number. For example, the orbital period of [[Jupiter]]'s moon [[Io (moon)|Io]] is {{val|152853.5047|fmt=commas}} seconds, a value that would be represented in standard-form scientific notation as {{val|1.528535047|e=5|fmt=commas}} seconds.

Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:
* A signed (meaning positive or negative) digit string of a given length in a given [[radix]] (or base). This digit string is referred to as the ''[[significand]]'', ''mantissa'', or ''coefficient''.<ref group="nb" name="NB_Significand"/> The length of the significand determines the ''precision'' to which numbers can be represented. The radix point position is assumed always to be somewhere within the significand—often just after or just before the most significant digit, or to the right of the rightmost (least significant) digit. This article generally follows the convention that the radix point is set just after the most significant (leftmost) digit.
* A signed integer [[exponent]] (also referred to as the ''characteristic'', or ''scale''),<ref group="nb" name="NB_Exponent"/> which modifies the magnitude of the number.

To derive the value of the floating-point number, the ''significand'' is multiplied by the ''base'' raised to the power of the ''exponent'', equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.

Using base-10 (the familiar [[Decimal representation|decimal]] notation) as an example, the number {{val|152853.5047|fmt=commas}}, which has ten decimal digits of precision, is represented as the significand {{val|1528535047|fmt=commas}} together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by {{10^|5}} to give {{val|1.528535047|e=5|fmt=commas}}, or {{val|152853.5047|fmt=commas}}. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.

Symbolically, this final value is:
<math display=block>\frac{s}{b^{\,p-1}} \times b^e,</math>
where {{mvar|s}} is the significand (ignoring any implied decimal point), {{mvar|p}} is the precision (the number of digits in the significand), {{mvar|b}} is the base (in our example, this is the number ''ten''), and {{mvar|e}} is the exponent.
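As an illustration, the formula above can be evaluated directly in Python (a minimal sketch; the function name <code>float_value</code> is chosen here for illustration and is not a standard API):

```python
def float_value(s, p, b, e):
    """Evaluate s / b**(p-1) * b**e: significand s read as an integer,
    precision p digits, base b, and exponent e."""
    return s / b ** (p - 1) * b ** e

# The example above: significand 1528535047 with ten decimal digits
# of precision, base 10, and exponent 5 gives 152853.5047.
print(float_value(1528535047, 10, 10, 5))
```

Note that the base appears only as a parameter of the formula, which is why it need not be stored with each number.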
{{anchor|Base-4|Base-8|Base-256|Base-65536}}Historically, several number bases have been used for representing floating-point numbers, with base two ([[Binary numeral system|binary]]) being the most common, followed by base ten ([[decimal floating point]]), and other less common varieties, such as base sixteen ([[hexadecimal floating point]]<ref name="Zehendner_2008"/><ref name="Beebe_2017"/><ref group="nb" name="NB_9"/>), base eight (octal floating point<ref name="Muller_2010"/><ref name="Beebe_2017"/><ref name="Savard_2018"/><ref name="Zehendner_2008"/><ref group="nb" name="NB_8"/>), base four (quaternary floating point<ref name="Parkinson_2000"/><ref name="Beebe_2017"/><ref group="nb" name="NB_11"/>), base three ([[balanced ternary floating point]]<ref name="Muller_2010"/>) and even base 256<ref name="Beebe_2017"/><ref group="nb" name="NB_12"/> and base {{val|65536|fmt=commas}}.<ref name="Lazarus_1956"/><ref group="nb" name="NB_10"/>

A floating-point number is a [[rational number]], because it can be represented as one integer divided by another; for example {{val|1.45|e=3}} is (145/100)×1000 or {{val|145000|fmt=commas}}/100. The base determines the fractions that can be represented; for instance, 1/5 cannot be represented exactly as a floating-point number using a binary base, but 1/5 can be represented exactly using a decimal base ({{val|0.2}}, or {{val|2|e=-1}}). However, 1/3 cannot be represented exactly by either binary (0.010101...) or decimal (0.333...), but in [[ternary numeral system|base 3]], it is trivial (0.1 or 1×3<sup>−1</sup>). The occasions on which infinite expansions occur [[positional notation#Infinite representations|depend on the base and its prime factors]].

<!-- Note: The following text contains information about how a number can be rounded to nearest. Such information may come too early in this article. Then, it could be more detailed: rounding can yield an increase of the exponent, and a possible overflow. Moreover, in general, one does not necessarily know the first N bits of a number, but just an approximation (this is not equivalent when the Table maker's dilemma occurs). Thus the text below might be misleading. -->

The way in which the significand (including its sign) and exponent are stored in a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision (32-bit) floating-point representation, <math>p = 24</math>, and so the significand is a string of 24 [[bit]]s. For instance, the number [[Pi|π]]'s first 33 bits are:
<math display=block>11001001\ 00001111\ 1101101\underline{0}\ 10100010\ 0.</math>
In this binary expansion, let us denote the positions from 0 (leftmost bit, or most significant bit) to 32 (rightmost bit). The 24-bit significand will stop at position 23, shown as the underlined bit {{val|0}} above. The next bit, at position 24, is called the ''round bit'' or ''rounding bit''. It is used to round the 33-bit approximation to the nearest 24-bit number (there are [[rounding#Tie-breaking|specific rules for halfway values]], which is not the case here). This bit, which is {{val|1=1}} in this example, is added to the integer formed by the leftmost 24 bits, yielding:
<math display=block>11001001\ 00001111\ 1101101\underline{1}.</math>
When this is stored in memory using the IEEE 754 encoding, this becomes the [[significand]] {{mvar|s}}. The significand is assumed to have a binary point to the right of the leftmost bit.
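The rounding step just described can be sketched in Python. The 33-bit string is the expansion of π shown above; this case is not a halfway value, so no tie-breaking rule is needed:

```python
# First 33 bits of pi's binary expansion, positions 0 to 32.
PI_BITS = "110010010000111111011010101000100"

sig = int(PI_BITS[:24], 2)     # integer formed by the leftmost 24 bits
if PI_BITS[24] == "1":         # the round bit at position 24
    sig += 1                   # round up to the nearest 24-bit value
print(format(sig, "024b"))     # the rounded 24-bit significand

# With the binary point after the leftmost bit and exponent e = 1,
# the stored value is sig / 2**23 * 2**1:
value = sig * 2.0 ** (1 - 23)
print(value)
```

The printed significand matches the rounded bit string above, and the recovered value is the single-precision approximation of π, about 3.1415927.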
So, the binary representation of π is calculated from left to right as follows:
<math display=block>\begin{align} &\left(\sum_{n=0}^{p-1} \text{bit}_n \times 2^{-n}\right) \times 2^e \\ ={} &\left(1 \times 2^{-0} + 1 \times 2^{-1} + 0 \times 2^{-2} + 0 \times 2^{-3} + 1 \times 2^{-4} + \cdots + 1 \times 2^{-23}\right) \times 2^1 \\ \approx{} &1.57079637 \times 2 \\ \approx{} &3.1415927 \end{align}</math><!-- Ensure correct rounding by taking one more digit for the intermediate decimal approximation. -->
where {{mvar|p}} is the precision ({{val|24}} in this example), {{mvar|n}} is the position of the bit of the significand from the left (starting at {{val|0}} and finishing at {{val|23}} here) and {{mvar|e}} is the exponent ({{val|1=1}} in this example).

{{anchor|Hidden bit}}It can be required that the most significant digit of the significand of a non-zero number be non-zero (except when the corresponding exponent would be smaller than the minimum one). This process is called ''normalization''. For binary formats (which use only the digits {{val|0}} and {{val|1=1}}), this non-zero digit is necessarily {{val|1=1}}. Therefore, it does not need to be represented in memory, allowing the format to have one more bit of precision. This rule is variously called the ''leading bit convention'', the ''implicit bit convention'', the ''hidden bit convention'',<ref name="Muller_2010"/> or the ''assumed bit convention''.

=== Alternatives to floating-point numbers ===

The floating-point representation is by far the most common way of representing an approximation to real numbers in computers. However, there are alternatives:
* [[Fixed-point arithmetic|Fixed-point]] representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating point, and it can be used to perform normal integer operations, too. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications.
* [[Logarithmic number system]]s (LNSs) represent a real number by the logarithm of its absolute value and a sign bit. The value distribution is similar to floating point, but the value-to-representation curve (''i.e.'', the graph of the logarithm function) is smooth (except at 0). In contrast to floating-point arithmetic, in a logarithmic number system multiplication, division and exponentiation are simple to implement, but addition and subtraction are complex. The ([[symmetric level-index arithmetic|symmetric]]) [[level-index arithmetic]] (LI and SLI) of Charles Clenshaw, [[Frank William John Olver|Frank Olver]] and Peter Turner is a scheme based on a [[generalized logarithm]] representation.
* [[Tapered floating-point representation]], used in [[Unum (number format)|Unum]].
* Some simple rational numbers (''e.g.'', 1/3 and 1/10) cannot be represented exactly in binary floating point, no matter what the precision is. Using a different radix allows one to represent some of them (''e.g.'', 1/10 in decimal floating point), but the possibilities remain limited. Software packages that perform [[fraction|rational arithmetic]] represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use "[[bignum]]" arithmetic for the individual integers.
* [[Interval arithmetic]] allows one to represent numbers as intervals and obtain guaranteed bounds on results. It is generally based on other arithmetics, in particular floating point.
* [[Computer algebra system]]s such as [[Mathematica]], [[Maxima (software)|Maxima]], and [[Maple (software)|Maple]] can often handle irrational numbers like <math>\pi</math> or <math>\sqrt{3}</math> in a completely "formal" way ([[symbolic computation]]), without dealing with a specific encoding of the significand. Such a program can evaluate expressions like "<math>\sin(3\pi)</math>" exactly, because it is programmed to process the underlying mathematics directly, instead of using approximate values for each intermediate calculation.
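The contrast between binary floating point and exact rational arithmetic can be illustrated with Python's standard <code>fractions</code> module (used here as a minimal stand-in for the "bignum"-backed rational packages mentioned above, not as any specific package):

```python
from fractions import Fraction

# 1/10 has no finite binary expansion, so binary floating point can
# only approximate it; the familiar symptom:
print(0.1 + 0.2 == 0.3)   # False: both sides are slightly off

# A rational type stores numerator and denominator exactly:
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))   # True

# 1/3 is exact as a fraction, even though its expansion terminates
# in no base whose prime factors exclude 3:
print(Fraction(1, 3) * 3 == 1)   # True
```

The trade-off is that rational arithmetic gives exactness at the cost of numerators and denominators that can grow without bound during a computation.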