== Floating-point operations ==
For ease of presentation and understanding, decimal [[radix]] with 7-digit precision will be used in the examples, as in the IEEE 754 ''decimal32'' format. The fundamental principles are the same in any radix or precision, except that in decimal formats such as ''decimal32'' normalization is optional (it does not affect the numerical value of the result). Here, ''s'' denotes the significand and ''e'' denotes the exponent.

=== Addition and subtraction ===
A simple method to add floating-point numbers is to first represent them with the same exponent. In the example below, the second number (with the smaller exponent) is shifted right by three digits, and one then proceeds with the usual addition method:

   123456.7 = 1.234567 × 10^5
   101.7654 = 1.017654 × 10^2 = 0.001017654 × 10^5

   Hence:
   123456.7 + 101.7654 = (1.234567 × 10^5) + (1.017654 × 10^2)
                       = (1.234567 × 10^5) + (0.001017654 × 10^5)
                       = (1.234567 + 0.001017654) × 10^5
                       =  1.235584654 × 10^5

In detail:

   e=5;  s=1.234567     (123456.7)
 + e=2;  s=1.017654     (101.7654)

   e=5;  s=1.234567
 + e=5;  s=0.001017654  (after shifting)
 --------------------
   e=5;  s=1.235584654  (true sum: 123558.4654)

This is the true result, the exact sum of the operands. It is then rounded to seven digits and normalized if necessary. The final result is:
   e=5;  s=1.235585    (final sum: 123558.5)
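
The same addition can be reproduced with Python's <code>decimal</code> module, which rounds to a chosen number of significant decimal digits. This is only an illustrative sketch: the context names <code>wide</code> and <code>seven</code> are chosen here, and the 7-digit context stands in for the ''decimal32'' precision used in the example rather than implementing that format.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# A wide context so the sum is exact, and a 7-digit context for the rounded result.
wide = Context(prec=28, rounding=ROUND_HALF_EVEN)
seven = Context(prec=7, rounding=ROUND_HALF_EVEN)

a = Decimal("123456.7")   # e=5; s=1.234567
b = Decimal("101.7654")   # e=2; s=1.017654

exact_sum = wide.add(a, b)      # exact: fits easily in 28 digits
rounded_sum = seven.add(a, b)   # same sum, rounded to 7 significant digits

print(exact_sum)     # 123558.4654
print(rounded_sum)   # 123558.5
```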

The lowest three digits of the second operand (654) are essentially lost. This is [[round-off error]]. In extreme cases, the sum of two non-zero numbers may be equal to one of them:

   e=5;  s=1.234567
 + e=−3; s=9.876543

   e=5;  s=1.234567
 + e=5;  s=0.00000009876543 (after shifting)
 ----------------------
   e=5;  s=1.23456709876543 (true sum)
   e=5;  s=1.234567         (after rounding and normalization)
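
A short Python sketch of this absorption effect, again using the <code>decimal</code> module with a 7-digit context as a stand-in for the example format (the variable names are illustrative):

```python
from decimal import Decimal, Context

seven = Context(prec=7)   # rounding defaults to round-half-even

large = Decimal("123456.7")      # e=5;  s=1.234567
small = Decimal("0.009876543")   # e=-3; s=9.876543

result = seven.add(large, small)

print(result)            # 123456.7
print(result == large)   # True: the nonzero addend has no effect on the sum
```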

The conceptual examples above suggest that the adder would need to provide a large number of extra digits to ensure correct rounding. However, for binary addition or subtraction, careful implementation techniques need to carry only a ''guard'' bit, a ''rounding'' bit and one extra ''sticky'' bit beyond the precision of the operands.<ref name="Goldberg_1991"/><ref name="Patterson-Hennessy_2014"/>{{rp|218–220}}
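
The idea can be sketched in Python for the restricted case of adding two positive, normalized binary operands with round-to-nearest-even. The function <code>add_grs</code> and its representation of an operand as an integer significand <code>m</code> of <code>p</code> bits with value <code>m</code>&nbsp;×&nbsp;2<sup><code>e</code></sup> are assumptions made for this illustration; it does not describe any particular hardware adder and omits subtraction, signs and special values.

```python
def add_grs(ma, ea, mb, eb, p):
    """Return (m, e) with m * 2**e equal to ma * 2**ea + mb * 2**eb rounded
    to p bits (nearest, ties to even), keeping only three extra bits."""
    if ea < eb:                          # make (ma, ea) the larger-exponent operand
        ma, ea, mb, eb = mb, eb, ma, ea
    d = ea - eb                          # alignment shift for the smaller operand
    a = ma << 3                          # room for guard, round and sticky bits
    b = mb << 3
    lost = b & ((1 << d) - 1)            # bits shifted out during alignment
    b = (b >> d) | (1 if lost else 0)    # sticky bit: "something nonzero was lost"
    s = a + b
    e = ea
    if s >> (p + 3):                     # carry out: shift right, fold dropped bit into sticky
        s = (s >> 1) | (s & 1)
        e += 1
    m, low3 = s >> 3, s & 0b111          # low3 holds guard, round, sticky
    if low3 > 0b100 or (low3 == 0b100 and m & 1):
        m += 1                           # round up (ties go to the even significand)
        if m >> p:                       # rounding overflowed to p + 1 bits
            m >>= 1
            e += 1
    return m, e

# 4-bit example: 256 + 18 = 274, which lies above the midpoint of 256 and 288.
print(add_grs(0b1000, 5, 0b1001, 1, 4))   # (9, 5), i.e. 9 * 2**5 = 288
```

In this example the guard and round bits alone would look like an exact tie and round to 256; the sticky bit records that the discarded part was nonzero, so the sum is correctly rounded up.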

Another problem of loss of significance occurs when ''approximations'' to two nearly equal numbers are subtracted. In the following example, ''e''&nbsp;=&nbsp;5; ''s''&nbsp;=&nbsp;1.234571 and ''e''&nbsp;=&nbsp;5; ''s''&nbsp;=&nbsp;1.234567 are approximations to the rationals 123457.1467 and 123456.659, respectively.

   e=5;  s=1.234571
 − e=5;  s=1.234567
 ----------------
   e=5;  s=0.000004
   e=−1; s=4.000000 (after rounding and normalization)

The floating-point difference is computed exactly because the numbers are close; the [[Sterbenz lemma]] guarantees this, even in case of underflow when [[gradual underflow]] is supported. Despite this, the difference of the original numbers is ''e''&nbsp;=&nbsp;−1; ''s''&nbsp;=&nbsp;4.877000, which differs by more than 20% from the difference ''e''&nbsp;=&nbsp;−1; ''s''&nbsp;=&nbsp;4.000000 of the approximations. In extreme cases, all significant digits of precision can be lost.<ref name="Goldberg_1991"/><ref name="Sierra_1962"/> This ''[[Catastrophic cancellation|cancellation]]'' illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in [[numerical analysis]]; see also [[#Accuracy problems|Accuracy problems]].
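
The cancellation example can be checked with a small Python sketch. It assumes the <code>decimal</code> module with a 7-digit context as a model of the example format, and the variable names are illustrative. The relative discrepancy is measured against the difference of the approximations, as in the text above.

```python
from decimal import Decimal, Context

seven = Context(prec=7)   # rounding defaults to round-half-even

x = Decimal("123457.1467")
y = Decimal("123456.659")

x7 = seven.plus(x)   # 123457.1, the 7-digit approximation of x
y7 = seven.plus(y)   # 123456.7, the 7-digit approximation of y

approx_diff = seven.subtract(x7, y7)   # exact difference of the approximations
true_diff = x - y                      # difference of the original numbers

print(approx_diff)   # 0.4
print(true_diff)     # 0.4877

# Discrepancy relative to the computed difference: about 0.219, i.e. more than 20%.
discrepancy = abs(true_diff - approx_diff) / approx_diff
print(discrepancy > Decimal("0.2"))   # True
```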

=== Multiplication and division ===
To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.

   e=3;  s=4.734612
 × e=5;  s=5.417242
 -----------------------
   e=8;  s=25.648538980104 (true product)
   e=8;  s=25.64854        (after rounding)
   e=9;  s=2.564854        (after normalization)

Similarly, division is accomplished by subtracting the divisor's exponent from the dividend's exponent, and dividing the dividend's significand by the divisor's significand.
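
Both operations can be illustrated with Python's <code>decimal</code> module, using a 7-digit context as a model of the example format (a sketch; the variable names are chosen for this illustration). The division part checks that dividing the significands and subtracting the exponents gives the same result as dividing the numbers directly.

```python
from decimal import Decimal, Context

seven = Context(prec=7)   # rounding defaults to round-half-even

a = Decimal("4.734612E+3")   # e=3; s=4.734612
b = Decimal("5.417242E+5")   # e=5; s=5.417242

# Multiplication: significands multiplied, exponents added, then rounded and normalized.
product = seven.multiply(a, b)
print(product)   # 2.564854E+9

# Division: significands divided, exponents subtracted (3 - 5 = -2).
quotient = seven.divide(a, b)
significand_quotient = seven.divide(Decimal("4.734612"), Decimal("5.417242"))
print(quotient == significand_quotient.scaleb(3 - 5))   # True
```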

There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed in succession.<ref name="Goldberg_1991"/> In practice, the way these operations are carried out in digital logic can be quite complex (see [[Booth's multiplication algorithm]] and [[Division algorithm]]).<ref group="nb" name="NB_2"/>

=== Literal syntax ===
The syntax of floating-point literals depends on the language. Literals typically use <code>e</code> or <code>E</code> to denote [[scientific notation]]. The [[C (programming language)|C programming language]] and the [[IEEE 754]] standard also define a [[IEEE 754#Hexadecimal literals|hexadecimal literal syntax]] with a base-2 exponent instead of a base-10 exponent.<!-- Also in [[Hexadecimal#Hexadecimal exponential notation]] --> In languages like [[C (programming language)|C]], when the decimal exponent is omitted, a decimal point is needed to differentiate a floating-point literal from an integer. Other languages do not have an integer type (such as [[JavaScript]]), or allow overloading of numeric types (such as [[Haskell (programming language)|Haskell]]). In these cases, digit strings such as <code>123</code> may also be floating-point literals.

Examples of floating-point literals are:
* <code>99.9</code>
* <code>-5000.12</code><!-- do not use ndash, as that isn't part of a literal(?)-->
* <code>6.02e23</code>
* <code>-3e-45</code>
* <code>0x1.fffffep+127</code> in C and IEEE 754
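
As an illustration in one particular language, the following Python sketch shows how these forms are handled there. Python is used only as an example and is not mentioned in the text above. It accepts the decimal forms as written, distinguishes <code>123</code> from <code>123.0</code> much as C does, and has no hexadecimal floating-point literal, although <code>float.fromhex</code> parses the C-style notation.

```python
import struct

# The decimal forms listed above are accepted as written and yield floats.
samples = [99.9, -5000.12, 6.02e23, -3e-45]
print(all(isinstance(v, float) for v in samples))   # True

# A bare digit string is an integer; a decimal point or an exponent makes a float.
print(type(123).__name__, type(123.0).__name__, type(1e3).__name__)   # int float float

# Hexadecimal notation: hexadecimal significand, base-2 exponent after "p".
largest_binary32 = float.fromhex("0x1.fffffep+127")
print(largest_binary32 == (2 - 2**-23) * 2.0**127)   # True

# The value is exactly representable in binary32; this is its bit pattern.
print(struct.pack(">f", largest_binary32).hex())     # 7f7fffff
```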