===Floating-point numbers===
While both unsigned and signed integers are used in digital systems, even a 32-bit integer is not enough to cover the full range of numbers a calculator can handle, and that is before fractions are considered. To approximate the greater range and precision of [[real number]]s, we have to abandon signed integers and fixed-point numbers and go to a "[[floating-point arithmetic|floating-point]]" format.

In the decimal system, we are familiar with floating-point numbers of the form ([[scientific notation]]):

: 1.1030402 × 10<sup>5</sup> = 1.1030402 × 100000 = 110304.02

or, more compactly:

: 1.1030402E5

which means "1.1030402 times 1 followed by 5 zeroes". We have a certain numeric value (1.1030402) known as a "[[significand]]", multiplied by a power of 10 (E5, meaning 10<sup>5</sup> or 100,000), known as an "[[exponentiation|exponent]]". If we have a negative exponent, that means the number is multiplied by a 1 that many places to the right of the decimal point. For example:

: 2.3434E−6 = 2.3434 × 10<sup>−6</sup> = 2.3434 × 0.000001 = 0.0000023434

The advantage of this scheme is that by using the exponent we can get a much wider range of numbers, even if the number of digits in the significand, or the "numeric precision", is much smaller than the range.

Similar binary floating-point formats can be defined for computers. There are a number of such schemes; the most popular has been defined by the [[Institute of Electrical and Electronics Engineers]] (IEEE). The [[IEEE floating point|IEEE 754-2008]] standard specification defines a 64-bit floating-point format with:

* an 11-bit binary exponent, using "excess-1023" format. Excess-1023 means the exponent appears as an unsigned binary integer from 0 to 2047; subtracting 1023 gives the actual signed value
* a 52-bit significand, also an unsigned binary number, defining a fractional value with a leading implied "1"
* a sign bit, giving the sign of the number.

With the bits stored in 8 bytes of memory:

{| class="wikitable"
|-
! byte 0
| S || x10 || x9 || x8 || x7 || x6 || x5 || x4
|-
! byte 1
| x3 || x2 || x1 || x0 || m51 || m50 || m49 || m48
|-
! byte 2
| m47 || m46 || m45 || m44 || m43 || m42 || m41 || m40
|-
! byte 3
| m39 || m38 || m37 || m36 || m35 || m34 || m33 || m32
|-
! byte 4
| m31 || m30 || m29 || m28 || m27 || m26 || m25 || m24
|-
! byte 5
| m23 || m22 || m21 || m20 || m19 || m18 || m17 || m16
|-
! byte 6
| m15 || m14 || m13 || m12 || m11 || m10 || m9 || m8
|-
! byte 7
| m7 || m6 || m5 || m4 || m3 || m2 || m1 || m0
|}

where "S" denotes the sign bit, "x" denotes an exponent bit, and "m" denotes a significand bit. Once the bits have been extracted, they are converted with the computation:

: <sign> × (1 + <fractional significand>) × 2<sup><exponent> − 1023</sup>

This scheme provides numbers valid out to about 15 decimal digits, with the following range of numbers:

{| class="wikitable" style="text-align:right;font-family:monospace;"
|-
!
! maximum
! minimum
|-
! positive
| 1.797693134862231E+308
| 4.940656458412465E-324
|-
! negative
| -4.940656458412465E-324
| -1.797693134862231E+308
|}

The specification also defines several special values that are not defined numbers, and are known as ''[[NaN]]s'', for "Not A Number". These are used by programs to designate invalid operations and the like.

Some programs also use 32-bit floating-point numbers. The most common scheme uses a 23-bit significand with a sign bit, plus an 8-bit exponent in "excess-127" format, giving seven valid decimal digits.

{| class="wikitable"
|-
! byte 0
| S || x7 || x6 || x5 || x4 || x3 || x2 || x1
|-
! byte 1
| x0 || m22 || m21 || m20 || m19 || m18 || m17 || m16
|-
! byte 2
| m15 || m14 || m13 || m12 || m11 || m10 || m9 || m8
|-
! byte 3
| m7 || m6 || m5 || m4 || m3 || m2 || m1 || m0
|}

The bits are converted to a numeric value with the computation:

: <sign> × (1 + <fractional significand>) × 2<sup><exponent> − 127</sup>

leading to the following range of numbers:

{| class="wikitable" style="text-align:right;font-family:monospace;"
|-
!
! maximum
! minimum
|-
! positive
| 3.402823E+38
| 2.802597E-45
|-
! negative
| -2.802597E-45
| -3.402823E+38
|}

Such floating-point numbers are known as "reals" or "floats" in general, but with a number of variations:

A 32-bit float value is sometimes called a "real32" or a "single", meaning "single-precision floating-point value".

A 64-bit float is sometimes called a "real64" or a "double", meaning "double-precision floating-point value".

The relation between numbers and bit patterns is chosen for convenience in computer manipulation; eight bytes stored in computer memory may represent a 64-bit real, two 32-bit reals, or four signed or unsigned integers, or some other kind of data that fits into eight bytes. The only difference is how the computer interprets them. If the computer stored four unsigned integers and then read them back from memory as a 64-bit real, it almost always would be a perfectly valid real number, though it would be junk data.

Only a finite range of real numbers can be represented with a given number of bits. Arithmetic operations can overflow or underflow, producing a value too large or too small to be represented.

The representation has a limited precision. For example, only about 15 decimal digits can be represented with a 64-bit real. If a very small floating-point number is added to a large one, the result is just the large one. The small number was too small to even show up in 15 or 16 digits of resolution, and the computer effectively discards it. Analyzing the effect of limited precision is a well-studied problem. Estimates of the magnitude of round-off errors and methods to limit their effect on large calculations are part of any large computation project. The precision limit is different from the range limit, as it affects the significand, not the exponent.

The significand is a binary fraction that doesn't necessarily perfectly match a decimal fraction. In many cases a sum of reciprocal powers of 2 does not match a specific decimal fraction, and the results of computations will be slightly off. For example, the decimal fraction "0.1" is equivalent to an infinitely repeating binary fraction: 0.000110011 ...<ref>{{cite web|last=Goebel|first=Greg|title=Computer Numbering Format|url=http://www.vectorsite.net/tsfloat.html|archive-url=https://archive.today/20130222091425/http://www.vectorsite.net/tsfloat.html|url-status=usurped|archive-date=February 22, 2013|access-date=10 September 2012}}</ref>
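As an illustration of the 64-bit layout and the conversion formula above, here is a minimal Python sketch (not part of the article source; the function name <code>decode_double</code> and the sample value are illustrative). It handles only normal numbers; zeros, subnormals, infinities and NaNs use the reserved exponent values 0 and 2047 and are not covered here.

<syntaxhighlight lang="python">
import struct

def decode_double(value):
    """Split an IEEE 754 double into its bit fields, then rebuild the value
    as (-1)^sign * (1 + fraction) * 2^(exponent - 1023)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))  # the 8 raw bytes as one integer
    sign        = (bits >> 63) & 0x1       # 1 sign bit (S)
    exponent    = (bits >> 52) & 0x7FF     # 11 exponent bits (x10..x0), excess-1023
    significand = bits & ((1 << 52) - 1)   # 52 significand bits (m51..m0)
    fraction = significand / (1 << 52)     # fractional part; the leading "1" is implied
    rebuilt = (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 1023)
    return sign, exponent - 1023, rebuilt

print(decode_double(110304.02))   # (0, 16, 110304.02)
</syntaxhighlight>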
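The point that only the interpretation of the bytes differs can also be demonstrated by reinterpreting the same eight bytes several ways. A brief sketch under the same assumptions (the sample value is arbitrary):

<syntaxhighlight lang="python">
import struct

raw = struct.pack(">d", 110304.02)         # eight bytes holding a 64-bit real

as_double  = struct.unpack(">d", raw)      # the bytes read back as one 64-bit real
as_singles = struct.unpack(">2f", raw)     # the same bytes as two 32-bit reals
as_shorts  = struct.unpack(">4H", raw)     # ... or as four 16-bit unsigned integers

print(as_double, as_singles, as_shorts)
</syntaxhighlight>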
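Finally, the two precision effects described above, a decimal fraction such as 0.1 having no exact binary equivalent, and a small addend vanishing next to a large one, can be observed directly. A short illustrative sketch, again assuming Python's standard 64-bit doubles:

<syntaxhighlight lang="python">
from decimal import Decimal

# 0.1 has no exact binary representation; the stored double is the nearest
# representable value, slightly above 0.1.
print(Decimal(0.1))    # 0.1000000000000000055511151231257827021181583404541015625
print((0.1).hex())     # 0x1.999999999999ap-4 -- the repeating pattern, rounded off

# Adding a value far below the ~15-16 significant digits of a double
# leaves the large value unchanged: the small addend is effectively discarded.
large = 1.0e18
print(large + 1.0 == large)   # True
</syntaxhighlight>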