{{short description|64-bit computer number format}}
{{pp-sock|small=yes}}
'''Double-precision floating-point format''' (sometimes called '''FP64''' or '''float64''') is a [[floating-point arithmetic|floating-point]] [[computer number format|number format]], usually occupying 64 [[Bit|bits]] in computer memory; it represents a wide range of numeric values by using a floating [[radix point]].

Double precision may be chosen when the range or precision of [[single-precision floating-point format|single precision]] would be insufficient.

In the [[IEEE 754]] [[standardization|standard]], the 64-bit base-2 format is officially referred to as '''binary64'''; it was called '''double''' in [[IEEE 754-1985]]. IEEE 754 specifies additional floating-point formats, including 32-bit base-2 ''single precision'' and, more recently, base-10 representations ([[decimal floating point]]).

One of the first [[programming language]]s to provide floating-point data types was [[Fortran]].{{Citation needed|date=September 2023}} Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the [[computer manufacturer]] and computer model, and upon decisions made by programming-language implementers. For example, [[GW-BASIC]]'s double-precision data type was the [[64-bit MBF]] floating-point format.

{{Floating-point}}

==IEEE 754 double-precision binary floating-point format: binary64==
Double-precision binary floating-point is a commonly used format on PCs, due to its wider range compared to single-precision floating point, in spite of its performance and bandwidth cost. It is commonly known simply as ''double''. The IEEE 754 standard specifies a '''binary64''' as having:
* [[Sign bit]]: 1 bit
* [[Exponent]]: 11 bits
* [[Significand]] [[precision (arithmetic)|precision]]: 53 bits (52 explicitly stored)

The sign bit determines the sign of the number (including when this number is zero, which is [[signed zero|signed]]).

The exponent field is an 11-bit unsigned integer from 0 to 2047, in [[Exponent bias|biased form]]: an exponent value of 1023 represents the actual zero. Exponents range from −1022 to +1023 because exponents of −1023 (all 0s) and +1024 (all 1s) are reserved for special numbers.

The 53-bit significand precision gives from 15 to 17 [[Significant figures|significant decimal digits]] of precision (2<sup>−53</sup> ≈ 1.11 × 10<sup>−16</sup>). If a decimal string with at most 15 significant digits is converted to the IEEE 754 double-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number.<ref name="whyieee">{{cite web|url=http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF|title=Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic|author=William Kahan|date=1 October 1997|url-status=live|page=4|archive-url=https://web.archive.org/web/20120208075518/http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF|archive-date=8 February 2012}}</ref>

The format is written with the [[significand]] having an implicit integer bit of value 1 (except for special data, see the exponent encoding below). With the 52 bits of the fraction (F) significand appearing in the memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log<sub>10</sub>(2) ≈ 15.955). The bits are laid out as follows:

[[File:IEEE 754 Double Floating Point Format.svg]]

The real value assumed by a given 64-bit double-precision datum with a given [[Exponent bias|biased exponent]] <math>E</math> and a 52-bit fraction is

: <math> (-1)^{\text{sign}}(1.b_{51}b_{50}...b_{0})_2 \times 2^{E-1023} </math>

or

: <math> (-1)^{\text{sign}}\left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i} \right)\times 2^{E-1023} </math>

Between 2<sup>52</sup> = 4,503,599,627,370,496 and 2<sup>53</sup> = 9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2<sup>53</sup> to 2<sup>54</sup>, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2<sup>51</sup> to 2<sup>52</sup>, the spacing is 0.5, etc.

The spacing as a fraction of the numbers in the range from 2<sup>''n''</sup> to 2<sup>''n''+1</sup> is 2<sup>''n''−52</sup>. The maximum relative rounding error when rounding a number to the nearest representable one (the [[machine epsilon]]) is therefore 2<sup>−53</sup>.

The 11-bit width of the exponent allows the representation of numbers between 10<sup>−308</sup> and 10<sup>308</sup>, with full 15–17 decimal digits of precision. By compromising precision, the subnormal representation allows even smaller values, down to about 5 × 10<sup>−324</sup>.
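For illustration, the following minimal C sketch (an added example, assuming a C99 compiler on a platform whose <code>double</code> is IEEE 754 binary64) prints the spacing of representable values at 2<sup>52</sup> and 2<sup>53</sup>, and the gap between 1.0 and the next representable value:

<syntaxhighlight lang="c">
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Spacing between adjacent doubles grows with magnitude:
       it is exactly 1 in [2^52, 2^53) and exactly 2 in [2^53, 2^54). */
    double a = 0x1p52;   /* 4,503,599,627,370,496 */
    double b = 0x1p53;   /* 9,007,199,254,740,992 */
    printf("spacing at 2^52: %g\n", nextafter(a, INFINITY) - a);  /* prints 1 */
    printf("spacing at 2^53: %g\n", nextafter(b, INFINITY) - b);  /* prints 2 */

    /* DBL_EPSILON is the gap between 1.0 and the next larger double, 2^-52;
       the worst-case relative rounding error (machine epsilon above) is half of it, 2^-53. */
    printf("DBL_EPSILON = %a = %.17g\n", DBL_EPSILON, DBL_EPSILON);
    return 0;
}
</syntaxhighlight>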
===Exponent encoding===
The double-precision binary floating-point exponent is encoded using an [[offset-binary]] representation, with the zero offset being 1023; this is also known as the exponent bias in the IEEE 754 standard. Examples of such representations are:

{|
|-
|''e'' =<code>00000000001<sub>2</sub></code>=<code>001<sub>16</sub></code>=1:
|style="width: 0.4em"|
| <math>2^{1-1023}=2^{-1022}</math>
|(smallest exponent for [[Normal number (computing)|normal numbers]])
|-
|''e'' =<code>01111111111<sub>2</sub></code>=<code>3ff<sub>16</sub></code>=1023:
|
|<math>2^{1023-1023}=2^0</math>
|(zero offset)
|-
|''e'' =<code>10000000101<sub>2</sub></code>=<code>405<sub>16</sub></code>=1029:
|
|<math>2^{1029-1023}=2^6</math>
|
|-
|''e'' =<code>11111111110<sub>2</sub></code>=<code>7fe<sub>16</sub></code>=2046:
|
|<math>2^{2046-1023}=2^{1023}</math>
|(highest exponent)
|}

The exponents <code>000<sub>16</sub></code> and <code>7ff<sub>16</sub></code> have a special meaning:
* <code>00000000000<sub>2</sub></code>=<code>000<sub>16</sub></code> is used to represent a [[signed zero]] (if ''F'' = 0) and [[subnormal number]]s (if ''F'' ≠ 0); and
* <code>11111111111<sub>2</sub></code>=<code>7ff<sub>16</sub></code> is used to represent [[infinity|∞]] (if ''F'' = 0) and [[NaN]]s (if ''F'' ≠ 0),
where ''F'' is the fractional part of the [[significand]]. All bit patterns are valid encodings.

Except for the above exceptions, the entire double-precision number is described by:

: <math>(-1)^{\text{sign}} \times 2^{e - 1023} \times 1.\text{fraction}</math>

In the case of [[subnormal number]]s (''e'' = 0) the double-precision number is described by:

: <math>(-1)^{\text{sign}} \times 2^{1-1023} \times 0.\text{fraction} = (-1)^{\text{sign}} \times 2^{-1022} \times 0.\text{fraction}</math>
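The three fields can be examined directly by reinterpreting the 64 bits of a <code>double</code> as an integer. The following is an added, illustrative C sketch (assuming <code>double</code> is IEEE 754 binary64 and <code>uint64_t</code> is available); the helper <code>decode</code> is ad hoc, not part of any standard library:

<syntaxhighlight lang="c">
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Print the three binary64 fields of a double: sign, biased exponent, fraction. */
static void decode(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);                          /* reinterpret the 64 bits */
    unsigned sign     = (unsigned)(bits >> 63);
    unsigned exponent = (unsigned)((bits >> 52) & 0x7FF);    /* 11 bits, biased by 1023 */
    uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;           /* 52 explicitly stored bits */
    printf("%-24.17g sign=%u exponent=%4u fraction=%013llX\n",
           x, sign, exponent, (unsigned long long)fraction);
}

int main(void) {
    decode(1.0);       /* exponent 1023 (the bias), fraction 0 */
    decode(-2.0);      /* sign 1, exponent 1024 */
    decode(5e-324);    /* exponent 0: subnormal */
    decode(INFINITY);  /* exponent 2047, fraction 0 */
    decode(NAN);       /* exponent 2047, fraction != 0 */
    return 0;
}
</syntaxhighlight>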
===Endianness===
{{Excerpt|Endianness|Floating point}}

===Double-precision examples===
 0 01111111111 0000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 3FF0 0000 0000 0000<sub>16</sub> ≙ +2<sup>0</sup> × 1 = 1
 0 01111111111 0000000000000000000000000000000000000000000000000001<sub>2</sub> ≙ 3FF0 0000 0000 0001<sub>16</sub> ≙ +2<sup>0</sup> × (1 + 2<sup>−52</sup>) ≈ 1.0000000000000002220 (the smallest number greater than 1)
 0 01111111111 0000000000000000000000000000000000000000000000000010<sub>2</sub> ≙ 3FF0 0000 0000 0002<sub>16</sub> ≙ +2<sup>0</sup> × (1 + 2<sup>−51</sup>) ≈ 1.0000000000000004441 (the second smallest number greater than 1)
 0 10000000000 0000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 4000 0000 0000 0000<sub>16</sub> ≙ +2<sup>1</sup> × 1 = 2
 1 10000000000 0000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ C000 0000 0000 0000<sub>16</sub> ≙ −2<sup>1</sup> × 1 = −2
 0 10000000000 1000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 4008 0000 0000 0000<sub>16</sub> ≙ +2<sup>1</sup> × 1.1<sub>2</sub> = 11<sub>2</sub> = 3
 0 10000000001 0000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 4010 0000 0000 0000<sub>16</sub> ≙ +2<sup>2</sup> × 1 = 100<sub>2</sub> = 4
 0 10000000001 0100000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 4014 0000 0000 0000<sub>16</sub> ≙ +2<sup>2</sup> × 1.01<sub>2</sub> = 101<sub>2</sub> = 5
 0 10000000001 1000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 4018 0000 0000 0000<sub>16</sub> ≙ +2<sup>2</sup> × 1.1<sub>2</sub> = 110<sub>2</sub> = 6
 0 10000000011 0111000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 4037 0000 0000 0000<sub>16</sub> ≙ +2<sup>4</sup> × 1.0111<sub>2</sub> = 10111<sub>2</sub> = 23
 0 01111111000 1000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 3F88 0000 0000 0000<sub>16</sub> ≙ +2<sup>−7</sup> × 1.1<sub>2</sub> = 0.00000011<sub>2</sub> = 0.01171875 (3/256)
 0 00000000000 0000000000000000000000000000000000000000000000000001<sub>2</sub> ≙ 0000 0000 0000 0001<sub>16</sub> ≙ +2<sup>−1022</sup> × 2<sup>−52</sup> = 2<sup>−1074</sup> ≈ 4.9406564584124654 × 10<sup>−324</sup> (smallest positive subnormal number)
 0 00000000000 1111111111111111111111111111111111111111111111111111<sub>2</sub> ≙ 000F FFFF FFFF FFFF<sub>16</sub> ≙ +2<sup>−1022</sup> × (1 − 2<sup>−52</sup>) ≈ 2.2250738585072009 × 10<sup>−308</sup> (largest subnormal number)
 0 00000000001 0000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 0010 0000 0000 0000<sub>16</sub> ≙ +2<sup>−1022</sup> × 1 ≈ 2.2250738585072014 × 10<sup>−308</sup> (smallest positive normal number)
 0 11111111110 1111111111111111111111111111111111111111111111111111<sub>2</sub> ≙ 7FEF FFFF FFFF FFFF<sub>16</sub> ≙ +2<sup>1023</sup> × (2 − 2<sup>−52</sup>) ≈ 1.7976931348623157 × 10<sup>308</sup> (largest normal number)
 0 00000000000 0000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 0000 0000 0000 0000<sub>16</sub> ≙ +0 (positive zero)
 1 00000000000 0000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 8000 0000 0000 0000<sub>16</sub> ≙ −0 (negative zero)
 0 11111111111 0000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ 7FF0 0000 0000 0000<sub>16</sub> ≙ +∞ (positive infinity)
 1 11111111111 0000000000000000000000000000000000000000000000000000<sub>2</sub> ≙ FFF0 0000 0000 0000<sub>16</sub> ≙ −∞ (negative infinity)
 0 11111111111 0000000000000000000000000000000000000000000000000001<sub>2</sub> ≙ 7FF0 0000 0000 0001<sub>16</sub> ≙ NaN (sNaN on most processors, such as x86 and ARM)
 0 11111111111 1000000000000000000000000000000000000000000000000001<sub>2</sub> ≙ 7FF8 0000 0000 0001<sub>16</sub> ≙ NaN (qNaN on most processors, such as x86 and ARM)
 0 11111111111 1111111111111111111111111111111111111111111111111111<sub>2</sub> ≙ 7FFF FFFF FFFF FFFF<sub>16</sub> ≙ NaN (an alternative encoding of NaN)
 0 01111111101 0101010101010101010101010101010101010101010101010101<sub>2</sub> ≙ 3FD5 5555 5555 5555<sub>16</sub> ≙ +2<sup>−2</sup> × (1 + 2<sup>−2</sup> + 2<sup>−4</sup> + ... + 2<sup>−52</sup>) ≈ 0.33333333333333331483 (closest approximation to <sup>1</sup>/<sub>3</sub>)
 0 10000000000 1001001000011111101101010100010001000010110100011000<sub>2</sub> ≙ 4009 21FB 5444 2D18<sub>16</sub> ≈ 3.141592653589793116 (closest approximation to π)

[[NaN#Encoding|Encodings of qNaN and sNaN]] are not completely specified in [[IEEE floating point|IEEE 754]] and depend on the processor. Most processors, such as the [[x86]] family and the [[ARM architecture|ARM]] family processors, use the most significant bit of the significand field to indicate a quiet NaN; this is what is recommended by IEEE 754. The [[PA-RISC]] processors use the bit to indicate a signaling NaN.

By default, <sup>1</sup>/<sub>3</sub> rounds down, instead of up like [[single precision]], because of the odd number of bits in the significand.

In more detail:

 Given the hexadecimal representation 3FD5 5555 5555 5555<sub>16</sub>,
   Sign = 0
   Exponent = 3FD<sub>16</sub> = 1021
   Exponent Bias = 1023 (constant value; see above)
   Fraction = 5 5555 5555 5555<sub>16</sub>
   Value = 2<sup>(Exponent − Exponent Bias)</sup> × 1.Fraction (note that Fraction must not be converted to decimal here)
         = 2<sup>−2</sup> × (15 5555 5555 5555<sub>16</sub> × 2<sup>−52</sup>)
         = 2<sup>−54</sup> × 15 5555 5555 5555<sub>16</sub>
         = 0.333333333333333314829616256247390992939472198486328125
         ≈ 1/3
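The last two encodings can be reproduced programmatically. The following is an added C sketch (assuming IEEE 754 binary64 <code>double</code> and C99's <code>%a</code> hexadecimal floating-point output; the helper <code>bits_of</code> is ad hoc):

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return the raw 64-bit encoding of a double. */
static uint64_t bits_of(double x) {
    uint64_t b;
    memcpy(&b, &x, sizeof b);
    return b;
}

int main(void) {
    double third = 1.0 / 3.0;
    double pi    = 3.141592653589793;  /* nearest double to pi */

    /* %a prints the exact significand in hexadecimal with a binary exponent. */
    printf("1/3 = %a  bits = %016llX\n", third, (unsigned long long)bits_of(third));
    /* expected: 0x1.5555555555555p-2  bits = 3FD5555555555555 */
    printf("pi  = %a  bits = %016llX\n", pi, (unsigned long long)bits_of(pi));
    /* expected: 0x1.921fb54442d18p+1  bits = 400921FB54442D18 */
    return 0;
}
</syntaxhighlight>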
===Execution speed with double-precision arithmetic===
Using double-precision floating-point variables is usually slower than working with their single-precision counterparts. One area of computing where this is a particular issue is parallel code running on GPUs. For example, when using [[Nvidia]]'s [[CUDA]] platform, calculations with double precision can take, depending on the hardware, from 2 to 32 times as long to complete as those done using [[Single-precision floating-point format|single precision]].<ref>{{Cite news|url=https://www.tomshardware.com/news/nvidia-titan-v-110-teraflops,36085.html|title=Nvidia's New Titan V Pushes 110 Teraflops From A Single Chip|date=2017-12-08|work=Tom's Hardware|access-date=2018-11-05|language=en}}</ref>

Additionally, many mathematical functions (e.g., sin, cos, atan2, log, exp and sqrt) need more computations to give accurate double-precision results, and are therefore slower.

=== Precision limitations on integer values{{anchor|9007199254740992}} ===
* Integers from −2<sup>53</sup> to 2<sup>53</sup> (−9,007,199,254,740,992 to 9,007,199,254,740,992) can be exactly represented.
* Integers between 2<sup>53</sup> and 2<sup>54</sup> = 18,014,398,509,481,984 round to a multiple of 2 (an even number).
* Integers between 2<sup>54</sup> and 2<sup>55</sup> = 36,028,797,018,963,968 round to a multiple of 4.
* In general, integers between 2<sup>''n''</sup> and 2<sup>''n''+1</sup> round to a multiple of 2<sup>''n''−52</sup>.
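These limits are easy to demonstrate. A minimal added C sketch (assuming IEEE 754 binary64 <code>double</code>):

<syntaxhighlight lang="c">
#include <stdio.h>

int main(void) {
    double x = 9007199254740992.0;             /* 2^53 */
    printf("%.1f\n", x + 1.0);                 /* 9007199254740992.0: 2^53 + 1 rounds back to 2^53 */
    printf("%.1f\n", x + 2.0);                 /* 9007199254740994.0: the spacing here is 2 */
    printf("%s\n", x + 1.0 == x ? "true" : "false");  /* true */
    return 0;
}
</syntaxhighlight>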
==Implementations==
Doubles are implemented in many programming languages in different ways such as the following. On processors with only dynamic precision, such as [[x86]] without [[SSE2]] (or when SSE2 is not used, for compatibility purposes) and with extended precision used by default, software may have difficulty fulfilling some requirements.

===C and C++===
C and C++ offer a wide variety of [[C data types#Basic types|arithmetic types]]. Double precision is not required by the standards (except by the optional annex F of [[C99]], covering IEEE 754 arithmetic), but on most systems, the <code>double</code> type corresponds to double precision. However, on 32-bit x86 with extended precision by default, some compilers may not conform to the C standard, or the arithmetic may suffer from [[Rounding#Double rounding|double rounding]].<ref>{{cite web|url=https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323|title=Bug 323 – optimized code gives strange floating point results|website=gcc.gnu.org|access-date=30 April 2018|url-status=live|archive-url=https://web.archive.org/web/20180430012629/https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323|archive-date=30 April 2018}}</ref>
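Whether a given implementation's <code>double</code> matches binary64, and how floating-point expressions are evaluated, can be checked from the macros in the standard header <code>float.h</code>. A minimal added sketch (assuming a hosted C99 implementation):

<syntaxhighlight lang="c">
#include <float.h>
#include <stdio.h>

int main(void) {
    /* On a platform whose double is IEEE 754 binary64, these report 53 significand
       bits, 15 guaranteed round-trip decimal digits, and the range and precision
       constants quoted earlier in this article. */
    printf("DBL_MANT_DIG    = %d\n", DBL_MANT_DIG);
    printf("DBL_DIG         = %d\n", DBL_DIG);
    printf("DBL_EPSILON     = %.17g\n", DBL_EPSILON);
    printf("DBL_MIN         = %.17g\n", DBL_MIN);
    printf("DBL_MAX         = %.17g\n", DBL_MAX);
    /* FLT_EVAL_METHOD == 2 means expressions are evaluated in extended
       precision (e.g. x87), the situation discussed above. */
    printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
    return 0;
}
</syntaxhighlight>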
===Fortran===
[[Fortran]] provides several integer and real types, and the 64-bit type <code>real64</code>, accessible via Fortran's intrinsic module <code>iso_fortran_env</code>, corresponds to double precision.

===Common Lisp===
[[Common Lisp]] provides the types SHORT-FLOAT, SINGLE-FLOAT, DOUBLE-FLOAT and LONG-FLOAT. Most implementations provide SINGLE-FLOATs and DOUBLE-FLOATs, with the other types being appropriate synonyms. Common Lisp provides exceptions for catching floating-point underflows and overflows, and the inexact floating-point exception, as per IEEE 754. Infinities and NaNs are not described in the ANSI standard; however, several implementations do provide them as extensions.

===Java===
On [[Java (programming language)|Java]] before version 1.2, every implementation had to be IEEE 754 compliant. Version 1.2 allowed implementations to use extra precision in intermediate computations on platforms such as [[x87]]. Thus the modifier [[strictfp]] was introduced to enforce strict IEEE 754 computations. Strict floating point has been restored in Java 17.<ref>{{cite web|first=Joseph D. |last=Darcy |title=JEP 306: Restore Always-Strict Floating-Point Semantics |url=http://openjdk.java.net/jeps/306 |access-date=2021-09-12}}</ref>

===JavaScript===
As specified by the [[ECMAScript]] standard, all arithmetic in [[JavaScript]] shall be done using double-precision floating-point arithmetic.<ref>{{cite book |title=ECMA-262 ECMAScript Language Specification |url=http://www.ecma-international.org/publications/files/ECMA-ST-ARCH/ECMA-262%205th%20edition%20December%202009.pdf |edition=5th |publisher=Ecma International |at=p. 29, §8.5 ''The Number Type'' |url-status=live |archive-url=https://web.archive.org/web/20120313145717/http://www.ecma-international.org/publications/files/ECMA-ST-ARCH/ECMA-262%205th%20edition%20December%202009.pdf |archive-date=2012-03-13}}</ref>
<!-- "shall be" instead of "is" because this may not be the case in practice on processors with only dynamic precision. For instance, Mozilla's JavaScript engine had such a problem in the past: https://bugzilla.mozilla.org/show_bug.cgi?id=264912 -->

===JSON===
The [[JSON]] data encoding format supports numeric values, and the grammar to which numeric expressions must conform has no limits on the precision or range of the numbers so encoded. However, RFC 8259 advises that, since IEEE 754 binary64 numbers are widely implemented, good interoperability can be achieved by implementations processing JSON if they expect no more precision or range than binary64 offers.<ref>{{cite web |url=https://datatracker.ietf.org/doc/html/rfc8259 |title=The JavaScript Object Notation (JSON) Data Interchange Format |date=December 2017 |publisher=Internet Engineering Task Force |access-date=2022-02-01 |last1=Bray |first1=Tim }}</ref>

===Rust and Zig===
[[Rust (programming language)|Rust]] and [[Zig (programming language)|Zig]] have the <code>f64</code> data type.<ref>{{cite web |title=Data Types - The Rust Programming Language |url=https://doc.rust-lang.org/beta/book/ch03-02-data-types.html#floating-point-types |website=doc.rust-lang.org |access-date=10 August 2024}}</ref><ref>{{cite web |title=Documentation - The Zig Programming Language |url=https://ziglang.org/documentation/master/#Floats |website=ziglang.org |access-date=10 August 2024}}</ref>

== See also ==
{{wikifunctions|Z20936}}

==Notes and references==
{{Reflist}}

{{data types}}

[[Category:Binary arithmetic]]
[[Category:Computer arithmetic]]
[[Category:Floating point types]]