== Other notable floating-point formats ==
In addition to the widely used [[IEEE 754]] standard formats, other floating-point formats are used, or have been used, in certain domain-specific areas.
* The [[Microsoft Binary Format|Microsoft Binary Format (MBF)]] was developed for the Microsoft BASIC language products, including Microsoft's first product, [[Altair BASIC]] (1975), [[TRS-80|TRS-80 LEVEL II]], [[CP/M]]'s [[MBASIC]], the [[IBM PC 5150]]'s [[BASICA]], [[MS-DOS]]'s [[GW-BASIC]], and [[QuickBASIC]] prior to version 4.00. QuickBASIC versions 4.00 and 4.50 switched to the IEEE 754-1985 format but can revert to the MBF format using the /MBF command-line option. MBF was designed and developed on a simulated [[Intel 8080]] by [[Monte Davidoff]], a dormmate of [[Bill Gates]], during the spring of 1975 for the [[MITS Altair 8800]]. The initial release of July 1975 supported only a single-precision (32-bit) format because of the cost of the Altair 8800's 4 kilobytes of memory. In December 1975, the 8-kilobyte version added a double-precision (64-bit) format. An extended-precision (40-bit) variant format was adopted for other CPUs, notably the [[MOS 6502]] ([[Apple II]], [[Commodore PET]], [[Atari]]), the [[Motorola 6800]] (MITS Altair 680), and the [[Motorola 6809]] ([[TRS-80 Color Computer]]). All Microsoft language products from 1975 through 1987 used the [[Microsoft Binary Format]]; starting in 1988, Microsoft adopted the IEEE 754 standard format in all its products. MBF consists of the MBF single-precision format (32 bits, "6-digit BASIC"),<ref name="Borland_1994_MBF"/><ref name="Steil_2008_6502"/> the MBF extended-precision format (40 bits, "9<!-- is it really 9 digits, not 8? -->-digit BASIC"),<ref name="Steil_2008_6502"/> and the MBF double-precision format (64 bits);<ref name="Borland_1994_MBF"/><ref name="Microsoft_2006_KB35826"/> each of them is represented with an 8-bit exponent, followed by a sign bit, followed by a significand of respectively 23, 31, and 55 bits. A decoding sketch for the 32-bit layout follows the comparison table below.
* The [[bfloat16 floating-point format|bfloat16 format]] requires the same amount of memory (16 bits) as the [[Half-precision floating-point format|IEEE 754 half-precision format]], but allocates 8 bits to the exponent instead of 5, thus providing the same range as an [[Single-precision floating-point format|IEEE 754 single-precision]] number. The tradeoff is reduced precision, as the trailing significand field is shortened from 10 to 7 bits; a conversion sketch follows the comparison table below. This format is mainly used in the training of [[machine learning]] models, where range is more valuable than precision. Many machine learning accelerators provide hardware support for this format.
* The TensorFloat-32<ref name="Kharya_2020"/> format combines the 8-bit exponent of bfloat16 with the 10-bit trailing significand field of the half-precision format, resulting in a size of 19 bits. This format was introduced by [[Nvidia]], which provides hardware support for it in the Tensor Cores of its [[Graphics processing unit|GPUs]] based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format;<ref name="Kharya_2020"/> a software approximation of this rounding is sketched after the comparison table below.
* The [[Hopper (microarchitecture)|Hopper]] architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision but a smaller range (E4M3).<ref name="NVIDIA_Hopper"/><ref name="Micikevicius_2022"/>
* The [[Blackwell (microarchitecture)|Blackwell]] GPU architecture includes support for FP6 (E3M2 and E2M3) and FP4 (E2M1) formats. FP4 is the smallest floating-point format that allows for all IEEE 754 principles (see [[minifloat]]). A generic decoder for these small formats is sketched after the comparison table below.

{| class="wikitable"
|+ Comparison of common floating-point formats
! Type !! Sign !! Exponent !! Trailing significand !! Total bits
|-
| FP4 (E2M1) || 1 || 2 || 1 || 4
|-
| FP6 (E2M3) || 1 || 2 || 3 || 6
|-
| FP6 (E3M2) || 1 || 3 || 2 || 6
|-
| FP8 (E4M3) || 1 || 4 || 3 || 8
|-
| FP8 (E5M2) || 1 || 5 || 2 || 8
|-
| [[Half-precision floating-point format|Half-precision]] || 1 || 5 || 10 || 16
|-
| [[bfloat16 floating-point format|bfloat16]] || 1 || 8 || 7 || 16
|-
| [[TensorFloat-32]] || 1 || 8 || 10 || 19
|-
| [[Single-precision floating-point format|Single-precision]] || 1 || 8 || 23 || 32
|-
| [[Double-precision floating-point format|Double-precision]] || 1 || 11 || 52 || 64
|-
| [[Quadruple-precision floating-point format|Quadruple-precision]] || 1 || 15 || 112 || 128
|-
| [[Octuple-precision floating-point format|Octuple-precision]] || 1 || 19 || 236 || 256
|}
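The MBF field layout described above can be made concrete with a short decoder. The following Python sketch is illustrative only (the function name is hypothetical, not from any Microsoft source) and assumes the byte order used by the 8080 implementations, with the exponent stored in the last byte; nonzero values are interpreted as ±0.1mm…m × 2^(exponent − 128), the leading 1 being implicit.

<syntaxhighlight lang="python">
def decode_mbf32(b: bytes) -> float:
    """Decode a 32-bit MBF value from its 4-byte in-memory representation.

    Assumed layout (low address first):
      b[0], b[1]  low 16 bits of the significand
      b[2]        sign bit (bit 7) plus the high 7 significand bits
      b[3]        exponent, biased by 128; an exponent byte of 0 means 0.0
    """
    exponent = b[3]
    if exponent == 0:  # MBF reserves a zero exponent byte for the value 0
        return 0.0
    sign = -1.0 if b[2] & 0x80 else 1.0
    # Restore the implicit leading 1 in place of the sign bit, giving a
    # 24-bit integer significand in [2**23, 2**24).
    significand = 0x80_0000 | ((b[2] & 0x7F) << 16) | (b[1] << 8) | b[0]
    # value = sign * (significand / 2**24) * 2**(exponent - 128)
    return sign * significand * 2.0 ** (exponent - 128 - 24)

# 1.0 is stored as 0.5 * 2**1: all significand bits clear, exponent 0x81.
assert decode_mbf32(bytes([0x00, 0x00, 0x00, 0x81])) == 1.0
</syntaxhighlight>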
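Because bfloat16 shares its sign and exponent layout with IEEE 754 single precision, converting a single-precision value to bfloat16 amounts to keeping its upper 16 bits, and widening back to single precision is exact. The following Python sketch (function names are illustrative) uses plain truncation for clarity; hardware implementations typically round to nearest even instead.

<syntaxhighlight lang="python">
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Truncate a value to bfloat16, returning its 16-bit pattern."""
    # Reinterpret x as its IEEE 754 single-precision bit pattern ...
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    # ... and keep the top 16 bits: sign, 8-bit exponent, 7-bit significand.
    return bits32 >> 16

def bfloat16_bits_to_float32(bits16: int) -> float:
    """Widen a bfloat16 bit pattern back to single precision (exact)."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

# With only 7 trailing significand bits, pi keeps roughly 2-3 decimal digits.
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.14159265)))  # 3.140625
</syntaxhighlight>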
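TensorFloat-32 keeps the full single-precision exponent and shortens only the trailing significand field from 23 to 10 bits, so the rounding a Tensor Core applies to its inputs can be approximated in software by discarding the 13 low significand bits. This Python sketch truncates for simplicity; the actual hardware conversion may round rather than truncate, so it should be read as an approximation only.

<syntaxhighlight lang="python">
import struct

def round_float32_to_tf32(x: float) -> float:
    """Truncate a single-precision value to TF32 precision.

    TF32 keeps the sign bit and the 8-bit exponent of single precision but
    only the top 10 of the 23 trailing significand bits, so the low
    23 - 10 = 13 bits are cleared here.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # drop the 13 low significand bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(round_float32_to_tf32(1 / 3))  # 0.333251953125 instead of 0.3333333...
</syntaxhighlight>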
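The FP8, FP6, and FP4 entries in the table are minifloats that follow the usual IEEE 754-style interpretation: for an ExMy format, an exponent field biased by 2^(x−1) − 1, subnormal values when the exponent field is zero, and an implicit leading 1 otherwise. The following Python sketch decodes any of the table's small formats from a bit pattern; it deliberately ignores per-format special cases (for example, E4M3 as specified by Nvidia has no infinities and repurposes an all-ones pattern for NaN), so it is a simplified illustration rather than a complete implementation.

<syntaxhighlight lang="python">
def decode_minifloat(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode an IEEE 754-style "ExMy" minifloat from its bit pattern.

    Simplified sketch: handles normal numbers, subnormals, and zero, but
    not per-format special values such as infinities or NaNs.
    """
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * mantissa * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + mantissa * 2.0 ** -man_bits) * 2.0 ** (exponent - bias)

# FP4 (E2M1): 0b0111 is the largest finite value, (1 + 1/2) * 2**(3-1) = 6.0.
assert decode_minifloat(0b0111, exp_bits=2, man_bits=1) == 6.0
# FP8 (E4M3): 0b0_0111_000 encodes 1.0 (biased exponent 7, zero mantissa).
assert decode_minifloat(0b0_0111_000, exp_bits=4, man_bits=3) == 1.0
</syntaxhighlight>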