== Other notable floating-point formats ==
In addition to the widely used [[IEEE 754]] standard formats, other floating-point formats are used, or have been used, in certain domain-specific areas.
* The [[Microsoft Binary Format|Microsoft Binary Format (MBF)]] was developed for the Microsoft BASIC language products, including Microsoft's first product, [[Altair BASIC]] (1975), [[TRS-80|TRS-80 LEVEL II]], [[CP/M]]'s [[MBASIC]], the [[IBM PC 5150]]'s [[BASICA]], [[MS-DOS]]'s [[GW-BASIC]], and [[QuickBASIC]] prior to version 4.00. QuickBASIC versions 4.00 and 4.50 switched to the IEEE 754-1985 format but can revert to the MBF format using the /MBF command-line option. MBF was designed and developed on a simulated [[Intel 8080]] by [[Monte Davidoff]], a dormmate of [[Bill Gates]], during the spring of 1975 for the [[MITS Altair 8800]]. The initial release of July 1975 supported only a single-precision (32-bit) format because of the cost of the Altair 8800's 4 kilobytes of memory. In December 1975, the 8-kilobyte version added a double-precision (64-bit) format. An extended-precision (40-bit) variant format was adopted for other CPUs, notably the [[MOS 6502]] ([[Apple II]], [[Commodore PET]], [[Atari]]), the [[Motorola 6800]] (MITS Altair 680), and the [[Motorola 6809]] ([[TRS-80 Color Computer]]). All Microsoft language products from 1975 through 1987 used the [[Microsoft Binary Format]]; starting in 1988, Microsoft adopted the IEEE 754 standard format in all its products. MBF consists of the MBF single-precision format (32 bits, "6-digit BASIC"),<ref name="Borland_1994_MBF"/><ref name="Steil_2008_6502"/> the MBF extended-precision format (40 bits, "9<!-- is it really 9 digits, not 8? -->-digit BASIC"),<ref name="Steil_2008_6502"/> and the MBF double-precision format (64 bits);<ref name="Borland_1994_MBF"/><ref name="Microsoft_2006_KB35826"/> each of them is represented with an 8-bit exponent, followed by a sign bit, followed by a significand of respectively 23, 31, and 55 bits. A decoding sketch for the 32-bit layout follows the comparison table below.
* The [[bfloat16 floating-point format|bfloat16 format]] requires the same amount of memory (16 bits) as the [[Half-precision floating-point format|IEEE 754 half-precision format]], but allocates 8 bits to the exponent instead of 5, thus providing the same range as an [[Single-precision floating-point format|IEEE 754 single-precision]] number. The tradeoff is reduced precision, as the trailing significand field is shortened from 10 to 7 bits; a conversion sketch follows the comparison table below. This format is mainly used in the training of [[machine learning]] models, where range is more valuable than precision. Many machine learning accelerators provide hardware support for this format.
* The TensorFloat-32<ref name="Kharya_2020"/> format combines the 8-bit exponent of bfloat16 with the 10-bit trailing significand field of the half-precision format, resulting in a size of 19 bits. This format was introduced by [[Nvidia]], which provides hardware support for it in the Tensor Cores of its [[Graphics processing unit|GPUs]] based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format;<ref name="Kharya_2020"/> a software approximation of this rounding is sketched after the comparison table below.
* The [[Hopper (microarchitecture)|Hopper]] architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision but a smaller range (E4M3).<ref name="NVIDIA_Hopper"/><ref name="Micikevicius_2022"/>
* The [[Blackwell (microarchitecture)|Blackwell]] GPU architecture includes support for FP6 (E3M2 and E2M3) and FP4 (E2M1) formats. FP4 is the smallest floating-point format that allows for all IEEE 754 principles (see [[minifloat]]). A generic decoder for these small formats is sketched after the comparison table below.

{| class="wikitable"
|+ Comparison of common floating-point formats
! Type !! Sign !! Exponent !! Trailing significand !! Total bits
|-
| FP4 (E2M1) || 1 || 2 || 1 || 4
|-
| FP6 (E2M3) || 1 || 2 || 3 || 6
|-
| FP6 (E3M2) || 1 || 3 || 2 || 6
|-
| FP8 (E4M3) || 1 || 4 || 3 || 8
|-
| FP8 (E5M2) || 1 || 5 || 2 || 8
|-
| [[Half-precision floating-point format|Half-precision]] || 1 || 5 || 10 || 16
|-
| [[bfloat16 floating-point format|bfloat16]] || 1 || 8 || 7 || 16
|-
| [[TensorFloat-32]] || 1 || 8 || 10 || 19
|-
| [[Single-precision floating-point format|Single-precision]] || 1 || 8 || 23 || 32
|-
| [[Double-precision floating-point format|Double-precision]] || 1 || 11 || 52 || 64
|-
| [[Quadruple-precision floating-point format|Quadruple-precision]] || 1 || 15 || 112 || 128
|-
| [[Octuple-precision floating-point format|Octuple-precision]] || 1 || 19 || 236 || 256
|}
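The MBF field layout described above can be made concrete with a short decoder. The following Python sketch is illustrative only (the function name is hypothetical, not from any Microsoft source) and assumes the byte order used by the 8080 implementations, with the exponent stored in the last byte; nonzero values are interpreted as ±0.1mm…m × 2^(exponent − 128), the leading 1 being implicit.

<syntaxhighlight lang="python">
def decode_mbf32(b: bytes) -> float:
    """Decode a 32-bit MBF value from its 4-byte in-memory representation.

    Assumed layout (low address first):
      b[0], b[1]  low 16 bits of the significand
      b[2]        sign bit (bit 7) plus the high 7 significand bits
      b[3]        exponent, biased by 128; an exponent byte of 0 means 0.0
    """
    exponent = b[3]
    if exponent == 0:  # MBF reserves a zero exponent byte for the value 0
        return 0.0
    sign = -1.0 if b[2] & 0x80 else 1.0
    # Restore the implicit leading 1 in place of the sign bit, giving a
    # 24-bit integer significand in [2**23, 2**24).
    significand = 0x80_0000 | ((b[2] & 0x7F) << 16) | (b[1] << 8) | b[0]
    # value = sign * (significand / 2**24) * 2**(exponent - 128)
    return sign * significand * 2.0 ** (exponent - 128 - 24)

# 1.0 is stored as 0.5 * 2**1: all significand bits clear, exponent 0x81.
assert decode_mbf32(bytes([0x00, 0x00, 0x00, 0x81])) == 1.0
</syntaxhighlight>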
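Because bfloat16 shares its sign and exponent layout with IEEE 754 single precision, converting a single-precision value to bfloat16 amounts to keeping its upper 16 bits, and widening back to single precision is exact. The following Python sketch (function names are illustrative) uses plain truncation for clarity; hardware implementations typically round to nearest even instead.

<syntaxhighlight lang="python">
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Truncate a value to bfloat16, returning its 16-bit pattern."""
    # Reinterpret x as its IEEE 754 single-precision bit pattern ...
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    # ... and keep the top 16 bits: sign, 8-bit exponent, 7-bit significand.
    return bits32 >> 16

def bfloat16_bits_to_float32(bits16: int) -> float:
    """Widen a bfloat16 bit pattern back to single precision (exact)."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

# With only 7 trailing significand bits, pi keeps roughly 2-3 decimal digits.
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.14159265)))  # 3.140625
</syntaxhighlight>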
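TensorFloat-32 keeps the full single-precision exponent and shortens only the trailing significand field from 23 to 10 bits, so the rounding a Tensor Core applies to its inputs can be approximated in software by discarding the 13 low significand bits. This Python sketch truncates for simplicity; the actual hardware conversion may round rather than truncate, so it should be read as an approximation only.

<syntaxhighlight lang="python">
import struct

def round_float32_to_tf32(x: float) -> float:
    """Truncate a single-precision value to TF32 precision.

    TF32 keeps the sign bit and the 8-bit exponent of single precision but
    only the top 10 of the 23 trailing significand bits, so the low
    23 - 10 = 13 bits are cleared here.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # drop the 13 low significand bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(round_float32_to_tf32(1 / 3))  # 0.333251953125 instead of 0.3333333...
</syntaxhighlight>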
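The FP8, FP6, and FP4 entries in the table are minifloats that follow the usual IEEE 754-style interpretation: for an ExMy format, an exponent field biased by 2^(x−1) − 1, subnormal values when the exponent field is zero, and an implicit leading 1 otherwise. The following Python sketch decodes any of the table's small formats from a bit pattern; it deliberately ignores per-format special cases (for example, E4M3 as specified by Nvidia has no infinities and repurposes an all-ones pattern for NaN), so it is a simplified illustration rather than a complete implementation.

<syntaxhighlight lang="python">
def decode_minifloat(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode an IEEE 754-style "ExMy" minifloat from its bit pattern.

    Simplified sketch: handles normal numbers, subnormals, and zero, but
    not per-format special values such as infinities or NaNs.
    """
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * mantissa * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + mantissa * 2.0 ** -man_bits) * 2.0 ** (exponent - bias)

# FP4 (E2M1): 0b0111 is the largest finite value, (1 + 1/2) * 2**(3-1) = 6.0.
assert decode_minifloat(0b0111, exp_bits=2, man_bits=1) == 6.0
# FP8 (E4M3): 0b0_0111_000 encodes 1.0 (biased exponent 7, zero mantissa).
assert decode_minifloat(0b0_0111_000, exp_bits=4, man_bits=3) == 1.0
</syntaxhighlight>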