===Summary===
{{More citations needed section|date=March 2014}}

The Pentium Pro incorporated a new [[microarchitecture]], different from the Pentium's [[P5 (microarchitecture)|P5]] microarchitecture. It has a decoupled, 14-stage superpipelined architecture that uses an instruction pool.
The Pentium Pro ([[P6 (microarchitecture)|P6]]) implemented many radical architectural differences mirroring other contemporary [[x86]] designs such as the [[NexGen]] [[Nx586]] and [[Cyrix]] [[6x86]]. The Pentium Pro pipeline had extra decode stages to dynamically translate [[IA-32]] instructions into buffered [[micro-operation]] sequences which could then be analysed, reordered, and renamed in order to detect parallelizable operations that could be issued to more than one [[execution unit]] at once. The Pentium Pro thus featured [[out-of-order execution]], including [[speculative execution]], supported by [[register renaming]]. It also had a wider 36-bit [[address bus]], usable by [[Physical Address Extension]] (PAE), allowing it to access up to 64 GB ({{nowrap|64{{nbsp}}×{{nbsp}}1024<sup>3</sup> bytes}}) of memory.
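
The 64 GB figure follows directly from the 36-bit address width. The following is a minimal arithmetic sketch; the variable names are illustrative only, and the 32-bit comparison is included as a reference point rather than as a claim taken from the text above.

```python
# Illustrative arithmetic only: size of a physical address space
# for a given address-bus width.
ADDRESS_BITS = 36        # Pentium Pro address-bus width, as stated in the text
GIB = 1024 ** 3          # bytes in one binary gigabyte (1024^3)

addressable_bytes = 2 ** ADDRESS_BITS
addressable_gib = addressable_bytes // GIB

# Reference point (assumption for comparison): a 32-bit address space.
addressable_gib_32 = (2 ** 32) // GIB

print(addressable_bytes)     # 68719476736
print(addressable_gib)       # 64
print(addressable_gib_32)    # 4
```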

The Pentium Pro has an 8&nbsp;KB [[instruction cache]], from which up to 16 bytes are [[Instruction cycle#Summary of stages|fetched]] on each cycle and sent to the [[instruction decoder]]s. There are three instruction decoders. The decoders are unequal in ability: only one can decode any x86 instruction, while the other two can only decode simple x86 instructions. This restricts the Pentium Pro's ability to decode multiple instructions simultaneously, limiting superscalar execution. x86 instructions are decoded into 118-bit [[micro-operation]]s (micro-ops). The micro-ops are [[reduced instruction set computer]] (RISC)-like; that is, they encode an operation, two sources, and a destination. The general decoder can generate up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each per cycle. Thus, x86 instructions that operate on memory (e.g., add this register to this location in memory) can only be processed by the general decoder, as this operation requires a minimum of three micro-ops. Likewise, the simple decoders are limited to instructions that can be translated into one micro-op. Instructions that require more than four micro-ops are translated with the assistance of a sequencer, which generates the required micro-ops over multiple clock cycles. The Pentium Pro was the first processor in the x86 family to support upgradeable [[microcode]] under [[BIOS]] or [[operating system]] (OS) control, or both.{{r|Stiller_1996}}
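
The effect of the unequal decoders on decode throughput can be illustrated with a small model. This is a simplified sketch, not a description of the actual hardware: it assumes instructions are decoded strictly in program order, that the first instruction of each cycle goes to the general decoder, that the sequencer emits up to four micro-ops per cycle with nothing else decoded alongside it, and it ignores the 16-byte fetch limit. The function name <code>decode_cycles</code> is hypothetical.

```python
from math import ceil

GENERAL_MAX_UOPS = 4   # general decoder: up to four micro-ops per cycle (from the text)
SIMPLE_DECODERS = 2    # two simple decoders, one micro-op each (from the text)


def decode_cycles(uop_counts):
    """Estimate decode cycles for instructions given micro-ops per instruction.

    Simplified in-order model; see the stated assumptions in the lead-in.
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        uops = uop_counts[i]
        if uops < 1:
            raise ValueError("each instruction needs at least one micro-op")
        if uops > GENERAL_MAX_UOPS:
            # Assumption: the sequencer emits up to four micro-ops per cycle
            # and no other instruction is decoded in those cycles.
            cycles += ceil(uops / GENERAL_MAX_UOPS)
            i += 1
            continue
        # The first instruction of the group takes the general decoder.
        i += 1
        # Following single-micro-op instructions fill the simple decoders;
        # a multi-micro-op instruction must wait for the next cycle.
        filled = 0
        while filled < SIMPLE_DECODERS and i < n and uop_counts[i] == 1:
            i += 1
            filled += 1
        cycles += 1
    return cycles


# Six simple instructions decode three per cycle.
print(decode_cycles([1, 1, 1, 1, 1, 1]))   # 2
# The same six instructions, badly ordered, take twice as long.
print(decode_cycles([3, 1, 1, 3, 1, 1]))   # 2
print(decode_cycles([1, 3, 1, 3, 1, 3]))   # 4
```

Under these assumptions, instruction ordering matters: placing each multi-micro-op instruction ahead of two simple ones keeps all three decoders busy, whereas interleaving them the other way leaves the simple decoders idle.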

Micro-ops exit the [[re-order buffer]] (ROB) and enter a reservation station (RS), where they await dispatch to the execution units. In each clock cycle, up to five micro-ops can be dispatched to five execution units. The Pentium Pro has a total of six execution units: two integer units, one [[floating-point unit]] (FPU), a load unit, a store address unit, and a store data unit.<ref name="iaopt">{{cite web |url=ftp://download.intel.com/design/PentiumII/manuals/24281603.PDF |archive-url=https://web.archive.org/web/20070121103522/http://download.intel.com:80/design/PentiumII/manuals/24281603.PDF |archive-date=2007-01-21 |url-status=dead |title=Intel Architecture Optimization Manual |page=2{{hyp}}8 |date=1997 }}</ref> One of the integer units shares its port with the FPU, and therefore the Pentium Pro can only dispatch either one integer micro-op and one floating-point micro-op, or two integer micro-ops per cycle, in addition to micro-ops for the other three execution units. Of the two integer units, only the one that shares the path with the FPU on port 0 has the full complement of functions such as a [[barrel shifter]], multiplier, divider, and support for LEA instructions. The second integer unit, which is connected to port 1, does not have these facilities and is limited to simple operations such as add, subtract, and the calculation of branch target addresses.<ref name="iaopt"/>
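
The dispatch constraints described above can be expressed as a small feasibility check. The sketch below is an illustrative model only: the micro-op kind names (<code>int_simple</code>, <code>int_complex</code>, <code>fp</code>, <code>load</code>, <code>store_address</code>, <code>store_data</code>) and the function name are hypothetical, and the model assumes that a simple integer micro-op may use either integer unit while complex integer and floating-point micro-ops must use port 0.

```python
from collections import Counter

# Kinds that can only execute on port 0 (full integer unit or FPU).
PORT0_ONLY = {"int_complex", "fp"}
# Kinds that each have one dedicated execution unit.
MEMORY_KINDS = {"load", "store_address", "store_data"}
KNOWN_KINDS = PORT0_ONLY | MEMORY_KINDS | {"int_simple"}


def can_dispatch_same_cycle(kinds):
    """Return True if all given micro-op kinds could dispatch in one cycle."""
    counts = Counter(kinds)
    unknown = set(counts) - KNOWN_KINDS
    if unknown:
        raise ValueError(f"unknown micro-op kinds: {sorted(unknown)}")

    # Port 0 accepts at most one micro-op per cycle.
    port0_only = sum(counts[k] for k in PORT0_ONLY)
    if port0_only > 1:
        return False

    # Simple integer micro-ops can use port 1, plus port 0 if it is free.
    simple_slots = 1 + (1 - port0_only)
    if counts["int_simple"] > simple_slots:
        return False

    # One load, one store-address, and one store-data micro-op at most.
    return all(counts[k] <= 1 for k in MEMORY_KINDS)


# Two integer micro-ops plus one for each memory unit: five in total.
print(can_dispatch_same_cycle(
    ["int_simple", "int_simple", "load", "store_address", "store_data"]))  # True
# A floating-point micro-op and a multiply both need port 0.
print(can_dispatch_same_cycle(["fp", "int_complex"]))                      # False
```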

The FPU executes floating-point operations. Addition and multiplication are pipelined and have a latency of three and five cycles, respectively. Division and square root are not pipelined and are executed in separate units that share the FPU's ports. Division and square root have a latency of 18–36 and 29–69 cycles, respectively. In each range, the smaller number is for single-precision (32-bit) floating-point numbers and the larger for extended-precision (80-bit) numbers. Division and square root can operate simultaneously with adds and multiplies, and prevent them from executing only when a division or square-root result has to be stored in the ROB.
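
The practical difference between pipelined and non-pipelined operations can be illustrated with a rough cycle estimate. This is a simplified sketch under stated assumptions: the operations are independent of one another, a pipelined unit accepts a new operation every <code>issue_interval</code> cycles (the default of one cycle is an assumption, not a figure from the text), and a non-pipelined unit must finish one operation before starting the next. The function name is hypothetical.

```python
def total_cycles(n_ops, latency, pipelined, issue_interval=1):
    """Estimate cycles to complete n_ops independent operations.

    issue_interval is an assumed parameter: cycles between successive
    issues on a pipelined unit.
    """
    if n_ops < 0 or latency < 1 or issue_interval < 1:
        raise ValueError("invalid argument")
    if n_ops == 0:
        return 0
    if pipelined:
        # The first result appears after `latency` cycles; each further
        # result follows `issue_interval` cycles after the previous one.
        return latency + (n_ops - 1) * issue_interval
    # Non-pipelined: operations run strictly one after another.
    return n_ops * latency


# Ten independent additions (latency 3, pipelined).
print(total_cycles(10, latency=3, pipelined=True))     # 12
# Ten single-precision divisions (latency 18, not pipelined).
print(total_cycles(10, latency=18, pipelined=False))   # 180
```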

After the microprocessor was released, a bug was discovered in the [[floating-point unit]], commonly called the "Pentium Pro and Pentium II FPU bug" and referred to by Intel as the "flag erratum". The bug occurs under some circumstances during floating-point-to-integer conversion when the floating-point number will not fit into the smaller integer format, causing the FPU to deviate from its documented behaviour. The bug is considered to be minor and occurs under such special circumstances that very few, if any, software programs are affected.

The Pentium Pro [[P6 (microarchitecture)|P6 microarchitecture]] was used in one form or another by Intel for more than a decade. The pipeline scaled from its initial 150&nbsp;MHz all the way up to 1.4&nbsp;GHz with the "Tualatin" [[Pentium III]]. The design's various traits continued after that in the derivative core called "[[Banias (microprocessor)|Banias]]" in [[Pentium M]] and [[Intel Core]] ([[Yonah (microprocessor)|Yonah]]), which itself evolved into the [[Intel Core (microarchitecture)|Core microarchitecture]] ([[Core 2]] processor) in 2006 and onward.{{r|Stokes_20060405}}