Pentium Pro
==Microarchitecture==
[[File:Intel Pentium Pro Microarchitecture Block Diagram.svg|thumb|266px|right|Block Diagram of the Pentium Pro's Microarchitecture]]
[[File:Ppro512K.jpg|thumb|upright|200 MHz Pentium Pro with a 512 KB L2 cache in [[Pin grid array|PGA]] package]]
[[File:Pentium Pro Black Edition Front.jpg|thumb|upright|200 MHz Pentium Pro with a 1 MB L2 cache in [[Pin grid array|PPGA]] package]]
[[File:Pentiumpro moshen.jpg|thumb|upright|[[Decapping|Decapped]] Pentium Pro 256 KB]]
The lead architect of the Pentium Pro was [[Fred Pollack]], who specialized in [[superscalar]]ity and had also worked as the lead engineer of the [[Intel iAPX 432]].{{r|dvorak}}

===Summary===
{{More citations needed section|date=March 2014}}
The Pentium Pro incorporated a new [[microarchitecture]], different from the Pentium's [[P5 (microarchitecture)|P5]] microarchitecture. It has a decoupled, 14-stage superpipelined architecture which uses an instruction pool. The Pentium Pro ([[P6 (microarchitecture)|P6]]) introduced many radical architectural changes, mirroring other contemporary [[x86]] designs such as the [[NexGen]] [[Nx586]] and [[Cyrix]] [[6x86]]. The Pentium Pro pipeline had extra decode stages to dynamically translate [[IA-32]] instructions into buffered [[micro-operation]] sequences, which could then be analysed, reordered, and renamed in order to detect parallelizable operations that may be issued to more than one [[execution unit]] at once. The Pentium Pro thus featured [[out-of-order execution]], including [[speculative execution]] via [[register renaming]]. It also had a wider 36-bit [[address bus]], usable by [[Physical Address Extension]] (PAE), allowing it to access up to 64 GB ({{nowrap|64{{nbsp}}×{{nbsp}}1024<sup>3</sup> bytes}}) of memory.

The Pentium Pro has an 8 KB [[instruction cache]], from which up to 16 bytes are [[Instruction cycle#Summary of stages|fetched]] on each cycle and sent to the [[instruction decoder]]s. There are three instruction decoders.
The decoders are unequal in ability: only one (the general decoder) can decode any x86 instruction, while the other two (the simple decoders) can only decode simple x86 instructions. This restricts the Pentium Pro's ability to decode multiple instructions simultaneously, limiting superscalar execution. x86 instructions are decoded into 118-bit [[micro-operation]]s (micro-ops). The micro-ops are [[reduced instruction set computer]] (RISC)-like; that is, they encode an operation, two sources, and a destination. The general decoder can generate up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each per cycle. Thus, x86 instructions that operate on memory (e.g., add this register to this location in memory) can only be processed by the general decoder, as this operation requires a minimum of three micro-ops. Likewise, the simple decoders are limited to instructions that can be translated into one micro-op. Instructions that require more than four micro-ops are translated with the assistance of a sequencer, which generates the required micro-ops over multiple clock cycles.

The Pentium Pro was the first processor in the x86 family to support upgradeable [[microcode]] under [[BIOS]] and/or [[operating system]] (OS) control.{{r|Stiller_1996}}

Micro-ops exit the [[re-order buffer]] (ROB) and enter a reservation station (RS), where they await dispatch to the execution units. In each clock cycle, up to five micro-ops can be dispatched, one through each of five dispatch ports.
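The decoder constraints described above can be illustrated with a rough model. The following C sketch is an assumption-laden simplification, not Intel's documented behaviour: the function name <code>decode_cycles</code> and its parameters are hypothetical, instructions are assumed to be decoded strictly in program order, the sequencer is assumed to emit up to four micro-ops per cycle, and fetch-bandwidth limits are ignored.

```c
#include <stddef.h>

/*
 * Simplified model of one general decoder (up to four micro-ops per
 * cycle) plus two simple decoders (one micro-op each per cycle).
 *
 * uops[i] is the number of micro-ops that instruction i decodes into.
 * Returns the number of decode cycles the sequence needs in this model.
 *
 * Assumptions (illustrative only): in-order decode; an instruction
 * needing more than four micro-ops uses the sequencer alone, at an
 * assumed rate of four micro-ops per cycle.
 */
unsigned decode_cycles(const unsigned *uops, size_t n)
{
    unsigned cycles = 0;
    size_t i = 0;

    while (i < n) {
        if (uops[i] > 4) {
            /* Sequencer path: round up to whole cycles of four micro-ops. */
            cycles += (uops[i] + 3) / 4;
            i++;
            continue;
        }

        /* The general decoder takes the next instruction (up to four micro-ops). */
        i++;

        /* Each simple decoder takes one following single-micro-op instruction. */
        for (int slot = 0; slot < 2 && i < n && uops[i] == 1; slot++)
            i++;

        cycles++;
    }
    return cycles;
}
```

Under this model, the sequence of micro-op counts {3, 1, 1} decodes in one cycle, whereas {1, 3, 1} takes two, because the three-micro-op instruction has to wait for the general decoder. This is why instruction ordering mattered to compilers targeting the P6.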
The Pentium Pro has a total of six execution units: two integer units, one [[floating-point unit]] (FPU), a load unit, a store address unit, and a store data unit.<ref name="iaopt">{{cite web |url=ftp://download.intel.com/design/PentiumII/manuals/24281603.PDF |archive-url=https://web.archive.org/web/20070121103522/http://download.intel.com:80/design/PentiumII/manuals/24281603.PDF |archive-date=2007-01-21 |url-status=dead |title=Intel Architecture Optimization Manual |page=2{{hyp}}8 |date=1997 }}</ref> One of the integer units shares a port with the FPU, and therefore the Pentium Pro can only dispatch one integer micro-op and one floating-point micro-op, or two integer micro-ops, per cycle, in addition to micro-ops for the other three execution units. Of the two integer units, only the one that shares port 0 with the FPU has the full complement of functions such as a [[barrel shifter]], multiplier, divider, and support for LEA instructions. The second integer unit, which is connected to port 1, does not have these facilities and is limited to simple operations such as add, subtract, and the calculation of branch target addresses.<ref name="iaopt"/>

The FPU executes floating-point operations. Addition and multiplication are pipelined and have a latency of three and five cycles, respectively. Division and square root are not pipelined and are executed in separate units that share the FPU's ports. Division and square root have a latency of 18–36 and 29–69 cycles, respectively. The smallest number is for single-precision (32-bit) floating-point numbers and the largest for extended-precision (80-bit) numbers. Division and square root can operate simultaneously with adds and multiplies, blocking them only when a result has to be stored in the ROB.

After the microprocessor was released, a bug was discovered in the [[floating point unit]], commonly called the "Pentium Pro and Pentium II FPU bug" and referred to by Intel as the "flag erratum".
The bug occurs under some circumstances during floating point-to-integer conversion when the floating point number will not fit into the smaller integer format, causing the FPU to deviate from its documented behaviour. The bug is considered to be minor and occurs under such special circumstances that very few, if any, software programs are affected.

The Pentium Pro [[P6 (microarchitecture)|P6 microarchitecture]] was used in one form or another by Intel for more than a decade. The pipeline would scale from its initial 150 MHz start all the way up to 1.4 GHz with the "Tualatin" [[Pentium III]]. The design's various traits would continue after that in the derivative core called "[[Banias (microprocessor)|Banias]]" in [[Pentium M]] and [[Intel Core]] ([[Yonah (microprocessor)|Yonah]]), which itself would evolve into the [[Intel Core (microarchitecture)|Core microarchitecture]] ([[Core 2]] processor) in 2006 and onward.{{r|Stokes_20060405}}

===Instruction set===
The Pentium Pro (P6) introduced new "conditional move" instructions into the Intel range. The <code>CMOV''cc''</code> and <code>FCMOV''cc''</code> instructions fetch a source value from a register or memory, and write that value to a destination register only if a condition ''cc'' on the flags register holds; the conditions are the same as those used by the conditional jump (<code>J''cc''</code>) instructions. For example, <code>CMOVNE</code> moves a specified value into a register if the flags register matches the NE (not-equal) condition, i.e. the [[zero flag]] is unset. If the zero flag is set, the condition is false, and the destination register keeps its value. This allows simple if-then-else operations (such as commonly used by the [[Ternary conditional operator|<code>? :</code> operation]] in [[C (programming language)|C]]) without a costly conditional branch. The <code>FCMOV''cc''</code> variant provides the same functionality for floating-point registers.
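As a minimal sketch of the if-then-else pattern described above, the following C functions compare a branching form with a ternary form. The function names are hypothetical, and whether a compiler actually emits <code>CMOV''cc''</code> for the ternary form depends on the compiler, target, and optimization settings; nothing here guarantees it. The last function simply models the <code>CMOVNE</code> semantics in plain C.

```c
#include <stdint.h>

/* Branching form: a compiler may emit a compare followed by a conditional jump. */
int32_t max_branch(int32_t a, int32_t b)
{
    if (a > b)
        return a;
    return b;
}

/*
 * Ternary form: when targeting P6-class or later x86 processors, an
 * optimizing compiler may (but is not required to) lower this to a
 * compare followed by a conditional move instead of a branch.
 */
int32_t max_select(int32_t a, int32_t b)
{
    return (a > b) ? a : b;
}

/*
 * Plain-C model of CMOVNE semantics: the destination takes the source
 * value when the zero flag is clear, and keeps its old value otherwise.
 */
int32_t cmovne_model(int32_t dest, int32_t src, int zero_flag)
{
    return zero_flag ? dest : src;
}
```

Both <code>max_branch</code> and <code>max_select</code> return the same result for every input; they differ only in the machine code a compiler may choose to generate.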
Unfortunately, <code>CMOV</code> supports neither immediate (in-line constant) source values nor memory destinations.

A second development was the documentation of the <code>UD2</code> illegal instruction. This opcode is reserved and guaranteed to cause an illegal instruction exception on the P6 and all later processors. This allows developers to easily crash the current program in a future-proof fashion when a bug is detected by software.

===Performance===
Despite being advanced for the time, the Pentium Pro's out-of-order register renaming architecture had trouble running [[16-bit computing|16-bit]] code and mixed code ([[8-bit computing|8-bit]] with 16-bit (8/16), or 16-bit with [[32-bit computing|32-bit]] (16/32)), as using partial registers causes frequent pipeline flushing.<ref>{{cite web |url=http://qcd.phys.cmu.edu/QCDcluster/intel/vtune/reference/LipsPro_Partial_Stall.htm |title=Partial Register Stall Warning |work=VTune Performance Analyzer online help |archive-url=https://web.archive.org/web/20170830055933/http://qcd.phys.cmu.edu/QCDcluster/intel/vtune/reference/LipsPro_Partial_Stall.htm |archive-date=August 30, 2017 |url-status=dead}}</ref> Using partial registers was then a common performance optimization, as it incurred no performance penalty on pre-P6 Intel processors; also, the dominant operating systems at the time of the Pentium Pro's release were 16-bit [[DOS]], and mixed 16/32-bit [[Windows 3.1x]] and [[Windows 95]] (although the latter requires a 32-bit [[i386|80386]] CPU as a minimum, much of its code is still 16-bit for performance reasons, such as the 16-bit [[Windows USER]] [[dynamic link library]], [[Windows USER#Implementation|user.exe]]). This, along with the high cost of Pentium Pro systems, led to tepid sales among PC buyers at the time. To fully use the Pentium Pro's [[P6 (microarchitecture)|P6 microarchitecture]], a fully 32-bit operating system is needed, such as [[Windows NT]], [[Linux]], [[Unix]], or [[OS/2]].
The performance issues on legacy code were later partly mitigated by Intel with the Pentium II.

When introduced, the Pentium Pro slightly outperformed the fastest RISC microprocessors on integer performance when running the [[SPECint|SPECint95]] benchmark,{{r|MPR 1995-11-13|p=2}} but its floating-point performance was significantly lower, half that of some RISC microprocessors.{{r|MPR 1995-11-13|p=3}} The Pentium Pro's integer performance lead disappeared rapidly, first overtaken by the [[MIPS Technologies]] [[R10000]] in January 1996, and then by [[Digital Equipment Corporation]]'s EV56 variant of the [[Alpha 21164]].<ref name="MPR 1996-07-08">{{cite magazine |last=Gwennap |first=Linley |date=July 8, 1996 |title=Digital's 21164 Reaches 500 MHz |magazine=[[Microprocessor Report]]}}</ref>

Reviewers quickly noted the very slow writes to video memory as the weak spot of the P6 platform, with performance here being as low as 10% of an identically clocked Pentium system in benchmarks such as VIDSPEED.
Methods to circumvent this included setting VESA drawing to system memory instead of video memory in games such as ''[[Quake (video game)|Quake]]'',<ref>{{Cite web|url=https://github.com/id-Software/Quake/blob/master/WinQuake/data/TECHINFO.TXT|title=Quake/TECHINFO.TXT at master · id-Software/Quake|website=[[GitHub]]|date=November 25, 2022|access-date=February 10, 2019|archive-date=June 10, 2017|archive-url=https://web.archive.org/web/20170610101303/https://github.com/id-Software/Quake/blob/master/WinQuake/data/TECHINFO.TXT|url-status=live}}</ref> and later on utilities such as FASTVID emerged, which could double performance in certain games by enabling the [[write combining]] features of the CPU.<ref>{{Cite web|url=http://www.gamers.org/dEngine/quake/info/techinfo.09|title=Quake Technical Information file}}</ref><ref>{{Cite web|url=https://www.mdgx.com/umb.htm#FAS|title=MDGx Complete UMBPCI.SYS Guide|at=Fast Video|website=MDGx MAX Speed WinDOwS Tricks + Secrets|access-date=January 7, 2023|archive-date=January 7, 2023|archive-url=https://web.archive.org/web/20230107204339/https://www.mdgx.com/umb.htm#FAS|url-status=live}}</ref> [[Memory type range register]]s (MTRRs) were set automatically by Windows video drivers starting in 1997, after which the Pentium Pro's improved cache/memory subsystem and FPU performance allowed it to outclass the Pentium clock-for-clock in the emerging 3D games of the mid-to-late 1990s, particularly when using [[Windows NT 4.0]]. However, its lack of an [[MMX (instruction set)|MMX]] implementation reduced performance in multimedia applications that made use of those instructions.

===Caching===
Perhaps the Pentium Pro's most noticeable addition was its on-package [[L2 cache]], which ranged from 256 KB at introduction to 1 MB in 1997. At the time, manufacturing technology did not make it feasible to integrate a large L2 cache into the processor core.
Intel instead placed the L2 die(s) separately in the package, which still allowed the cache to run at the same clock speed as the CPU core. Additionally, unlike most motherboard-based cache schemes that shared the main system bus with the CPU, the Pentium Pro's cache had its own [[back-side bus]] (called ''[[dual independent bus]]'' by Intel). Because of this, the CPU could read main memory and cache concurrently, greatly reducing a traditional bottleneck.<ref>{{cite magazine |title=Accelerated Graphics Port |magazine=[[Next Generation (magazine)|Next Generation]]|issue=37|publisher=[[Imagine Media]] |date=January 1998 |pages=94–96}}</ref> The cache was also "non-blocking", meaning that the processor could issue more than one cache request at a time (up to 4), reducing cache-miss penalties; this is an example of [[memory-level parallelism]] (MLP). These properties combined to produce an L2 cache that was much faster than the motherboard-based caches of older processors. This cache alone gave the CPU an advantage in input/output performance over older [[x86]] CPUs. In multiprocessor configurations, the Pentium Pro's integrated cache greatly improved performance in comparison to architectures in which all CPUs shared a central cache.

However, this far faster L2 cache did come with some complications. The Pentium Pro's "on-package cache" arrangement was unique. The processor and the cache were on separate dies in the same package and connected closely by a full-speed bus. The two or three dies had to be bonded together early in the production process, before testing was possible. This meant that a single, tiny flaw in either die made it necessary to discard the entire assembly, which was one of the reasons for the Pentium Pro's relatively low production yield and high cost. All versions of the chip were expensive, those with 1024 KB being particularly so, since they required two 512 KB cache dies as well as the processor die.