== Technology ==
The NetBurst microarchitecture introduced several features: [[Hyper-threading]], [[#Hyper-Pipelined Technology|Hyper-Pipelined Technology]], [[#Rapid Execution Engine|Rapid Execution Engine]], [[#Execution Trace Cache|Execution Trace Cache]], and the [[replay system]]. All of them first appeared in this microarchitecture, and some never appeared again afterwards.

=== Hyper-threading ===
{{Main|Hyper-threading}}
Hyper-threading is Intel's proprietary [[simultaneous multithreading]] (SMT) implementation, used to improve the parallelization of computations (performing multiple tasks at once) on x86 processors. Intel introduced it with NetBurst processors in 2002 and later reintroduced it in the [[Nehalem (microarchitecture)|Nehalem microarchitecture]], after its absence from Core 2 processors.

=== Quad-Pumped Front-Side Bus ===
The Willamette and Northwood cores feature an external front-side bus (FSB) that runs at 100&nbsp;MHz and performs four data transfers per clock cycle, giving an effective speed of 400&nbsp;MHz. Later revisions of the Northwood core, along with the Prescott core and [[Pentium D|its derivatives]], have an effective 800&nbsp;MHz front-side bus (200&nbsp;MHz quad-pumped). [https://arstechnica.com/uncategorized/2004/07/ask-ars-20040710/]
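
The quad-pumping arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 64-bit data bus width used for the bandwidth figure is an assumption that is not stated in the text above.

```python
def effective_fsb_rate(base_clock_mhz, transfers_per_clock=4):
    """Effective transfer rate in millions of transfers per second."""
    return base_clock_mhz * transfers_per_clock


def peak_bandwidth_mb_s(base_clock_mhz, bus_width_bits=64, transfers_per_clock=4):
    """Peak theoretical bandwidth in MB/s (1 MB = 10**6 bytes).

    bus_width_bits=64 is an assumed data bus width, not taken from the article.
    """
    transfers = effective_fsb_rate(base_clock_mhz, transfers_per_clock)
    return transfers * bus_width_bits / 8


# Base clocks mentioned in the section: 100 MHz and 200 MHz.
for base in (100, 200):
    print(base, effective_fsb_rate(base), peak_bandwidth_mb_s(base))
```

The "400 MHz" and "800 MHz" figures are therefore transfer rates rather than clock frequencies: the bus clock itself stays at 100 or 200 MHz.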

=== Hyper-Pipelined Technology ===
The Willamette and Northwood cores contain a 20-stage [[instruction pipelining|instruction pipeline]], twice as many stages as the 10-stage pipeline of the Pentium III. The Prescott core lengthened the pipeline further, to 31 stages. A drawback of a longer pipeline is that more stages must be flushed and refilled after a branch misprediction, which increases the penalty of each misprediction. To address this issue, Intel devised the Rapid Execution Engine and invested heavily in its branch prediction technology, which Intel claims reduces [[branch misprediction]]s by 33% compared with the [[Pentium III]].<ref>{{cite web|date=November 20, 2000|title=The Trace Cache Branch Prediction Unit|url=https://www.tomshardware.com/reviews/intel,264-8.html|access-date=April 30, 2021|work=Intel's New Pentium 4 Processor|publisher=[[Tom's Hardware]]}}</ref> In practice, the longer pipeline lowered the number of [[instructions per cycle|instructions per clock]] (IPC) executed, and clock speeds could not be raised high enough to offset the lost performance because power consumption and heat increased more than expected.
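
The trade-off between pipeline depth and misprediction penalty can be illustrated with a simple first-order model. The sketch below is not a description of real NetBurst behavior: it assumes that the misprediction penalty equals the pipeline depth in cycles, and the base CPI, branch fraction, and misprediction rate are made-up illustrative values.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction, including branch misprediction stalls."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Illustrative assumptions only, not measured values.
BASE_CPI = 1.0          # cycles per instruction with perfect prediction
BRANCH_FRACTION = 0.2   # one instruction in five is a branch
MISPREDICT_RATE = 0.05  # baseline fraction of branches mispredicted

scenarios = [
    ("10-stage pipeline", 10, MISPREDICT_RATE),
    ("20-stage pipeline", 20, MISPREDICT_RATE),
    ("20-stage, 33% fewer mispredictions", 20, MISPREDICT_RATE * (1 - 0.33)),
    ("31-stage, 33% fewer mispredictions", 31, MISPREDICT_RATE * (1 - 0.33)),
]

for name, depth, rate in scenarios:
    # Simplification: the penalty is taken to be the full pipeline depth.
    cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, rate, depth)
    print(f"{name}: CPI = {cpi:.3f}, IPC = {1 / cpi:.3f}")
```

Under these assumed numbers, reducing mispredictions by a third does not fully cancel the cost of doubling the penalty, so the deeper pipeline still ends up with a lower IPC and must make up the difference through clock speed.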

=== Rapid Execution Engine ===
With this technology, the two [[arithmetic logic unit]]s (ALUs) in the core of the CPU are double-pumped, meaning that they operate at twice the core clock frequency. For example, in a 3.8&nbsp;GHz processor, the ALUs effectively operate at 7.6&nbsp;GHz. This is intended to compensate for the low IPC; it also considerably enhances the integer performance of the CPU. Intel also replaced the high-speed [[barrel shifter]] with a shift/rotate execution unit that operates at the same frequency as the CPU core. The downside is that certain instructions are much slower (both relatively and absolutely) than on earlier processors, which makes optimization for multiple target CPUs difficult. Shift and rotate operations are one example: they suffer from the lack of a barrel shifter, which had been present on every x86 CPU since the i386 and was also present on the main competing processor, the [[Athlon]].
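
As a small worked example of the double-pumping arithmetic, the following sketch computes the effective ALU clock and the corresponding time per ALU clock period. The function names are hypothetical, and the model simply doubles the core clock as described above.

```python
def alu_clock_ghz(core_clock_ghz, pump_factor=2):
    """Effective ALU clock for a double-pumped ALU."""
    return core_clock_ghz * pump_factor


def alu_period_ns(core_clock_ghz, pump_factor=2):
    """Duration of one ALU clock period in nanoseconds."""
    return 1.0 / alu_clock_ghz(core_clock_ghz, pump_factor)


core = 3.8  # GHz, the example used in the text
print(f"ALU clock: {alu_clock_ghz(core)} GHz")
print(f"ALU period: {alu_period_ns(core):.4f} ns "
      f"(core period: {1.0 / core:.4f} ns)")
```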

=== Execution Trace Cache ===
{{Main|Trace cache}}
Within the L1 cache of the CPU, Intel incorporated its Execution Trace Cache. It stores decoded [[micro-operation]]s, so that when an instruction is executed again, the CPU reads the already decoded micro-ops directly from the trace cache instead of fetching and decoding the instruction a second time, thereby saving considerable time. Moreover, the micro-ops are cached along their predicted path of execution, which means that when the CPU fetches them from the cache, they are already arranged in the expected order of execution.<ref>{{cite web |url=https://www.tomshardware.com/reviews/intel,264-6.html |title=Entering The Execution Pipeline - Pentium 4's Trace Cache, Continued |work=Intel's New Pentium 4 Processor |publisher=[[Tom's Hardware]] |date=November 20, 2000 |access-date=April 30, 2021}}</ref> Intel later introduced a similar but simpler concept, the [[micro-operation cache]] (UOP cache), with [[Sandy Bridge]].
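
The idea of caching decoded micro-ops along a predicted path can be sketched with a toy model. Everything here is hypothetical and greatly simplified: the `TraceCache` class, the tiny `PROGRAM` table, and the micro-op strings are invented for illustration, and a real trace cache is indexed and filled very differently.

```python
# Toy "program": address -> micro-ops produced by decoding that instruction.
PROGRAM = {
    0x00: ["load r1"],
    0x04: ["cmp r1, 0"],
    0x08: ["jz 0x20"],
    0x20: ["load tmp", "add tmp, r1", "store tmp"],  # one instruction, three micro-ops
}


class TraceCache:
    """Caches decoded micro-ops in predicted execution order (toy model)."""

    def __init__(self, program):
        self._program = program
        self._traces = {}      # trace start address -> tuple of micro-ops
        self.decode_calls = 0  # how many instructions had to be decoded

    def _decode(self, address):
        self.decode_calls += 1
        return self._program[address]

    def fetch(self, start, predicted_path):
        """Return the micro-ops for the trace beginning at `start`."""
        trace = self._traces.get(start)
        if trace is None:
            # Miss: decode each instruction along the predicted path once,
            # storing the micro-ops in that order (across the taken branch).
            uops = []
            for address in predicted_path:
                uops.extend(self._decode(address))
            trace = tuple(uops)
            self._traces[start] = trace
        return trace


cache = TraceCache(PROGRAM)
path = [0x00, 0x04, 0x08, 0x20]   # branch at 0x08 predicted taken
first = cache.fetch(0x00, path)   # miss: four instructions decoded
second = cache.fetch(0x00, path)  # hit: no decoding needed
print(first == second, cache.decode_calls)
```

On the second fetch the decoder is not invoked at all, and the micro-ops of the branch target at 0x20 already follow the branch in the stored trace.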

=== Replay system ===
{{Main|Replay system}}
The replay system is a subsystem within the Intel Pentium 4 processor that catches operations mistakenly sent for execution by the processor's scheduler. Operations caught by the replay system are re-executed in a loop until the conditions necessary for their proper execution have been fulfilled.
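
The replay loop described above can be sketched as a toy simulation. This is not how the hardware is implemented: the function `run_with_replay`, the one-operation-per-cycle model, and the `ready_at` table are assumptions made purely to illustrate "re-execute until the conditions are fulfilled".

```python
from collections import deque


def run_with_replay(ops, ready_at):
    """Simulate speculative issue with replay.

    ops      -- operation names in issue order
    ready_at -- name -> first cycle at which the operation's inputs are valid
    Returns (completed, attempts): completion cycles and execution attempts.
    """
    queue = deque(ops)
    attempts = {name: 0 for name in ops}
    completed = []
    cycle = 0
    while queue:
        name = queue.popleft()
        attempts[name] += 1
        if cycle >= ready_at[name]:
            completed.append((name, cycle))  # inputs were valid: result is kept
        else:
            queue.append(name)               # issued too early: replay it
        cycle += 1
    return completed, attempts


# "mul" is issued before its input (for example, a late load) is available.
completed, attempts = run_with_replay(["add", "mul"], {"add": 0, "mul": 3})
print(completed)
print(attempts)
```

In this run, "mul" is executed three times before its inputs are valid, which illustrates why replayed operations consume execution resources without making progress.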

=== Branch prediction hints ===
The Intel NetBurst architecture allows [[branch prediction]] hints to be inserted into the code to indicate whether the static prediction for a branch should be taken or not taken. This feature was abandoned in later Intel processors. According to Intel, NetBurst's branch prediction algorithm is 33% better than the one in P6.<ref name=Fog_Microarchitecture>{{cite web | last = Fog | first = Agner | title = The microarchitecture of Intel, AMD and VIA CPUs | date = December 1, 2016  | url = http://www.agner.org/optimize/microarchitecture.pdf | pages = 36 | access-date = March 22, 2017}}</ref><ref name="urlwww.ece.uah.edu">{{cite web |url=http://www.ece.uah.edu/~milenka/docs/milenkovic_WDDD02.pdf |title=Demystifying Intel Branch Predictors |first1=Milena |last1=Milenkovic |first2=Aleksandar |last2=Milenkovic |first3=Jeffrey |last3=Kulick}}</ref>
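
As an illustration of how such hints appear in machine code, the sketch below prepends a hint prefix byte to a short conditional jump. The prefix values 2Eh (not taken) and 3Eh (taken) are those documented in Intel's manuals for the Pentium 4 and are not given in the text above; the helper `encode_jcc_rel8` is a hypothetical name, and the sketch covers only the short (8-bit displacement) form of conditional jumps.

```python
HINT_NOT_TAKEN = 0x2E  # same byte value as the CS segment override prefix
HINT_TAKEN = 0x3E      # same byte value as the DS segment override prefix


def encode_jcc_rel8(opcode, displacement, hint=None):
    """Encode a short conditional jump, optionally with a branch hint prefix.

    opcode       -- short Jcc opcode, 0x70..0x7F (for example 0x74 = JZ)
    displacement -- signed 8-bit branch displacement
    hint         -- True (taken), False (not taken), or None (no prefix)
    """
    if not 0x70 <= opcode <= 0x7F:
        raise ValueError("not a short Jcc opcode")
    if not -128 <= displacement <= 127:
        raise ValueError("displacement does not fit in 8 bits")
    body = bytes([opcode, displacement & 0xFF])
    if hint is None:
        return body
    prefix = HINT_TAKEN if hint else HINT_NOT_TAKEN
    return bytes([prefix]) + body


print(encode_jcc_rel8(0x74, 0x10).hex())              # jz with no hint
print(encode_jcc_rel8(0x74, 0x10, hint=True).hex())   # jz, hinted taken
print(encode_jcc_rel8(0x75, -2, hint=False).hex())    # jnz, hinted not taken
```

Because the hint bytes reuse segment override prefix values, a processor that does not implement the hints can decode the same bytes as an ordinary prefixed jump.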