==Hazards==
Hennessy and Patterson coined the term [[Hazard (computer architecture)|''hazard'']] for situations where instructions in a pipeline would produce wrong answers.

===Structural hazards===
'''Structural hazards occur when two instructions might attempt to use the same resources at the same time.''' Classic RISC pipelines avoided these hazards by replicating hardware. In particular, branch instructions could have used the ALU to compute the target address of the branch. If the ALU were used in the decode stage for that purpose, an ALU instruction followed by a branch would have seen both instructions attempt to use the ALU simultaneously. It is simple to resolve this conflict by designing a specialized branch target adder into the decode stage.

===Data hazards===
'''Data hazards occur when an instruction, scheduled blindly, would attempt to use data before the data is available in the register file.''' In the classic RISC pipeline, data hazards are avoided in one of two ways:

====Solution A. Bypassing====
Bypassing is also known as [[operand forwarding]]. Suppose the CPU is executing the following piece of code:

<syntaxhighlight lang="nasm">
SUB r3,r4 -> r10 ; Writes r3 - r4 to r10
AND r10,r3 -> r11 ; Writes r10 & r3 to r11
</syntaxhighlight>

The instruction fetch and decode stages send the second instruction one cycle after the first. They flow down the pipeline as shown in this diagram:

[[File:Pipeline Data Hazard.svg|center]]

In a ''naive pipeline'', without hazard consideration, the data hazard progresses as follows:

In cycle 3, the <code>SUB</code> instruction calculates the new value for <code>r10</code>. In the same cycle, the <code>AND</code> operation is decoded, and the value of <code>r10</code> is fetched from the register file. However, the <code>SUB</code> instruction has not yet written its result to <code>r10</code>. Write-back of this normally occurs in cycle 5 (green box). Therefore, the value read from the register file and passed to the ALU (in the Execute stage of the <code>AND</code> operation, red box) is incorrect.

Instead, we must pass the data that was computed by <code>SUB</code> back to the Execute stage (i.e. to the red circle in the diagram) of the <code>AND</code> operation ''before'' it is normally written back. The solution to this problem is a pair of bypass multiplexers. These multiplexers sit at the end of the decode stage, and their flopped outputs are the inputs to the ALU. Each multiplexer selects between:

# A register file read port (i.e. the output of the decode stage, as in the naive pipeline): {{red|red}} arrow
# The current register pipeline of the ALU (to bypass by one stage): {{blue|blue}} arrow
# The current register pipeline of the access stage (which is either a loaded value or a forwarded ALU result; this provides bypassing of two stages): {{purple|purple}} arrow. Note that this requires the data to be passed ''backwards'' in time by one cycle. If this occurs, a [[bubble (computing)|bubble]] must be inserted to stall the <code>AND</code> operation until the data is ready.

Decode stage logic compares the registers written by instructions in the execute and access stages of the pipeline to the registers read by the instruction in the decode stage, and causes the multiplexers to select the most recent data. These bypass multiplexers make it possible for the pipeline to execute simple instructions with just the latency of the ALU, the multiplexer, and a flip-flop. Without the multiplexers, the latency of writing and then reading the register file would have to be included in the latency of these instructions.

Note that the data can only be passed ''forward'' in time: the data cannot be bypassed back to an earlier stage if it has not been processed yet. In the case above, the data is passed forward (by the time the <code>AND</code> is ready for the register in the ALU, the <code>SUB</code> has already computed it).

[[File:Data Forwarding (One Stage).svg|center]]
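To make the two bypass distances concrete, here is a minimal hypothetical sequence in the same notation as the example above (the <code>OR</code> instruction is an illustrative addition, not part of the article's example). Each comment names the multiplexer input the dependent read would use:

<syntaxhighlight lang="nasm">
SUB r3,r4 -> r10  ; r10 is computed at the end of SUB's Execute stage
AND r10,r3 -> r11 ; one instruction behind: r10 comes from the ALU's register
                  ; (one-stage bypass, the blue arrow)
OR  r10,r5 -> r12 ; two instructions behind: r10 comes from the access-stage
                  ; register (two-stage bypass, the purple arrow)
</syntaxhighlight>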
====Solution B. Pipeline interlock====
However, consider the following instructions:

<syntaxhighlight lang="nasm">
LD adr -> r10
AND r10,r3 -> r11
</syntaxhighlight>

The data read from the address <code>adr</code> is not present in the data cache until after the Memory Access stage of the <code>LD</code> instruction. By this time, the <code>AND</code> instruction is already through the ALU. To resolve this would require the data from memory to be passed backwards in time to the input to the ALU. This is not possible. The solution is to delay the <code>AND</code> instruction by one cycle. The data hazard is detected in the decode stage, and the fetch and decode stages are '''stalled''': they are prevented from flopping their inputs and so stay in the same state for a cycle. The execute, access, and write-back stages downstream see an extra no-operation instruction (NOP) inserted between the <code>LD</code> and <code>AND</code> instructions.

This NOP is termed a pipeline ''[[bubble (computing)|bubble]]'' since it floats in the pipeline, like an air bubble in a water pipe, occupying resources but not producing useful results. The hardware to detect a data hazard and stall the pipeline until the hazard is cleared is called a '''pipeline interlock'''.

{| align=center style="text-align:center"
|'''Bypassing backwards in time'''
|'''Problem resolved using a bubble'''
|-
|[[File:Data Forwarding (Two Stage, error).svg]]
|[[File:Data Forwarding (Two Stage).svg]]
|}

However, a pipeline interlock does not have to be used with any data forwarding. The first example of the <code>SUB</code> followed by <code>AND</code> and the second example of <code>LD</code> followed by <code>AND</code> can be solved by stalling the first stage by three cycles until write-back is achieved and the data in the register file is correct, so that the correct register value is fetched by the <code>AND</code>'s Decode stage. This causes quite a performance hit, as the processor spends a lot of time processing nothing, but clock speeds can be increased as there is less forwarding logic to wait for.

This data hazard can be detected quite easily when the program's machine code is written by the compiler. The [[Stanford MIPS]] machine relied on the compiler to add the NOP instructions in this case, rather than having the circuitry to detect and (more taxingly) stall the first two pipeline stages. Hence the name MIPS: Microprocessor without Interlocked Pipeline Stages. It turned out that the extra NOP instructions added by the compiler expanded the program binaries enough that the instruction cache hit rate was reduced. The stall hardware, although expensive, was put back into later designs to improve instruction cache hit rate, at which point the acronym no longer made sense.
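As a sketch of what the Stanford MIPS approach looks like in practice, the compiler would emit the load-use sequence above with an explicit NOP between the load and its consumer (the surrounding instructions are the article's example; only the NOP is added, matching the one-cycle delay described above):

<syntaxhighlight lang="nasm">
LD  adr -> r10    ; the loaded value is not ready for the very next instruction
NOP               ; compiler-inserted filler: no interlock hardware stalls the pipe
AND r10,r3 -> r11 ; one cycle later, the loaded value can be forwarded to the ALU
</syntaxhighlight>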
===Control hazards===
'''Control hazards are caused by conditional and unconditional branching.''' The classic RISC pipeline resolves branches in the decode stage, which means the branch resolution recurrence is two cycles long. There are three implications:

* The branch resolution recurrence goes through quite a bit of circuitry: the instruction cache read, register file read, branch condition compute (which involves a 32-bit compare on the MIPS CPUs), and the next instruction address multiplexer.
* Because branch and jump targets are calculated in parallel to the register read, RISC ISAs typically do not have instructions that branch to a register+offset address. Jump to register is supported.
* On any taken branch, the instruction immediately after the branch is always fetched from the instruction cache. If this instruction is ignored, there is a one cycle per taken branch [[Instructions per cycle|IPC]] penalty, which is quite large.

There are four schemes to solve this performance problem with branches:

* Predict Not Taken: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch is not taken. If the branch is not taken, the pipeline stays full. If the branch is taken, the instruction is flushed (marked as if it were a NOP), and one cycle's opportunity to finish an instruction is lost.
* Branch Likely: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch is taken. The compiler can always fill the branch delay slot on such a branch, and since branches are more often taken than not, such branches have a smaller IPC penalty than the previous kind.
* [[Branch delay slot|Branch Delay Slot]]: Always fetch the instruction after the branch from the instruction cache, and always execute it, even if the branch is taken (see the sketch after this list). Instead of taking an IPC penalty for some fraction of branches either taken (perhaps 60%) or not taken (perhaps 40%), branch delay slots take an IPC penalty only for those branches whose delay slots the compiler could not fill with useful instructions. The SPARC, MIPS, and MC88K designers designed a branch delay slot into their ISAs.
* [[Branch prediction|Branch Prediction]]: In parallel with fetching each instruction, guess whether the instruction is a branch or jump, and if so, guess the target. On the cycle after a branch or jump, fetch the instruction at the guessed target. When the guess is wrong, flush the incorrectly fetched target.
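As an illustrative sketch of delay slot filling (the instruction sequence and the <code>BEQ</code> notation are hypothetical, written in the same style as the earlier examples), the compiler hoists a logically independent instruction from before the branch into the slot:

<syntaxhighlight lang="nasm">
; Before scheduling:
ADD r1,r2 -> r5     ; independent of the branch condition (r3, r4)
BEQ r3,r4 -> target ; delayed branch: the next instruction always executes

; After scheduling: the ADD fills the delay slot instead of a NOP
BEQ r3,r4 -> target
ADD r1,r2 -> r5     ; executes whether or not the branch is taken
</syntaxhighlight>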
Delayed branches were controversial, first, because their semantics are complicated. A delayed branch specifies that the jump to a new location happens ''after'' the next instruction. That next instruction is the one unavoidably loaded by the instruction cache after the branch.

Delayed branches have been criticized{{By whom|date=May 2012}} as a poor short-term choice in ISA design:

* Compilers typically have some difficulty finding logically independent instructions to place after the branch (the instruction after the branch is called the delay slot), so they must insert NOPs into the delay slots.
* [[Superscalar]] processors, which fetch multiple instructions per cycle and must have some form of branch prediction, do not benefit from delayed branches. The [[DEC Alpha|Alpha]] ISA left out delayed branches, as it was intended for superscalar processors.
* The most serious drawback to delayed branches is the additional control complexity they entail. If the delay slot instruction takes an exception, the processor has to be restarted on the branch, rather than on that next instruction. Exceptions then have essentially two addresses: the exception address and the restart address, and generating and distinguishing between the two correctly in all cases has been a source of bugs for later designs. A sketch of this case follows the list.
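A hypothetical illustration of the two-address problem, in the same notation as above: if the load in the delay slot faults, the exception address is the load's, but the restart address must be the branch, so that the pending branch decision is replayed rather than lost:

<syntaxhighlight lang="nasm">
BEQ r3,r4 -> target ; restart address: resuming here re-executes the branch decision
LD  adr -> r5       ; delay slot; exception address if this load page-faults
</syntaxhighlight>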