==The classic five-stage RISC pipeline==
[[Image:Fivestagespipeline.png|thumb|400px|Basic five-stage pipeline in a [[RISC]] machine (IF = [[Instruction fetch|Instruction Fetch]], ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). The vertical axis is successive instructions; the horizontal axis is time. So in the green column, the earliest instruction is in the WB stage, and the latest instruction is undergoing instruction fetch.]]

===Instruction fetch===
The instructions reside in memory that takes one cycle to read. This memory can be dedicated SRAM, or an instruction [[Cache (computing)|cache]]. The term "latency" is often used in computer science to mean the time from when an operation starts until it completes. Instruction fetch therefore has a latency of one [[clock cycle]] (if using single-cycle SRAM, or if the instruction was in the cache). During the [[Instruction fetch|Instruction Fetch]] stage, a 32-bit instruction is fetched from the instruction memory.

The [[program counter]] (PC) is a register that holds the address presented to the instruction memory. The address is presented to instruction memory at the start of a cycle. During the cycle, the instruction is read out of instruction memory, and at the same time a calculation is done to determine the next PC. The next PC is calculated by incrementing the PC by 4 and choosing whether to take that as the next PC or to take the result of a branch/jump calculation as the next PC. Note that in classic RISC, all instructions have the same length. (This is one thing that separates RISC from CISC.<ref name="Patterson1981">{{cite web |first=David |last=Patterson |title=RISC I: A Reduced Instruction Set VLSI Computer |series=ISCA '81 |date=12 May 1981 |pages=443–457 |url=https://dl.acm.org/doi/10.5555/800052.801895}}</ref>) In the original RISC designs, the size of an instruction is 4 bytes, so 4 is always added to the instruction address; PC + 4 is not used in the case of a taken branch, jump, or exception (see '''delayed branches''', below). (Some modern machines use more complicated algorithms ([[branch prediction]] and [[branch target predictor|branch target prediction]]) to guess the next instruction address.)

===Instruction decode===
Another thing that separates the first RISC machines from earlier CISC machines is that RISC has no [[microcode]].<ref name="Patterson1981" /> Once fetched from the instruction cache, the instruction bits are shifted down the pipeline, where simple combinational logic in each pipeline stage produces control signals for the datapath directly from the instruction bits. As a result, very little decoding is done in the stage traditionally called the decode stage. A consequence of this lack of decoding is that more instruction bits have to be used to specify what the instruction does. That leaves fewer bits for things like register indices.

All MIPS, SPARC, and DLX instructions have at most two register inputs. During the decode stage, the indexes of these two registers are identified within the instruction, and the indexes are presented to the register memory as the address. Thus the two registers named are read from the [[register file]]. In the MIPS design, the register file had 32 entries.
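As a minimal sketch of the two steps just described, the fetch-stage next-PC selection and the decode-stage extraction of the two source-register indexes might look as follows in C. The field positions assume a MIPS-I-style encoding (rs in bits 25..21, rt in bits 20..16); the function names are invented for this illustration and do not come from any particular implementation.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdbool.h>

/* Fetch stage: choose the next PC.  Every instruction is 4 bytes, so the
 * default is PC + 4; a taken branch or jump overrides that default.      */
static uint32_t next_pc(uint32_t pc, bool take_branch, uint32_t branch_target)
{
    return take_branch ? branch_target : pc + 4;
}

/* Decode stage: the two source-register indexes sit at fixed positions in
 * the 32-bit instruction word (MIPS-I-style: rs = bits 25..21,
 * rt = bits 20..16), so they can be presented to the register file with
 * almost no decoding.                                                     */
static void source_registers(uint32_t insn, unsigned *rs, unsigned *rt)
{
    *rs = (insn >> 21) & 0x1F;   /* first source register index  */
    *rt = (insn >> 16) & 0x1F;   /* second source register index */
}
</syntaxhighlight>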
At the same time the register file is read, instruction issue logic in this stage determines whether the pipeline is ready to execute the instruction in this stage. If not, the issue logic causes both the Instruction Fetch stage and the Decode stage to stall. On a stall cycle, the input flip-flops do not accept new bits, so no new calculations take place during that cycle.

If the instruction decoded is a branch or jump, the target address of the branch or jump is computed in parallel with reading the register file. The branch condition is computed in the following cycle (after the register file is read), and if the branch is taken or if the instruction is a jump, the PC in the first stage is assigned the branch target rather than the incremented PC that has been computed. Some architectures made use of the [[Arithmetic logic unit]] (ALU) in the Execute stage for this, at the cost of slightly decreased instruction throughput.

The decode stage ended up with quite a lot of hardware: MIPS has the possibility of branching if two registers are equal, so a 32-bit-wide AND tree runs in series after the register file read, making a very long critical path through this stage (which means fewer cycles per second). Also, the branch target computation generally required a 16-bit add and a 14-bit incrementer. Resolving the branch in the decode stage made it possible to have just a single-cycle branch mis-predict penalty. Since branches were very often taken (and thus mis-predicted), it was very important to keep this penalty low.

===Execute===
The Execute stage is where the actual computation occurs. Typically this stage consists of an ALU and a bit shifter. It may also include a multi-cycle multiplier and divider.

The ALU is responsible for performing Boolean operations (and, or, not, nand, nor, xor, xnor) and also for performing integer addition and subtraction. Besides the result, the ALU typically provides status bits such as whether or not the result was 0, or whether an overflow occurred. The bit shifter is responsible for shifts and rotations.

Instructions on these simple RISC machines can be divided into three latency classes according to the type of operation:

* Register-Register Operation (single-cycle latency): Add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute stage.
* Memory Reference (two-cycle latency): All loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.
* [[Cycles per instruction|Multi-cycle Instructions]] (many-cycle latency): Integer multiply and divide and all [[floating-point]] operations. During the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating the writeback stage and issue logic, multi-cycle instructions wrote their results to a separate set of registers.

===Memory access===
If data memory needs to be accessed, it is done in this stage.

During this stage, single-cycle latency instructions simply have their results forwarded to the next stage. This forwarding ensures that both one- and two-cycle instructions always write their results in the same stage of the pipeline, so that just one write port to the register file can be used, and it is always available.
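The Execute-stage latency classes and the Memory-stage forwarding described above can be illustrated with a short C sketch. The names are invented for the example, and the ALU is reduced to a single add; this is not the datapath of any specific machine.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdbool.h>

/* Toy data memory for the sketch (word-granular, fixed size). */
static uint32_t data_mem[1024];

/* Execute stage: register-register operations use the ALU on two register
 * values; memory references use the same ALU to add a base register and a
 * constant offset, producing the (virtual) address for the next stage.   */
static uint32_t execute(bool is_mem_ref, uint32_t a, uint32_t b, int32_t offset)
{
    return is_mem_ref ? a + (uint32_t)offset   /* effective address       */
                      : a + b;                 /* example ALU op: addition */
}

/* Memory stage: loads read the data memory; everything else just forwards
 * the execute-stage result, so one- and two-cycle instructions both reach
 * the writeback stage with their result in hand.                          */
static uint32_t memory_access(bool is_load, uint32_t ex_result)
{
    if (is_load)
        return data_mem[(ex_result / 4) % 1024];  /* word index, toy bounds */
    return ex_result;                             /* forward ALU result     */
}
</syntaxhighlight>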
For a direct-mapped and virtually tagged data cache, the simplest by far of the [[CPU cache|numerous data cache organizations]], two [[Static RAM|SRAMs]] are used, one storing data and the other storing tags (a minimal sketch of such a lookup appears at the end of this section).

===Writeback===
During this stage, both single-cycle and two-cycle instructions write their results into the register file. Note that two different stages are accessing the register file at the same time: the decode stage is reading two source registers at the same time that the writeback stage is writing a previous instruction's destination register. On real silicon, this can be a hazard (see below for more on hazards), because one of the source registers being read in decode might be the same as the destination register being written in writeback. When that happens, the same memory cells in the register file are being both read and written at the same time, and many implementations of memory cells will not operate correctly under a simultaneous read and write.
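As a minimal sketch (with invented sizes and names) of the direct-mapped, virtually tagged lookup mentioned above: the same low-order address bits index both SRAMs, the tag SRAM's output is compared against the upper bits of the virtual address, and the data SRAM's output is used only on a hit.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES  256   /* example only: 256 lines               */
#define LINE_WORDS 4     /* example only: 16-byte lines (4 words) */

/* One SRAM holds the tags (plus a valid bit), the other holds the data. */
static struct { uint32_t tag; bool valid; } tag_sram[NUM_LINES];
static uint32_t data_sram[NUM_LINES][LINE_WORDS];

/* Direct-mapped, virtually tagged lookup: the low-order address bits pick
 * one line in both SRAMs; the stored tag is compared against the upper
 * bits of the virtual address to decide hit or miss.                     */
static bool cache_lookup(uint32_t vaddr, uint32_t *data_out)
{
    uint32_t word  = (vaddr >> 2) & (LINE_WORDS - 1);  /* word within line  */
    uint32_t index = (vaddr >> 4) & (NUM_LINES - 1);   /* line index        */
    uint32_t tag   =  vaddr >> 12;                     /* remaining bits    */

    if (tag_sram[index].valid && tag_sram[index].tag == tag) {
        *data_out = data_sram[index][word];   /* hit: use data SRAM output */
        return true;
    }
    return false;                             /* miss: refill needed */
}
</syntaxhighlight>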