==Concept and motivation==
In a pipelined computer, instructions flow through the [[central processing unit]] (CPU) in stages. For example, a design might have one stage for each step of the [[von Neumann architecture|von Neumann cycle]]: fetch the instruction, fetch the operands, execute the instruction, and write the results. A pipelined computer usually has "pipeline registers" after each stage. These store information from the instruction and calculations so that the [[logic gate]]s of the next stage can do the next step.

This arrangement lets the CPU complete an instruction on each clock cycle. It is common for even-numbered stages to operate on one edge of the square-wave clock, while odd-numbered stages operate on the other edge. This allows more CPU [[throughput]] than a multicycle computer at a given [[clock rate]], but may increase the [[Latency (engineering)|latency]] of each individual instruction because of the overhead added by the pipelining itself. Also, even though the electronic logic has a fixed maximum speed, a pipelined computer can be made faster or slower by varying the number of stages in the pipeline. With more stages, each stage does less work, so each stage has less delay from its [[logic gate]]s and can run at a higher clock rate.
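
The trade-off between clock rate and latency can be illustrated with a rough numerical sketch. It assumes that the logic divides evenly among the stages and that every stage pays the same fixed pipeline-register overhead; the function name <code>pipeline_timing</code> and the delay figures are hypothetical and do not describe any particular processor.

```python
def pipeline_timing(logic_delay_ns, stages, register_overhead_ns):
    """Return (clock period, single-instruction latency) in nanoseconds.

    Assumes the logic divides evenly among the stages and that every
    stage pays the same fixed pipeline-register overhead.
    """
    period = logic_delay_ns / stages + register_overhead_ns
    latency = period * stages
    return period, latency


# Hypothetical figures: 10 ns of logic, 0.5 ns per pipeline register.
for stages in (1, 2, 5, 10):
    period, latency = pipeline_timing(10.0, stages, 0.5)
    print(f"{stages:2d} stages: period {period:5.2f} ns, "
          f"latency {latency:5.2f} ns, "
          f"peak throughput {1.0 / period:4.2f} instructions per ns")
```

Under these assumptions, five stages give a 2.5 ns clock period but a 12.5 ns latency per instruction, compared with 10.5 ns for both in the single-stage case: the clock runs faster while each individual instruction takes longer to finish.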

A pipelined computer is often the most economical design when cost is measured as logic gates per instruction per second. At each instant, an instruction is in only one pipeline stage, and on average, a pipeline stage is less costly than a multicycle computer. Also, when the design is made well, most of the pipelined computer's logic is in use most of the time. In contrast, out-of-order computers usually have large amounts of idle logic at any given instant. Similar calculations usually show that a pipelined computer uses less energy per instruction.

However, a pipelined computer is usually more complex and more costly than a comparable multicycle computer. It typically has more logic gates, more registers, and a more complex control unit. Similarly, it might use more total energy while using less energy per instruction. Out-of-order CPUs can usually execute more instructions per second because they can work on several instructions at once.

In a pipelined computer, the control unit arranges for the flow to start, continue, and stop as a program commands. The instruction data is usually passed in pipeline registers from one stage to the next, with a somewhat separated piece of control logic for each stage. The control unit also ensures that the instruction in each stage does not harm the operation of instructions in other stages. For example, if two stages must use the same piece of data, the control logic ensures that the uses are done in the correct sequence.
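
The sequencing that the control logic enforces can be sketched in software. The following is a simplified model, not a description of any real control unit: it handles only the case where an instruction reads a register that an earlier instruction writes, and it assumes that a result becomes readable a fixed number of cycles (<code>min_distance</code>) after its producer issues. The names <code>Instr</code>, <code>schedule</code> and <code>min_distance</code> are hypothetical.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination register, a tuple of sources.
Instr = namedtuple("Instr", "name dest sources")


def schedule(program, min_distance=3):
    """Return the issue cycle of each instruction, delaying readers as needed.

    Assumes in-order issue and that a written register may be read only
    by instructions issued at least `min_distance` cycles after the writer.
    """
    ready_at = {}       # register -> earliest cycle at which a reader may issue
    issue_cycles = []
    cycle = 0
    for instr in program:
        # Wait until every source register has been written.
        earliest = max([cycle] + [ready_at.get(r, 0) for r in instr.sources])
        issue_cycles.append(earliest)
        if instr.dest is not None:
            ready_at[instr.dest] = earliest + min_distance
        cycle = earliest + 1   # the next instruction issues no earlier than this
    return issue_cycles


program = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("sub", "r4", ("r1", "r5")),   # reads r1, which add writes
    Instr("or", "r6", ("r2", "r3")),
]
print(schedule(program))
```

In this model the <code>sub</code> instruction is held back until cycle 3 so that it reads <code>r1</code> only after <code>add</code> has written it, and the instruction behind it is delayed as well.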

When operating efficiently, a pipelined computer will have an instruction in each stage. It is then working on all of those instructions at the same time. It can finish about one instruction for each cycle of its clock. But when a program switches to a different sequence of instructions, the pipeline sometimes must discard the data in process and restart. Discarding the work in progress is often called a "flush", and the cycles in which the pipeline then does no useful work are called a "stall."
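
The effect on running time can be estimated with a short calculation. This sketch assumes an idealized pipeline in which the first instruction needs one cycle per stage and each later instruction completes one cycle after its predecessor, and it charges a fixed number of lost cycles each time the pipeline discards its work and restarts. The function <code>total_cycles</code> and its parameters are hypothetical.

```python
def total_cycles(instructions, stages, restarts=0, restart_penalty=0):
    """Cycles to run `instructions` through a `stages`-deep pipeline.

    The first instruction takes `stages` cycles to pass through the
    pipeline; after that, one instruction completes per cycle. Each
    restart adds `restart_penalty` lost cycles (a simplifying assumption).
    """
    if instructions == 0:
        return 0
    return stages + (instructions - 1) + restarts * restart_penalty


ideal = total_cycles(1000, 5)
with_restarts = total_cycles(1000, 5, restarts=100, restart_penalty=2)
print(ideal, ideal / 1000)                  # cycles and cycles per instruction
print(with_restarts, with_restarts / 1000)
```

With these illustrative figures, 1000 instructions take 1004 cycles in the ideal case, which is close to one instruction per cycle, and 1204 cycles when 100 restarts each cost two cycles.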

Much of the design of a pipelined computer prevents interference between the stages and reduces stalls.

==={{Anchor|SUPER}}Number of steps===
The number of dependent steps varies with the machine architecture. For example:
* The 1956–61 [[IBM Stretch]] project proposed the terms Fetch, Decode, and Execute that have become common.
* The [[classic RISC pipeline]] comprises:
*# Instruction fetch
*# Instruction decode and register fetch
*# Execute
*# Memory access
*# Register write back
* The [[Atmel AVR]] and the [[PIC microcontroller]] each have a two-stage pipeline.
* Many designs include pipelines as long as 7, 10 and even 20 stages (as in the [[Intel]] [[Pentium 4]]).
* The later "Prescott" and "Cedar Mill" [[NetBurst]] cores from Intel, used in the last Pentium&nbsp;4 models and their [[Pentium D]] and [[Xeon]] derivatives, have a long 31-stage pipeline.
* The Xelerated X10q Network Processor has a pipeline more than a thousand stages long, although in this case 200 of these stages represent independent CPUs with individually programmed instructions. The remaining stages are used to coordinate accesses to memory and on-chip function units.<ref>{{cite journal|last1=Glaskowsky|first1=Peter|title=Xelerated's Xtraordinary NPU — World's First 40Gb/s Packet Processor Has 200 CPUs|journal=Microprocessor Report|date=Aug 18, 2003|volume=18|issue=8|pages=12–14|url=http://www.linleygroup.com/mpr/h/2003/0818/173301.html|access-date=20 March 2017}}</ref><ref>{{cite web | url=https://www.eetimes.com/xelerated-brings-programmable-40-gbits-s-technology-to-the-mainstream-ethernet/# | title=Xelerated Brings Programmable 40 Gbits/S Technology to the Mainstream Ethernet | date=31 May 2003 }}</ref>{{More citations needed|date=October 2020}}
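
How instructions overlap in the five-stage classic RISC pipeline listed above can be shown by printing which stage each instruction occupies in each clock cycle. This is a minimal sketch that assumes no stalls; the stage abbreviations are the conventional ones (IF, ID, EX, MEM, WB), and the function name <code>occupancy</code> is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]   # the five classic RISC stages


def occupancy(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is its stage in cycle c, or ''."""
    depth = len(stages)
    total = num_instructions + depth - 1    # cycles until the last write back
    rows = []
    for i in range(num_instructions):
        row = [""] * total
        for s, name in enumerate(stages):
            row[i + s] = name               # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


for i, row in enumerate(occupancy(3)):
    print(f"I{i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Each row is shifted one cycle to the right of the row above it, so three instructions finish in seven cycles rather than the fifteen they would need if each had to complete before the next began.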

As the pipeline is made "deeper" (with a greater number of dependent steps), a given step can be implemented with simpler circuitry, which may let the processor clock run faster.<ref name=Guardian>{{cite book |url=https://books.google.com/books?id=Nibfj2aXwLYC&q=deep%20pipeline%20processor&pg=PA94 |title=Modern Processor Design |author=John Paul Shen, Mikko H. Lipasti |year=2004 |publisher=[[McGraw-Hill Professional]]|isbn=9780070570641 }}</ref> Such pipelines may be called ''superpipelines.''<ref>{{cite book |url=https://books.google.com/books?id=xgtTAAAAMAAJ&q=%22a+superpipeline+is+essentially+a+deep+instruction+pipeline+with+many+stages%22 |title=Design of Computers and Other Complex Digital Devices |author=Sunggu Lee |year=2000 |publisher=[[Prentice Hall]]|isbn=9780130402677 }}</ref>

A processor is said to be ''fully pipelined'' if it can fetch an instruction on every cycle. Thus, if some instructions or conditions require delays that inhibit fetching new instructions, the processor is not fully pipelined.