
== Basic concept ==
=== Background ===
Out-of-order execution is best understood against the baseline of in-order execution. In a pipelined in-order processor, the execution of successive instructions overlaps, with each instruction requiring multiple [[clock cycle]]s to complete. As a consequence, the result of one instruction may not yet be available when a following instruction needs it. An in-order processor must still keep track of these dependencies, but its approach is simple: it stalls every time a needed result is not ready. An out-of-order processor uses much more sophisticated dependency-tracking techniques, as described below.

=== In-order processors ===
In earlier processors, the processing of instructions is performed in an [[instruction cycle]] normally consisting of the following steps:
# [[Instruction (computer science)|Instruction]] fetch.
# If input [[operand]]s are available (in processor registers, for instance), the instruction is dispatched to the appropriate [[functional unit]]. If one or more operands are unavailable during the current clock cycle (generally because they must be fetched from [[Computer memory|memory]]), the processor stalls until they are available.
# The instruction is executed by the appropriate functional unit.
# The functional unit writes the results back to the [[register file]].

Often, an in-order processor has a [[bit vector]] recording which registers will be written to by instructions still in the pipeline.<ref>{{cite web |url=https://pages.cs.wisc.edu/~swilson/gem5-docs/minor.html#sb |title=Inside the Minor CPU model: Scoreboard |author=<!--Not stated--> |date=2017-06-09 |access-date=2023-01-09}}</ref> If any input operand of an instruction has the corresponding bit set in this vector, the instruction stalls. Essentially, the vector performs a greatly simplified role of protecting against register hazards. By comparison, out-of-order execution uses 2D matrices for hazard avoidance, whereas in-order execution needs only this 1D vector.
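
The bit-vector check can be illustrated with a minimal sketch in Python. It is not taken from any real processor or simulator: the function name <code>run_in_order</code>, the <code>(dest, srcs)</code> instruction encoding, and the latencies are assumptions chosen for illustration, and only input operands are checked, as in the description above.

```python
def run_in_order(program, latency):
    """Return the cycle at which each instruction issues (toy model).

    program: list of (dest, srcs) tuples of register numbers (hypothetical encoding).
    latency: cycles until each instruction's result is written back.
    """
    pending = 0      # bit vector: bit r set => register r has a write in flight
    ready_at = {}    # register -> cycle at which its in-flight write completes
    cycle = 0
    issue_cycles = []
    for (dest, srcs), lat in zip(program, latency):
        while True:
            # Clear the bits of writes that have completed by this cycle.
            for reg, t in list(ready_at.items()):
                if t <= cycle:
                    pending &= ~(1 << reg)
                    del ready_at[reg]
            # Build a mask of the input operands (simplification: the
            # destination register is not checked in this sketch).
            needed = 0
            for r in srcs:
                needed |= 1 << r
            if (pending & needed) == 0:
                break
            cycle += 1   # stall: this and all younger instructions wait
        issue_cycles.append(cycle)
        pending |= 1 << dest
        ready_at[dest] = cycle + lat
        cycle += 1
    return issue_cycles


# r1 <- load [r0] (3 cycles); r2 <- f(r1); r3 <- g(r0)
program = [(1, [0]), (2, [1]), (3, [0])]
latency = [3, 1, 1]
print(run_in_order(program, latency))   # [0, 3, 4]
```

In this example the third instruction does not depend on the load, yet it issues only at cycle 4 because it is stuck behind the stalled second instruction.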

=== Out-of-order processors ===
This new paradigm breaks up the processing of instructions into these steps:<ref>{{Cite journal |last=González |first=Antonio |last2=Latorre |first2=Fernando |last3=Magklis |first3=Grigorios |date=2011 |title=Processor Microarchitecture |url=https://link.springer.com/book/10.1007/978-3-031-01729-2 |journal=Synthesis Lectures on Computer Architecture |language=en |doi=10.1007/978-3-031-01729-2 |issn=1935-3235}}</ref>
# Instruction fetch.
# Instruction decoding.
# Instruction renaming.
# Instruction dispatch to an instruction queue (also called instruction buffer or [[reservation station]]s).
# The instruction waits in the queue until its input operands are available. The instruction can leave the queue before older instructions.
# The instruction is issued to the appropriate functional unit and executed by that unit.
# The results are queued.
# The result is written back to the register file only after all older instructions have had their results written back. This is called the graduation or retire stage.

The key concept of out-of-order processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the processor avoids the stall that occurs in step 2 of the in-order processor when the instruction is not completely ready to be processed due to missing data.

Out-of-order processors fill these ''slots'' in time with other instructions that ''are'' ready, then reorder the results at the end to make it appear that the instructions were processed as normal. The order in which instructions appear in the original computer code is known as ''program order''; inside the processor they are handled in ''data order'', the order in which their data becomes available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output.
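
The difference between issuing in data order and retiring in program order can be sketched with a toy model in Python. This is an illustrative simplification, not a description of any real design: the name <code>run_out_of_order</code> and the instruction encoding are hypothetical, all instructions are assumed to be in the queue from cycle 0, at most one instruction issues and one retires per cycle, and renaming is approximated by linking each source register to the most recent older instruction that writes it.

```python
def run_out_of_order(program, latency):
    """Toy model: issue when operands are ready, retire in program order.

    program: list of (dest, srcs) tuples of register numbers (hypothetical encoding).
    latency: cycles until each instruction's result is available.
    Returns (issue, retire): cycle numbers indexed by program order.
    """
    n = len(program)
    # Stand-in for renaming: for each instruction, find the older
    # instructions that produce its source registers.
    producers = []
    last_writer = {}
    for i, (dest, srcs) in enumerate(program):
        producers.append([last_writer[r] for r in srcs if r in last_writer])
        last_writer[dest] = i

    issue = [None] * n    # cycle at which each instruction issues
    done = [None] * n     # cycle at which its result is available
    retire = [None] * n   # cycle at which it retires
    next_retire = 0
    cycle = 0
    while next_retire < n:
        # Issue the oldest waiting instruction whose producers have finished;
        # it may be younger than instructions that are still waiting.
        for i in range(n):
            if issue[i] is None and all(
                done[p] is not None and done[p] <= cycle for p in producers[i]
            ):
                issue[i] = cycle
                done[i] = cycle + latency[i]
                break
        # Retire strictly in program order, once the result is available.
        if done[next_retire] is not None and done[next_retire] <= cycle:
            retire[next_retire] = cycle
            next_retire += 1
        cycle += 1
    return issue, retire


# r1 <- load [r0] (3 cycles); r2 <- f(r1); r3 <- g(r0)
program = [(1, [0]), (2, [1]), (3, [0])]
latency = [3, 1, 1]
print(run_out_of_order(program, latency))   # ([0, 3, 1], [3, 4, 5])
```

Here the third instruction issues at cycle 1, ahead of the older second instruction that is still waiting for the load, but it retires last, so the register file is updated in program order.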

The benefit of out-of-order processing grows as the [[instruction pipeline]] deepens and the speed difference between [[main memory]] (or [[cache memory]]) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could in theory have processed a large number of instructions.
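
A rough sense of scale can be obtained with simple arithmetic, sketched below in Python. The figures used (an 80&nbsp;ns memory access, a 3&nbsp;GHz clock, and an issue width of 4) are hypothetical values chosen for illustration, not measurements of any particular machine, and <code>lost_issue_slots</code> is an assumed helper name.

```python
def lost_issue_slots(memory_latency_ns, clock_ghz, issue_width=1):
    """Issue slots that pass while a stalled processor waits on memory.

    All inputs are illustrative assumptions, not measured values.
    """
    # Nanoseconds times cycles per nanosecond gives cycles spent waiting.
    stall_cycles = memory_latency_ns * clock_ghz
    return stall_cycles * issue_width


print(lost_issue_slots(80, 3))      # 240
print(lost_issue_slots(80, 3, 4))   # 960
```

Under these assumptions, a single access to main memory leaves hundreds of issue slots unused on an in-order processor, which is the gap that out-of-order execution tries to fill with independent instructions.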