== Problem approach ==

Programs are composed of instructions that operate on values. The instructions must name these values in order to distinguish them from one another. A typical instruction might say: add <math>x</math> and <math>y</math> and put the result in <math>z</math>. In this instruction, <math>x</math>, <math>y</math> and <math>z</math> are the names of storage locations.

It is common for the values being manipulated to be used several times in succession. Register machines take advantage of this by providing a number of [[processor register]]s, high-speed storage locations that hold these values. Instructions that use registers to hold their intermediate values run much faster than those that access [[main memory]]. This performance advantage is a key element of the [[RISC]] design philosophy, which uses registers for all of its arithmetic and logical instructions. The collection of registers in a particular design is known as its ''register file''.

Individual registers in the file are referred to by number in the [[machine code]], and encoding such a number requires several bits. For instance, the [[Zilog Z80]] had eight general-purpose registers in its file; selecting one of eight values requires three bits, as 2<sup>3</sup> = 8. More registers in the file generally means better performance, as more temporary values can be kept in registers, avoiding the expensive operations of saving to or loading from memory. More modern processors and those with larger instruction words therefore tend to use more registers when possible. For example, the [[IA-32]] instruction set architecture has 8 general-purpose registers, [[x86-64]] has 16, many [[Reduced instruction set computer|RISC]]s have 32, and [[IA-64]] has 128.

The advantage of a larger register file is offset by the need to use more bits to encode each register number. For instance, a system using 32-bit instructions might want each instruction to name three registers, so that operations of the form <math>z = x + y</math> can be encoded. If the register file contains 32 entries, each reference requires 5 bits, and the three register fields take up 15 bits, leaving 17 to encode the operation and other information. Expanding the register file to 64 entries would require 6 bits per reference, or 18 bits in total. While the larger file may improve performance, it leaves fewer bits for encoding the rest of the instruction. This leads to an effort to balance the size of the register file against the number and complexity of possible instructions.

===Out-of-order===

Early computers often worked in lock-step with their main memory, which reduced the advantages of large register files. A common design in the [[minicomputer]] market of the 1960s was to implement the registers physically in main memory; the performance advantage was then simply that an instruction could refer to such a location directly rather than spending an extra byte or two on a complete memory address. This made the instructions smaller, and thus faster to read. This sort of design, which maximized performance by carefully tuning the instruction set for minimal size, was common until the 1980s. An example of this approach is the [[MOS 6502]], which had only a single register, the [[Accumulator (computing)|accumulator]], along with a special "zero page" addressing mode that treated the first 256 bytes of memory as if they were registers. Placing frequently used values in the zero page meant the instructions that accessed them were only two bytes long instead of three, greatly improving performance by reducing the number of bytes that had to be read.
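As a purely illustrative sketch (the addresses here are arbitrary and the code is not from any particular program), the same add-and-store operation written for the 6502 with ordinary absolute addressing and with zero-page addressing might look like this; each zero-page instruction is one byte shorter than its absolute counterpart:

<syntaxhighlight lang="asm" line>
LDA $1234   ; load from an absolute 16-bit address: 3-byte instruction
CLC         ; clear carry before the add
ADC #$02    ; add 2 (immediate)
STA $1236   ; store to an absolute address: 3-byte instruction

LDA $10     ; load from the zero page: 2-byte instruction
CLC
ADC #$02    ; add 2
STA $12     ; store to the zero page: 2-byte instruction
</syntaxhighlight>

In effect, the 8-bit zero-page address plays the role of a register number.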
The widespread introduction of [[dynamic RAM]] in the 1970s changed this approach. Over time the performance of [[central processing unit]]s (CPUs) increased relative to the memory they were attached to, and it was no longer reasonable to use main memory as registers. This led to increasingly large register files internal to the CPU, used to avoid referring to memory wherever possible. However, it is not possible in practice to avoid accessing memory entirely, and as the speed difference grew, every such access became more and more expensive in terms of the number of instructions that could have been performed had the value been in a register.

Different instructions may take different amounts of time; for example, a processor may be able to execute hundreds of register-to-register instructions while a single load from main memory is in progress. A key advance in improving performance is to allow those fast instructions to be performed while others are waiting for data. This means the instructions are no longer completed in the order they are specified in the machine code; they are instead performed [[out-of-order execution|out-of-order]]. Consider this piece of code running on an out-of-order CPU:

<syntaxhighlight lang="asm" line>
r1 ← m[1024]  ;read the value in memory location 1024
r1 ← r1 + 2   ;add two to the value
m[1032] ← r1  ;save the result to location 1032
r1 ← m[2048]  ;read the value in location 2048
r1 ← r1 + 4   ;add four
m[2056] ← r1  ;save it to location 2056
</syntaxhighlight>

The instructions in the final three lines are independent of the first three instructions, but the processor cannot finish <syntaxhighlight lang="asm" inline>r1 ← m[2048]</syntaxhighlight> until the preceding <syntaxhighlight lang="asm" inline>m[1032] ← r1</syntaxhighlight> is complete: doing so would overwrite r1 before the store had used it, so location 1032 would receive the value read from 2048 rather than the intended result. If another register is available, this restriction can be eliminated by choosing different registers for the first three and the second three instructions:

<syntaxhighlight lang="asm" line>
r1 ← m[1024]
r1 ← r1 + 2
m[1032] ← r1
r2 ← m[2048]
r2 ← r2 + 4
m[2056] ← r2
</syntaxhighlight>

Now the last three instructions can be executed in parallel with the first three. The program runs faster than before because the data dependency caused by unnecessarily reusing the same register in both sequences has been eliminated.

A compiler can detect independent instruction sequences and, if registers are available, choose different ones during [[register allocation]] in the [[Code generation (compiler)|code generation]] process. However, to speed up code generated by compilers that do not perform that optimization, or code for which there were not enough registers to do so, many high-performance CPUs provide a register file with more registers than the instruction set specifies and, in hardware, rename references to instruction-set-defined registers so that they refer to registers in that larger file. The original instruction sequence, using only r1, then behaves as if it were:

<syntaxhighlight lang="asm" line>
rA ← m[1024]
rA ← rA + 2
m[1032] ← rA
rB ← m[2048]
rB ← rB + 4
m[2056] ← rB
</syntaxhighlight>

with register r1 "renamed" to the internal register rA for the first three instructions and to the internal register rB for the second three instructions.
This removes the false data dependency, allowing the first three instructions to be executed in parallel with the second three instructions.
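The bookkeeping behind this renaming can be pictured as a mapping table consulted as each instruction is decoded: every write to an instruction-set register is assigned a fresh internal register, and every read is redirected to whichever internal register currently holds that register's most recent value. The following C program is only an illustrative sketch of that idea, not a description of any particular processor; the register counts, the names, and the absence of any mechanism for freeing internal registers are all simplifications. It prints a renamed form of the example sequence above:

<syntaxhighlight lang="c" line>
#include <stdio.h>

#define NUM_ARCH_REGS 4          /* architectural registers r0..r3 (illustrative) */

static int map[NUM_ARCH_REGS];   /* current architectural -> internal mapping     */
static int next_phys;            /* next unused internal register                 */

/* A read refers to whichever internal register currently holds the value. */
static int read_reg(int arch)
{
    return map[arch];
}

/* Each write is given a fresh internal register; later reads of the same
   architectural register see the new mapping, earlier ones do not.        */
static int write_reg(int arch)
{
    map[arch] = next_phys++;     /* this sketch never frees internal registers */
    return map[arch];
}

int main(void)
{
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        map[i] = i;              /* start with an identity mapping */
    next_phys = NUM_ARCH_REGS;

    int src;

    /* r1 <- m[1024] */
    printf("p%d <- m[1024]\n", write_reg(1));

    /* r1 <- r1 + 2: rename the source using the old mapping first,
       then give the destination a fresh internal register.          */
    src = read_reg(1);
    printf("p%d <- p%d + 2\n", write_reg(1), src);

    /* m[1032] <- r1 */
    printf("m[1032] <- p%d\n", read_reg(1));

    /* r1 <- m[2048]: the fresh internal register removes the false
       dependency on the store above.                                */
    printf("p%d <- m[2048]\n", write_reg(1));

    /* r1 <- r1 + 4 */
    src = read_reg(1);
    printf("p%d <- p%d + 4\n", write_reg(1), src);

    /* m[2056] <- r1 */
    printf("m[2056] <- p%d\n", read_reg(1));

    return 0;
}
</syntaxhighlight>

Real processors implement this table in hardware, recycle internal registers once their values can no longer be needed, and must be able to restore an earlier mapping when a branch misprediction or exception forces instructions to be discarded.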