==Limitations==
The performance improvement available from superscalar techniques is limited by three key factors:

* The degree of intrinsic parallelism in the instruction stream (instructions that depend on one another's results, or that require the same computational resources from the CPU, cannot be executed together)
* The complexity and time cost of dependency checking logic and [[register renaming]] circuitry
* The processing of branch instructions

Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases, instructions do not depend on each other and can be executed simultaneously. In other cases, they are interdependent: one instruction affects either the resources or the results of another. The instructions <code>a = b + c; d = e + f</code> can run in parallel because neither result depends on the other calculation. However, the instructions <code>a = b + c; b = e + f</code> might not be able to run in parallel: the second instruction overwrites <code>b</code>, which the first instruction reads, so the outcome depends on the order in which the instructions complete as they move through the execution units.
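The two instruction pairs above can be checked mechanically. The following is a minimal sketch, not a description of any real CPU's logic: the <code>Instr</code> type and the <code>hazards</code> function are hypothetical names introduced for illustration, and the labels RAW, WAR and WAW (read-after-write, write-after-read, write-after-write) are the conventional hazard names rather than terms used in this section.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: dest = srcs[0] op srcs[1]."""
    dest: str
    srcs: Tuple[str, ...]


def hazards(first: Instr, second: Instr) -> set:
    """Return the hazards between two instructions given in program order."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # second reads a value that first produces
    if second.dest in first.srcs:
        found.add("WAR")  # second overwrites a value that first still reads
    if first.dest == second.dest:
        found.add("WAW")  # both write the same location
    return found


i1 = Instr("a", ("b", "c"))  # a = b + c
i2 = Instr("d", ("e", "f"))  # d = e + f
i3 = Instr("b", ("e", "f"))  # b = e + f

print(hazards(i1, i2))  # set()   -> the pair can run in parallel
print(hazards(i1, i3))  # {'WAR'} -> the second must not overwrite b too early
```

The second pair shows only a write-after-read conflict, which is the kind of dependency that register renaming is meant to mitigate.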

Even if an instruction stream contains no inter-instruction dependencies, a superscalar CPU must still check for them, because nothing guarantees their absence and an undetected dependency would produce incorrect results.

No matter how advanced the [[semiconductor device fabrication|semiconductor process]] or how fast the switching speed, the need for dependency checking places a practical limit on how many instructions can be dispatched simultaneously. While process advances will allow ever greater numbers of execution units (e.g. ALUs), the burden of checking instruction dependencies grows rapidly, as does the complexity of the register renaming circuitry needed to mitigate some dependencies. Collectively, the [[CPU power dissipation|power consumption]], complexity and gate delay costs limit the achievable superscalar speedup.
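To give a rough sense of how quickly the checking burden grows, the sketch below counts instruction pairs under a deliberately naive assumption: every instruction in a dispatch group is compared against every earlier instruction in the same group. The function name <code>pairwise_checks</code> is hypothetical, and real dependency-checking hardware is organized differently, so the figures are illustrative only.

```python
def pairwise_checks(width: int) -> int:
    """Number of instruction pairs to compare in a dispatch group of `width`.

    Assumes a naive all-pairs scheme: width * (width - 1) / 2 comparisons.
    """
    return width * (width - 1) // 2


for width in (2, 4, 8, 16):
    print(width, pairwise_checks(width))
# 2 1
# 4 6
# 8 28
# 16 120
```

Under this assumption, doubling the dispatch width roughly quadruples the number of pairs that must be examined.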

However, even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, an instruction stream that itself contains many dependencies would still limit the possible speedup. Thus the degree of intrinsic parallelism in the code stream forms a second limitation.
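This second limitation can be illustrated with an idealized model. The sketch below assumes unlimited execution units, a one-cycle latency for every instruction, and perfect register renaming, so only true data dependencies remain; the function <code>ideal_cycles</code> and the tuple-based program format are hypothetical and chosen for illustration.

```python
def ideal_cycles(program):
    """Minimum cycles to run `program` on an idealized machine.

    `program` is a list of (dest, srcs) tuples in program order.
    Assumptions: unlimited execution units, one-cycle latency, and only
    true dependencies (a source must wait for its most recent writer).
    """
    ready = {}  # register name -> cycle after which its value is available
    depth = 0
    for dest, srcs in program:
        start = max((ready.get(s, 0) for s in srcs), default=0)
        finish = start + 1
        ready[dest] = finish
        depth = max(depth, finish)
    return depth


independent = [("a", ("b", "c")), ("d", ("e", "f")),
               ("g", ("h", "i")), ("j", ("k", "l"))]
chained = [("a", ("b", "c")), ("d", ("a", "e")),
           ("f", ("d", "g")), ("h", ("f", "i"))]

print(ideal_cycles(independent))  # 1: all four instructions can issue together
print(ideal_cycles(chained))      # 4: each instruction waits for the previous one
```

Both programs contain four instructions, but the chained one gains nothing from additional execution units, because its length is set by the dependency chain rather than by the hardware.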