== Specific techniques ==

=== Loop optimizations ===
{{Main|Loop optimization}}
''Loop optimization'' acts on the statements that make up a loop, such as a ''for'' loop, for example [[loop-invariant code motion]]. Loop optimizations can have a significant impact because many programs spend a large percentage of their time inside loops.<ref name="aho-sethi-ullman" />{{rp|page=596}}

Some optimization techniques primarily designed to operate on loops include:

;[[Induction variable analysis]]: Roughly, if a variable in a loop is a simple linear function of the index variable, such as <code>j := 4*i + 1</code>, it can be updated appropriately each time the loop variable is changed. This is a [[strength reduction]] and also may allow the index variable's definitions to become [[dead code]].<ref name="aho-sethi-ullman" />{{rp|pages=596–598}} This information is also useful for [[bounds-checking elimination]] and [[dependence analysis]], among other things.
;[[Loop fission]] or loop distribution: Loop fission attempts to break a loop into multiple loops over the same index range, with each new loop taking only a part of the original loop's body. This can improve [[locality of reference]] to both the data being accessed within the loop and the code in the loop's body.
;[[Loop fusion]] or loop combining or loop ramming or loop jamming: Another technique that attempts to reduce loop overhead. When two adjacent loops would iterate the same number of times, regardless of whether that number is known at compile time, their bodies can be combined as long as they do not refer to each other's data.
;[[Loop inversion]]: This technique changes a standard ''while'' loop into a ''do/while'' (also known as ''repeat/until'') loop wrapped in an ''if'' conditional, reducing the number of jumps by two for cases when the loop is executed. Doing so duplicates the condition check (increasing the size of the code) but is more efficient because jumps usually cause a [[pipeline stall]]. Additionally, if the initial condition is known at compile time and is known to be [[Side effect (computer science)|side-effect]]-free, the ''if'' guard can be skipped.
;[[Loop interchange]]: These optimizations exchange inner loops with outer loops. When the loop variables index into an array, such a transformation can improve locality of reference, depending on the array's layout.
;[[Loop-invariant code motion]]: If a quantity is computed inside a loop during every iteration, and its value is the same for each iteration, it can vastly improve efficiency to hoist it outside the loop and compute its value just once before the loop begins (illustrated below).<ref name="aho-sethi-ullman" />{{rp|page=596}} This is particularly important with the address-calculation expressions generated by loops over arrays. For correct implementation, this technique must be used with [[loop inversion]], because not all code is safe to be hoisted outside the loop.
;[[Loop nest optimization]]: Some pervasive algorithms such as matrix multiplication have very poor cache behavior and excessive memory accesses. Loop nest optimization increases the number of cache hits by operating over small blocks and by using a loop interchange.
;Loop reversal: Loop reversal reverses the order in which values are assigned to the index variable. This is a subtle optimization that can help eliminate [[dependence analysis|dependencies]] and thus enable other optimizations. Furthermore, on some architectures, loop reversal contributes to smaller code: when the loop index is being decremented, the condition that must be met to exit the loop is a comparison with zero. This is often a special, parameter-less instruction, unlike a comparison with an arbitrary number, which needs that number as an operand, so loop reversal saves the bytes needed to store it. Additionally, if the comparison value exceeds the platform's word size, then in standard loop order multiple instructions would be needed to evaluate the comparison, which is not the case with loop reversal.
;[[Loop unrolling]]: Unrolling duplicates the body of the loop multiple times to decrease the number of times the loop condition is tested and the number of jumps, which hurt performance by impairing the instruction pipeline. This is a "fewer jumps" optimization. Completely unrolling a loop eliminates all overhead but requires that the number of iterations be known at compile time.
;[[Loop splitting]]: Loop splitting attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops that have the same bodies but iterate over different contiguous portions of the index range. A useful special case is ''[[loop peeling]]'', which can simplify a loop with a problematic first iteration by performing that iteration separately before entering the loop.
;[[Loop unswitching]]: Unswitching moves a conditional from inside a loop to outside the loop by duplicating the loop's body inside each of the if and else clauses of the conditional.
;[[Software pipelining]]: The loop is restructured in such a way that work done in an iteration is split into several parts and done over several iterations. In a tight loop, this technique hides the latency between loading and using values.
;[[Automatic parallelization]]: A loop is converted into multi-threaded or vectorized (or even both) code to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine, including multi-core machines.
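As a rough illustration of loop-invariant code motion and induction-variable strength reduction from the list above, the following hand-written C sketch shows a loop before and after the transformations. The function and variable names are invented for illustration; a real compiler applies these rewrites to its intermediate representation rather than to source code.

<syntaxhighlight lang="c">
/* Before: "stride * scale + offset" is recomputed on every iteration,
   and a[i] is re-derived from the index i each time. */
void scale_before(int *a, int n, int stride, int scale, int offset) {
    for (int i = 0; i < n; i++) {
        a[i] = a[i] * (stride * scale + offset);
    }
}

/* After: the loop-invariant subexpression is hoisted out of the loop
   (loop-invariant code motion), and the indexed access is replaced by
   a pointer that is advanced each iteration (strength reduction). */
void scale_after(int *a, int n, int stride, int scale, int offset) {
    int factor = stride * scale + offset;   /* hoisted: computed once */
    int *p = a;                             /* pointer induction variable */
    for (int i = 0; i < n; i++) {
        *p = *p * factor;
        p++;                                /* advances instead of recomputing &a[i] */
    }
}
</syntaxhighlight>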
=== Prescient store optimizations ===
{{anchor|Prescient store}}Prescient store optimizations allow store operations to occur earlier than would otherwise be permitted in the context of [[Threads (computer science)|threads]] and locks. The process needs some way of knowing ahead of time what value will be stored by the assignment that it should have followed. The purpose of this relaxation is to allow compiler optimization to perform certain kinds of code rearrangements that preserve the semantics of properly synchronized programs.<ref>{{cite book|chapter-url=http://titanium.cs.berkeley.edu/doc/java-langspec-1.0/17.doc.html#45376|chapter=17 Threads and Locks|at=17.8 Prescient Store Actions|title=The Java Language Specification|url=http://titanium.cs.berkeley.edu/doc/java-langspec-1.0/index.html|author1=James Gosling|author1-link=James Gosling|author2=Bill Joy|author2-link=Bill Joy|author3=Guy Steele|author3-link=Guy L. Steele Jr.|edition=1.0}}</ref>

=== Data-flow optimizations ===
[[Dataflow|Data-flow]] optimizations, based on [[data-flow analysis]], primarily depend on how certain properties of data are propagated by control edges in the [[control-flow graph]]. Some of these include:

;[[Common subexpression elimination]]: In the expression <code>(a + b) - (a + b)/4</code>, "common subexpression" refers to the duplicated <code>(a + b)</code>. Compilers implementing this technique realize that <code>(a + b)</code> will not change, and so calculate its value only once (illustrated below).<ref name="aho-sethi-ullman" />{{rp|pages=592–594}}
;[[Constant folding|Constant folding and propagation]]: Replacing expressions consisting of constants (e.g., <code>3 + 5</code>) with their final value (<code>8</code>) at compile time, rather than doing the calculation at run time.<ref name="MuchnickAssociates1997">{{cite book|author1=Muchnick|first=Steven|url=https://archive.org/details/advancedcompiler00much|title=Advanced Compiler Design Implementation|author2=Muchnick and Associates|date=15 August 1997|publisher=Morgan Kaufmann|isbn=978-1-55860-320-2|pages=[https://archive.org/details/advancedcompiler00much/page/329 329]–|quote=constant folding.|url-access=registration}}</ref> Used in most modern languages.
;[[Induction variable recognition and elimination]]: See the discussion above about ''induction variable analysis''.
;[[Strict aliasing|Alias classification and pointer analysis]]: In the presence of [[pointer (computer programming)|pointer]]s, it is difficult to make any optimizations at all, since potentially any variable may have been changed when a memory location is assigned to. By specifying which pointers can alias which variables, unrelated pointers can be ignored.
;[[Dead store|Dead-store]] elimination: Removal of assignments to variables that are not subsequently read, either because the lifetime of the variable ends or because a subsequent assignment will overwrite the first value.
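A minimal before/after sketch of common subexpression elimination combined with constant folding, written as C source for readability. The temporary name <code>t</code> is hypothetical; compilers introduce such temporaries in their intermediate representation.

<syntaxhighlight lang="c">
int before(int a, int b) {
    /* (a + b) appears twice, and 3 + 5 is a compile-time constant. */
    return (a + b) - (a + b) / 4 + (3 + 5);
}

int after(int a, int b) {
    int t = a + b;        /* common subexpression computed once */
    return t - t / 4 + 8; /* 3 + 5 folded to 8 at compile time  */
}
</syntaxhighlight>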
=== SSA-based optimizations ===
These optimizations are intended to be done after transforming the program into a special form called [[SSA (compilers)|static single assignment]], in which every variable is assigned in only one place. Although some function without SSA, they are most effective with SSA. Many optimizations listed in other sections also benefit with no special changes, such as register allocation.

;[[Global value numbering]]: GVN eliminates redundancy by constructing a value graph of the program and then determining which values are computed by equivalent expressions. GVN can identify some redundancy that [[common subexpression elimination]] cannot, and vice versa.
;[[Sparse conditional constant propagation]]: Combines constant propagation, [[constant folding]], and [[dead-code elimination]], and improves upon what is possible by running them separately (illustrated below).<ref>{{cite journal|last1=Wegman|first1=Mark N.|last2=Zadeck|first2=F. Kenneth|url=https://bears.ece.ucsb.edu/class/ece253/papers/wegman91.pdf|title=Constant Propagation with Conditional Branches|journal=[[ACM Transactions on Programming Languages and Systems]]|volume=13|issue=2|date=April 1991|pages=181–210|doi=10.1145/103135.103136}}</ref><ref>{{cite journal|last1=Click|first1=Clifford|last2=Cooper|first2=Keith|url=https://scholarship.rice.edu/bitstream/handle/1911/96451/TR95-252.pdf?sequence=1&isAllowed=y|title=Combining Analyses, Combining Optimizations|journal=ACM Transactions on Programming Languages and Systems|volume=17|issue=2|date=March 1995|pages=181–196|doi=10.1145/201059.201061}}</ref> This optimization symbolically executes the program, simultaneously propagating constant values and eliminating portions of the [[control-flow graph]] that this makes unreachable.
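A small hand-worked example of sparse conditional constant propagation on SSA form, in C syntax with the SSA renaming shown in comments. The variable names and the simplified result are illustrative only.

<syntaxhighlight lang="c">
int f(void) {
    /* Original:             SSA form:
       x = 4;                x1 = 4
       if (x > 10)           if (x1 > 10)      -- x1 is known to be 4,
           y = x + 1;            y1 = x1 + 1      so the test folds to false
       else
           y = x - 1;            y2 = x1 - 1
       return y;             y3 = phi(y1, y2); return y3                      */
    int x = 4;
    int y;
    if (x > 10)
        y = x + 1;
    else
        y = x - 1;
    return y;
}

/* After sparse conditional constant propagation the compiler knows x1 == 4,
   folds the comparison to false, removes the unreachable "then" branch and
   the phi node, and reduces the function to a constant: */
int f_optimized(void) {
    return 3;   /* 4 - 1, computed at compile time */
}
</syntaxhighlight>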
=== Code generator optimizations ===
;[[Register allocation]]: The most frequently used variables should be kept in processor registers for the fastest access. To find which variables to put in registers, an interference graph is created. Each variable is a vertex, and two variables are connected by an edge when they are live at the same time (their live ranges intersect). The graph is then colored, for example with [[Chaitin's algorithm]], using the same number of colors as there are registers. If the coloring fails, one variable is "spilled" to memory and the coloring is retried.
;[[Instruction selection]]: Most architectures, particularly [[Complex instruction set computer|CISC]] architectures and those with many [[addressing mode]]s, offer several different ways of performing a particular operation, using entirely different sequences of instructions. The job of the instruction selector is to choose, overall, which instructions should implement which operators in the low-level [[intermediate representation]]. For example, on many processors in the [[68000 family]] and on the x86 architecture, complex addressing modes can be used in statements like <code>lea 25(a1,d5*4), a0</code>, allowing a single instruction to perform a significant amount of arithmetic with less storage.
;[[Instruction scheduling]]: Instruction scheduling is an important optimization for modern pipelined processors, which avoids stalls or bubbles in the pipeline by clustering instructions with no dependencies together, while being careful to preserve the original semantics.
;[[Rematerialization]]: Rematerialization recalculates a value instead of loading it from memory, eliminating a memory access. This is performed in tandem with register allocation to avoid spills.
;Code factoring: If several sequences of code are identical, or can be parameterized or reordered to be identical, they can be replaced with calls to a shared subroutine (illustrated below). This can often share code for subroutine set-up and sometimes tail-recursion.<ref name="keil">Cx51 Compiler Manual, version 09.2001, p. 155, Keil Software Incorporated.</ref>
;[[Trampoline (computing)|Trampolines]]: Many{{Citation needed|date=January 2018}} CPUs have smaller subroutine call instructions to access low memory. A compiler can save space by using these small calls in the main body of code. Jump instructions in low memory can access the routines at any address. This multiplies the space savings from code factoring.<ref name="keil" />
;Reordering computations: Based on [[integer linear programming]], restructuring compilers enhance data locality and expose more parallelism by reordering computations. Space-optimizing compilers may reorder code to lengthen sequences that can be factored into subroutines.
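A sketch of code factoring at the source level. The helper name <code>report_range_error</code> is invented for illustration; in practice the transformation is usually applied by the compiler or linker to machine code rather than to source.

<syntaxhighlight lang="c">
#include <stdio.h>

/* Before: the same error-handling sequence is duplicated at two sites. */
void check_index_before(int i, int n) {
    if (i < 0)  { fprintf(stderr, "index out of range: %d\n", i); return; }
    if (i >= n) { fprintf(stderr, "index out of range: %d\n", i); return; }
}

/* After: the common sequence is factored into one shared routine,
   and both sites are replaced by calls to it. */
static void report_range_error(int i) {
    fprintf(stderr, "index out of range: %d\n", i);
}

void check_index_after(int i, int n) {
    if (i < 0)  { report_range_error(i); return; }
    if (i >= n) { report_range_error(i); return; }
}
</syntaxhighlight>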
=== Functional language optimizations ===
Although many of these also apply to non-functional languages, they either originate in or are particularly critical in [[functional language]]s such as [[Lisp programming language|Lisp]] and [[ML programming language|ML]].

;[[Tail-call optimization]]: A function call consumes stack space and involves some overhead related to parameter passing and flushing the instruction cache. [[Tail recursion|Tail-recursive]] algorithms can be converted to [[iteration]] through a process called tail-recursion elimination or tail-call optimization (illustrated below).
;[[Deforestation (computer science)|Deforestation]] ([[data structure]] fusion): In languages where it is common for a sequence of transformations to be applied to a list, deforestation attempts to remove the construction of intermediate data structures.
;[[Partial evaluation]]: Computations that produce the same output regardless of the dynamic input at runtime can be evaluated at compile time.
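A sketch of tail-recursion elimination, shown here in C for readability (the function names are illustrative; in functional languages the compiler performs this rewrite on recursive definitions):

<syntaxhighlight lang="c">
/* Tail-recursive form: the recursive call is the last thing the function does. */
long sum_to_rec(long n, long acc) {
    if (n == 0)
        return acc;
    return sum_to_rec(n - 1, acc + n);   /* tail call */
}

/* After tail-call optimization the call becomes a jump back to the top of
   the function, i.e. a loop, so no new stack frame is created. */
long sum_to_iter(long n, long acc) {
    while (n != 0) {
        acc = acc + n;
        n = n - 1;
    }
    return acc;
}
</syntaxhighlight>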
=== Other optimizations ===
;[[Bounds-checking elimination]]: Many languages, such as [[Java (programming language)|Java]], enforce [[bounds checking]] of all array accesses. This is a severe performance [[Bottleneck (software)|bottleneck]] on certain applications such as scientific code. Bounds-checking elimination allows the compiler to safely remove bounds checking in many situations where it can determine that the index must fall within valid bounds; for example, if it is a simple loop variable.
;Branch-offset optimization (machine dependent): Choose the shortest branch displacement that reaches the target.
;Code-block reordering: Code-block reordering alters the order of the basic [[Block (programming)|blocks]] in a program to reduce conditional branches and improve locality of reference.
;[[Dead-code elimination]]: Removes instructions that will not affect the behaviour of the program, for example definitions that have no uses, called [[dead code]]. This reduces code size and eliminates unnecessary computation.
;Factoring out of invariants ([[loop invariant]]s): If an expression is carried out both when a condition is met and when it is not met, it can be written just once, outside of the conditional statement. Similarly, if certain types of expressions (e.g., the assignment of a constant into a variable) appear inside a loop, they can be moved out of it because their effect will be the same no matter whether they are executed many times or just once. This is also known as total redundancy elimination. A similar but more powerful optimization is [[partial-redundancy elimination]] (PRE).
;[[Inline expansion]] or [[Macro (computer science)|macro]] expansion: When some code invokes a [[Subroutine|procedure]], it is possible to directly insert the body of the procedure inside the calling code rather than transferring control to it (illustrated below). This saves the overhead related to procedure calls and provides an opportunity for many parameter-specific optimizations, but comes at the cost of space; the procedure body is duplicated each time the procedure is called inline. Generally, inlining is useful in performance-critical code that makes a large number of calls to small procedures. This is a "fewer jumps" optimization. The [[Statement (computer science)|statements]] of [[imperative programming]] languages are also an example of such an optimization. Although statements could be implemented with [[subroutine|function calls]], they are almost always implemented with code inlining.
;[[Jump threading]]: In this optimization, consecutive conditional jumps predicated entirely or partially on the same condition are merged.
: E.g., <code>if (c) { foo; } if (c) { bar; }</code> becomes <code>if (c) { foo; bar; }</code>,
: and <code>if (c) { foo; } if (!c) { bar; }</code> becomes <code>if (c) { foo; } else { bar; }</code>.
;Macro compression: A space optimization that recognizes common sequences of code, creates subprograms ("code macros") that contain the common code, and replaces the occurrences of the common code sequences with calls to the corresponding subprograms.<ref name="MCO"/> This is most effectively done as a [[machine code]] optimization, when all the code is present. The technique was first used to conserve space in an interpretive [[byte stream]] used in an implementation of [[SPITBOL compiler|Macro Spitbol]] on [[microcomputers]].<ref name="MicroSpitbol">{{cite book|last1=Dewar|first1=Robert B. K.|author-link1=Robert Dewar|title=MICRO SPITBOL|last2=Golumbic|first2=Martin Charles|author-link2=Martin Charles Golumbic|last3=Goss|first3=Clinton F.|date=August 2013|publisher=Courant Institute of Mathematical Sciences|series=Computer Science Department Technical Report|volume=11|bibcode=2013arXiv1308.6096D|orig-date=October 1979|arxiv=1308.6096}}</ref> The problem of determining an optimal set of macros that minimizes the space required by a given code segment is known to be [[NP-complete]],<ref name="MCO"/> but efficient heuristics attain near-optimal results.<ref>{{cite conference|last1=Golumbic|first1=Martin Charles|author-link1=Martin Charles Golumbic|last2=Dewar|first2=Robert B. K.|author-link2=Robert Dewar|last3=Goss|first3=Clinton F.|year=1980|title=Macro Substitutions in MICRO SPITBOL – a Combinatorial Analysis|conference=11th Southeastern Conference on Combinatorics, Graph Theory and Computing|volume=29|pages=485–495|book-title=Proceedings of the 11th Southeastern Conference on Combinatorics, Graph Theory and Computing, Congressus Numerantium, Utilitas Math., Winnipeg, Canada}}</ref>
;Reduction of [[cache (computing)|cache]] collisions: (e.g., by disrupting alignment within a page)
;[[Call stack|Stack]]-height reduction: Rearrange an expression tree to minimize the resources needed for expression evaluation.{{Clarify|date=December 2023|reason=On a [[stack machine]], it can reduce the depth of the evaluation stack, but not all machines use an evaluation stack - this might, for example, reduce register usage and thus register pressure.}}
;Test reordering: If we have two tests that are the condition for something, we can first deal with the simpler tests (e.g., comparing a variable to something) and only then with the complex tests (e.g., those that require a function call). This technique complements [[lazy evaluation]], but can be used only when the tests are not dependent on one another. [[Minimal evaluation|Short-circuiting]] semantics can make this difficult.
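A before/after sketch of inline expansion at the C source level. The names <code>square</code> and <code>sum_of_squares</code> are invented for illustration; compilers normally inline at the intermediate-representation level and then optimize the inlined body further.

<syntaxhighlight lang="c">
static int square(int x) {
    return x * x;
}

/* Before: each iteration pays the call overhead for square(). */
int sum_of_squares_before(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += square(a[i]);
    return sum;
}

/* After inlining: the call is replaced by the body of square(),
   removing the call/return overhead and exposing the multiplication
   to further loop optimizations. */
int sum_of_squares_after(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * a[i];
    return sum;
}
</syntaxhighlight>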
=== Interprocedural optimizations ===
[[Interprocedural optimization]] works on the entire program, across procedure and file boundaries. It works closely with its intraprocedural counterparts, carried out with the cooperation of a local part and a global part. Typical interprocedural optimizations are procedure [[inlining]], interprocedural dead-code elimination, interprocedural constant propagation, and procedure reordering. As usual, the compiler needs to perform interprocedural analysis before its actual optimizations. Interprocedural analyses include alias analysis, [[array access analysis]], and the construction of a [[call graph]].

Interprocedural optimization is common in modern commercial compilers from [[Silicon Graphics|SGI]], [[Intel]], [[Microsoft]], and [[Sun Microsystems]]. For a long time, the open-source [[GNU Compiler Collection|GCC]] was criticized for a lack of powerful interprocedural analysis and optimizations, though this is now improving.<ref>{{Cite arXiv|last=Glazunov|first=N. M.|date=November 25, 2012|title=Foundations of Scientific Research|class=cs.OH|eprint=1212.1651}}</ref> Another open-source compiler with a full analysis and optimization infrastructure is [[Open64]].

Due to the extra time and space required by interprocedural analysis, most compilers do not perform it by default. Users must use compiler options to explicitly tell the compiler to enable interprocedural analysis and other expensive optimizations.
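A small illustration of interprocedural constant propagation combined with inlining, written in C; the function names are hypothetical. A constant argument at the only call site lets the callee's branch be resolved at compile time, which only whole-program analysis can see.

<syntaxhighlight lang="c">
#include <stdio.h>

static void log_value(int x) {
    printf("scale: %d\n", x);
}

static int scale(int x, int verbose) {
    /* Interprocedural constant propagation sees that every call site
       passes verbose = 0, so this branch can be removed. */
    if (verbose)
        log_value(x);
    return x * 16;
}

/* Before: the constant argument is only exploitable across the call boundary. */
int caller(int x) {
    return scale(x, 0);
}

/* After interprocedural constant propagation and inlining, caller()
   is effectively reduced to: */
int caller_optimized(int x) {
    return x * 16;
}
</syntaxhighlight>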