==Effect on performance==
The direct effect of this optimization is to improve time performance (by eliminating call overhead), at the cost of worsening space usage{{efn|Space usage is "number of instructions", and is both runtime space usage and the [[binary file]] size.}} (due to [[code duplication|duplicating]] the function body). The code expansion due to duplicating the function body dominates, except for simple cases,{{efn|Code size actually shrinks for very short functions, where the call overhead is larger than the body of the function, or for single-use functions, where no duplication occurs.}} and thus the direct effect of inline expansion is to improve time at the cost of space. However, the main benefit of inline expansion is to allow further optimizations and improved scheduling, due to increasing the size of the function body, as better optimization is possible on larger functions.{{sfn|Chen|Chang|Conte|Hwu|1993|loc=3.4 Function inline expansion, p. 14}} The ultimate impact of inline expansion on speed is complex, due to multiple effects on the performance of the memory system (mainly the [[instruction cache]]), which dominates performance on modern processors: depending on the specific program and cache, inlining particular functions can increase or decrease performance.{{sfn|Chen|Chang|Conte|Hwu|1993}}

The impact of inlining varies by [[programming language]] and program, due to different degrees of abstraction. In lower-level imperative languages such as [[C (programming language)|C]] and [[Fortran]] it typically yields a 10–20% speed boost, with minor impact on code size, while in more abstract languages it can be significantly more important, due to the number of layers inlining removes, an extreme example being [[Self (programming language)|Self]], where one compiler saw improvement factors of 4 to 55 from inlining.{{sfn|Peyton Jones|Marlow|1999|loc=8. Related work, p. 17}}

The direct benefits of eliminating a function call are:
* It eliminates the instructions needed for a [[function call]], both in the calling function and in the callee: placing arguments on a [[Stack-based memory allocation|stack]] or in [[Processor register|registers]], the function call itself, the [[function prologue]], then at return the [[function epilogue]], the [[return statement]], getting the return value back, removing arguments from the stack, and restoring registers (if needed).
* Because registers are not needed to pass arguments, it reduces [[register spilling]].
* It eliminates having to pass references and then dereference them, when using [[call by reference]] (or [[call by address]], or [[call by sharing]]).
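For illustration, the following C sketch shows by hand what a compiler effectively does when it inlines a small function; the actual transformation is performed on an intermediate representation rather than on source code, and the function names here are only examples:

<syntaxhighlight lang="c">
/* A small function whose call overhead can exceed the cost of its body. */
static int square(int x) {
    return x * x;
}

int sum_of_squares(int a, int b) {
    /* Two calls: arguments are passed, the prologue and epilogue run,
       and the return values are transferred back to the caller. */
    return square(a) + square(b);
}

/* After inline expansion, only the function bodies remain: no call
   instructions, no prologue/epilogue, and no argument passing. */
int sum_of_squares_inlined(int a, int b) {
    return (a * a) + (b * b);
}
</syntaxhighlight>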
The main benefit of inlining, however, is the further optimizations it allows. Optimizations that cross function boundaries can be done without requiring [[interprocedural optimization]] (IPO): once inlining has been performed, additional ''intra''procedural optimizations ("global optimizations") become possible on the enlarged function body. For example:
* A [[Constant (computer programming)|constant]] passed as an argument can often be propagated to all instances of the matching parameter, or part of the function may be "hoisted out" of a loop (via [[loop-invariant code motion]]).
* [[Register allocation]] can be done across the larger function body.
* High-level optimizations, such as [[escape analysis]] and [[tail duplication]], can be performed on a larger scope and be more effective, particularly if the compiler implementing those optimizations relies mainly on intra-procedural analysis.<ref name="prokopec2019"/>

These can be done without inlining, but require a significantly more complex compiler and linker (in case caller and callee are in separate compilation units).
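As an illustration of the first of the optimizations listed above, the following hand-written C sketch shows what a compiler can achieve after inlining: a subexpression that depends only on unchanging arguments becomes visible as loop-invariant and can be hoisted out of the loop. The function names are illustrative, and the transformation is actually performed on an intermediate representation:

<syntaxhighlight lang="c">
#include <stddef.h>

static double poly(double x, double a, double b) {
    double k = a * a + b;      /* depends only on a and b, not on x */
    return k * x;
}

void apply(double *v, size_t n, double a, double b) {
    for (size_t i = 0; i < n; i++)
        v[i] = poly(v[i], a, b);
}

/* After inlining, 'a * a + b' is visible inside the loop, is loop-invariant,
   and can be hoisted so that it is computed once instead of n times; if a
   and b are constants at the call site, it can even be folded to a single
   constant. */
void apply_inlined(double *v, size_t n, double a, double b) {
    double k = a * a + b;
    for (size_t i = 0; i < n; i++)
        v[i] = k * v[i];
}
</syntaxhighlight>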
Conversely, in some cases a language specification may allow a program to make additional assumptions about arguments to procedures that it can no longer make after the procedure is inlined, preventing some optimizations. Smarter compilers (such as the [[Glasgow Haskell Compiler]] (GHC)) will track this, but naive inlining loses this information.

A further benefit of inlining for the memory system is:
* Eliminating branches and keeping code that is executed close together in memory improves instruction cache performance by improving [[locality of reference]] (spatial locality and sequentiality of instructions). This is smaller than optimizations that specifically target sequentiality, but is significant.{{sfn|Chen|Chang|Conte|Hwu|1993|loc=3.4 Function inline expansion, pp. 19–20}}

The direct cost of inlining is increased code size, due to duplicating the function body at each call site. However, inlining does not always increase code size: for very short functions, where the function body is smaller than the call sequence (at the caller, including argument and return value handling), such as trivial [[accessor method]]s or [[mutator method]]s (getters and setters), code size can even shrink; and a function that is used in only one place is not duplicated. Thus inlining may be minimized or eliminated when optimizing for code size, as is often the case in [[embedded system]]s.

Inlining also imposes a cost on performance, as the code expansion (due to duplication) hurts instruction cache performance.<ref name="webkit">{{cite web |url=https://www.webkit.org/blog/2826/unusual-speed-boost-size-matters/ |title=Unusual speed boost: size matters |author=Benjamin Poulain |date=August 8, 2013}}</ref> This is most significant if, before expansion, the [[working set]] of the program (or a hot section of code) fit in one level of the memory hierarchy (e.g., [[L1 cache]]), but after expansion it no longer fits, resulting in frequent cache misses at that level. Because of the significant difference in performance between levels of the hierarchy, this hurts performance considerably. At the highest level this can result in increased [[page fault]]s, catastrophic performance degradation due to [[thrashing (computer science)|thrashing]], or the program failing to run at all. The last is rare in common desktop and server applications, where code size is small relative to available memory, but can be an issue in resource-constrained environments such as embedded systems.

One way to mitigate this problem is to split functions into a smaller hot inline path ([[fast path]]) and a larger cold non-inline path (slow path), as in the sketch at the end of this section.<ref name="webkit"/> Inlining hurts performance mainly for large functions that are used in many places, but the break-even point beyond which inlining reduces performance is difficult to determine and in general depends on the precise load, so it can be subject to manual optimization or [[profile-guided optimization]].<ref>See for example the [http://jikesrvm.org/Adaptive+Optimization+System Adaptive Optimization System] {{Webarchive|url=https://web.archive.org/web/20110809144146/http://jikesrvm.org/Adaptive+Optimization+System |date=2011-08-09}} in the [[Jikes RVM]] for Java.</ref> This is a similar issue to other code-expanding optimizations such as [[loop unrolling]], which also reduces the number of instructions processed but can decrease performance due to poorer cache behavior.

The precise effect of inlining on cache performance is complex. For small cache sizes (much smaller than the working set before expansion), the increased sequentiality dominates, and inlining improves cache performance. For cache sizes close to the working set, where inlining expands the working set so that it no longer fits in cache, this dominates and cache performance decreases. For cache sizes larger than the working set, inlining has negligible impact on cache performance. Further, changes in cache design, such as [[load forwarding]], can offset the increase in cache misses.{{sfn|Chen|Chang|Conte|Hwu|1993|loc=3.4 Function inline expansion, pp. 24–26}}
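A hand-written version of such a hot/cold split might look like the following C sketch; the names are illustrative, and <code>__attribute__((noinline))</code> is a GCC/Clang-specific hint (other compilers use different annotations):

<syntaxhighlight lang="c">
#include <stdlib.h>

struct buffer {
    char   *data;
    size_t  len;
    size_t  cap;
};

/* Cold slow path: rarely executed, deliberately kept out of line so that
   its code is not duplicated at every call site. (Error handling omitted.) */
__attribute__((noinline))
static void buffer_append_slow(struct buffer *b, char c)
{
    size_t new_cap = b->cap ? b->cap * 2 : 16;
    b->data = realloc(b->data, new_cap);
    b->cap = new_cap;
    b->data[b->len++] = c;
}

/* Hot fast path: small enough that inlining it at every call site adds
   little code while removing the call overhead in the common case. */
static inline void buffer_append(struct buffer *b, char c)
{
    if (b->len < b->cap)            /* common case: room is available */
        b->data[b->len++] = c;
    else
        buffer_append_slow(b, c);   /* rare case: grow the buffer first */
}
</syntaxhighlight>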