==Levels of optimization==
Optimization can occur at a number of levels. Typically the higher levels have greater impact and are harder to change later in a project, requiring significant changes or a complete rewrite. Thus optimization can typically proceed via refinement from higher to lower, with initial gains being larger and achieved with less work, and later gains being smaller and requiring more work. However, in some cases overall performance depends on performance of very low-level portions of a program, and small changes at a late stage or early consideration of low-level details can have outsized impact. Typically some consideration is given to efficiency throughout a project{{snd}} though this varies significantly{{snd}} but major optimization is often considered a refinement to be done late, if ever. On longer-running projects there are typically cycles of optimization, where improving one area reveals limitations in another, and these are typically curtailed when performance is acceptable or gains become too small or costly.

As performance is part of the specification of a program{{snd}} a program that is unusably slow is not fit for purpose: a video game running at 60 frames per second is acceptable, but 6 frames per second is unacceptably choppy{{snd}} performance is a consideration from the start, to ensure that the system is able to deliver sufficient performance, and early prototypes need to have roughly acceptable performance for there to be confidence that the final system will (with optimization) achieve acceptable performance. This is sometimes omitted in the belief that optimization can always be done later, resulting in prototype systems that are far too slow{{snd}} often by an [[order of magnitude]] or more{{snd}} and systems that ultimately are failures because they architecturally cannot achieve their performance goals, such as the [[Intel 432]] (1981); or ones that take years of work to achieve acceptable performance, such as Java (1995), which only achieved acceptable performance with [[HotSpot (virtual machine)|HotSpot]] (1999). The degree to which performance changes between prototype and production system, and how amenable it is to optimization, can be a significant source of uncertainty and risk.

===Design level===
At the highest level, the design may be optimized to make best use of the available resources, given goals, constraints, and expected use/load. The architectural design of a system overwhelmingly affects its performance. For example, a system that is network latency-bound (where network latency is the main constraint on overall performance) would be optimized to minimize network trips, ideally making a single request (or no requests, as in a [[push protocol]]) rather than multiple round trips, as sketched below. Choice of design depends on the goals: when designing a [[compiler]], if fast compilation is the key priority, a [[one-pass compiler]] is faster than a [[multi-pass compiler]] (assuming the same work), but if speed of output code is the goal, a slower multi-pass compiler fulfills the goal better, even though it takes longer itself. Choice of platform and programming language occurs at this level, and changing them frequently requires a complete rewrite, though a modular system may allow rewrite of only some component{{snd}} for example, for a Python program one may rewrite performance-critical sections in C. In a distributed system, choice of architecture ([[client-server]], [[peer-to-peer]], etc.) occurs at the design level, and may be difficult to change, particularly if all components cannot be replaced in sync (e.g., old clients).
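For illustration, the difference between a request-per-item interface and a batched one can be sketched in C; the function names are hypothetical, and a simple counter stands in for real network requests:

<syntaxhighlight lang="c">
#include <stdio.h>

/* Hypothetical sketch: each call to fetch_price() stands for one network
   round trip, while fetch_prices() does the same work in a single trip. */

static int round_trips = 0;              /* counts simulated requests */

/* one item per request: n items -> n round trips */
double fetch_price(int item_id) {
    round_trips++;
    return 10.0 + item_id;               /* stand-in for a server response */
}

/* batched: n items -> 1 round trip */
void fetch_prices(const int *item_ids, double *out, int n) {
    round_trips++;
    for (int i = 0; i < n; i++)
        out[i] = 10.0 + item_ids[i];
}

int main(void) {
    int ids[3] = {1, 2, 3};
    double prices[3];

    for (int i = 0; i < 3; i++)          /* chatty: 3 round trips */
        prices[i] = fetch_price(ids[i]);

    fetch_prices(ids, prices, 3);        /* batched: 1 round trip */
    printf("simulated round trips: %d\n", round_trips);
    return 0;
}
</syntaxhighlight>

In a real latency-bound system the saving comes from paying the network latency once rather than once per item; the sketch only counts the trips.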
===Algorithms and data structures===
Given an overall design, a good choice of [[algorithmic efficiency|efficient algorithms]] and [[data structure]]s, and efficient implementation of these algorithms and data structures, comes next. After design, the choice of [[algorithm]]s and data structures affects efficiency more than any other aspect of the program. Generally data structures are more difficult to change than algorithms, as a data structure and its performance assumptions are used throughout the program, though this can be minimized by the use of [[abstract data type]]s in function definitions, and keeping the concrete data structure definitions restricted to a few places.

For algorithms, this primarily consists of ensuring that algorithms are constant O(1), logarithmic O(log ''n''), linear O(''n''), or in some cases log-linear O(''n'' log ''n'') in the input (both in space and time). Algorithms with quadratic complexity O(''n''<sup>2</sup>) fail to scale, and even linear algorithms cause problems if repeatedly called, and are typically replaced with constant or logarithmic algorithms if possible. Beyond asymptotic order of growth, the constant factors matter: an asymptotically slower algorithm may be faster or smaller (because simpler) than an asymptotically faster algorithm when they are both faced with small input, which may be the common case in practice. Often a [[hybrid algorithm]] will provide the best performance, due to this tradeoff changing with size.

A general technique to improve performance is to avoid work. A good example is the use of a [[fast path]] for common cases, improving performance by avoiding unnecessary work: for example, using a simple text layout algorithm for Latin text, and only switching to a complex layout algorithm for complex scripts, such as [[Devanagari]]. Another important technique is caching, particularly [[memoization]], which avoids redundant computations. Because of the importance of caching, there are often many levels of caching in a system, which can cause problems from memory use, and correctness issues from stale caches.
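As a minimal sketch of memoization in C (using, for illustration, a recursive Fibonacci computation), previously computed results are cached so that each value is computed only once, turning an exponential-time recursion into a linear-time one:

<syntaxhighlight lang="c">
#include <stdio.h>

#define MAX_N 90
static long long cache[MAX_N];            /* zero-initialized: 0 means "not yet computed" */

long long fib(int n) {
    if (n < 2) return n;                  /* base cases need no caching */
    if (cache[n] != 0) return cache[n];   /* reuse a previously computed value */
    cache[n] = fib(n - 1) + fib(n - 2);
    return cache[n];
}

int main(void) {
    printf("fib(80) = %lld\n", fib(80));  /* fast; the uncached recursion would take far too long */
    return 0;
}
</syntaxhighlight>

The same idea underlies higher-level caches, with the added complications of bounding memory use and invalidating stale entries noted above.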
===Source code level===
Beyond general algorithms and their implementation on an abstract machine, concrete source code level choices can make a significant difference. For example, on early C compilers, <code>while(1)</code> was slower than <code>for(;;)</code> for an unconditional loop, because <code>while(1)</code> evaluated 1 and then had a conditional jump which tested if it was true, while <code>for(;;)</code> had an unconditional jump. Some optimizations (such as this one) can nowadays be performed by [[optimizing compiler]]s. This depends on the source language, the target machine language, and the compiler, can be difficult to understand or predict, and changes over time; this is a key place where understanding of compilers and machine code can improve performance. [[Loop-invariant code motion]] and [[return value optimization]] are examples of optimizations that reduce the need for auxiliary variables and can even result in faster performance by avoiding roundabout optimizations.

===Build level===
Between the source and compile level, [[Directive (programming)|directives]] and [[Build automation|build flags]] can be used to tune performance options in the source code and compiler respectively, such as using [[preprocessor]] defines to disable unneeded software features, optimizing for specific processor models or hardware capabilities, or predicting [[branch (computer science)|branching]]. Source-based software distribution systems such as [[Berkeley Software Distribution|BSD]]'s [[Ports collection|Ports]] and [[Gentoo Linux|Gentoo]]'s [[Portage (software)|Portage]] can take advantage of this form of optimization.
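For illustration, a minimal C sketch of such a build-level switch, in which an optional feature is compiled out entirely unless a preprocessor define (here the hypothetical name <code>ENABLE_LOGGING</code>) is supplied at build time, e.g. <code>cc -DENABLE_LOGGING example.c</code> versus <code>cc example.c</code>:

<syntaxhighlight lang="c">
#include <stdio.h>

/* ENABLE_LOGGING is a hypothetical, project-specific build flag. */
#ifdef ENABLE_LOGGING
#define LOG(msg) fprintf(stderr, "log: %s\n", (msg))
#else
#define LOG(msg) ((void)0)            /* no code is generated in this build */
#endif

int main(void) {
    LOG("starting work");             /* disappears entirely when logging is disabled */
    puts("doing work");
    return 0;
}
</syntaxhighlight>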
===Compile level===
Use of an [[optimizing compiler]] tends to ensure that the [[executable program]] is optimized at least as much as the compiler can predict.

===Assembly level===
At the lowest level, writing code using an [[assembly language]] designed for a particular hardware platform can produce the most efficient and compact code if the programmer takes advantage of the full repertoire of [[machine instruction]]s. Many [[operating system]]s used on [[embedded system]]s have been traditionally written in assembler code for this reason. Programs (other than very small programs) are seldom written from start to finish in assembly due to the time and cost involved. Most are compiled down from a high-level language to assembly and hand-optimized from there. When efficiency and size are less important, large parts may be written in a high-level language.

With more modern [[optimizing compiler]]s and the greater complexity of recent [[CPU]]s, it is harder to write more efficient code than what the compiler generates, and few projects need this "ultimate" optimization step. Much of the code written today is intended to run on as many machines as possible. As a consequence, programmers and compilers do not always take advantage of the more efficient instructions provided by newer CPUs or quirks of older models. Additionally, assembly code tuned for a particular processor without using such instructions might still be suboptimal on a different processor, which may expect a different tuning of the code. Typically today, rather than writing in assembly language, programmers will use a [[disassembler]] to analyze the output of a compiler and change the high-level source code so that it can be compiled more efficiently, or to understand why it is inefficient.

===Run time===
[[Just-in-time compilation|Just-in-time]] compilers can produce customized machine code based on run-time data, at the cost of compilation overhead. This technique dates to the earliest [[regular expression]] engines, and has become widespread with Java HotSpot and V8 for JavaScript. In some cases [[adaptive optimization]] may be able to perform [[run time (program lifecycle phase)|run time]] optimization exceeding the capability of static compilers by dynamically adjusting parameters according to the actual input or other factors. [[Profile-guided optimization]] is an ahead-of-time (AOT) compilation optimization technique based on run-time profiles, and can be seen as a static "average case" analog of the dynamic technique of adaptive optimization. [[Self-modifying code]] can alter itself in response to run-time conditions in order to optimize code; this was more common in assembly language programs.

Some [[CPU design]]s can perform some optimizations at run time. Some examples include [[out-of-order execution]], [[speculative execution]], [[instruction pipeline]]s, and [[branch predictor]]s. Compilers can help the program take advantage of these CPU features, for example through [[instruction scheduling]].

===Platform dependent and independent optimizations===
Code optimization can also be broadly categorized as [[computer platform|platform]]-dependent and platform-independent techniques. While the latter are effective on most or all platforms, platform-dependent techniques use specific properties of one platform, or rely on parameters depending on the single platform or even on the single processor. Writing or producing different versions of the same code for different processors might therefore be needed. For instance, in the case of compile-level optimization, platform-independent techniques are generic techniques (such as [[loop unwinding|loop unrolling]], reduction in function calls, memory-efficient routines, reduction in conditions, etc.) that impact most CPU architectures in a similar way. An example of platform-independent optimization has been shown with the inner for loop construct, where it was observed that a loop with an inner for loop performed more computations per unit time than a loop without it or one with an inner while loop.<ref>{{Cite journal|last=Adewumi|first=Tosin P.|date=2018-08-01|title=Inner loop program construct: A faster way for program execution|journal=Open Computer Science|language=en|volume=8|issue=1|pages=115–122|doi=10.1515/comp-2018-0004|doi-access=free}}</ref> Generally, these serve to reduce the total [[instruction path length]] required to complete the program and/or reduce total memory usage during the process. On the other hand, platform-dependent techniques involve instruction scheduling, [[instruction-level parallelism]], data-level parallelism, and cache optimization techniques (i.e., parameters that differ among various platforms); the optimal instruction scheduling might be different even on different processors of the same architecture.
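As an illustrative sketch of loop unrolling, one of the platform-independent techniques mentioned above, the unrolled version below performs the same summation with fewer loop-control branches per element (modern optimizing compilers often apply this transformation automatically):

<syntaxhighlight lang="c">
#include <stdio.h>

long long sum_simple(const int *a, int n) {
    long long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

long long sum_unrolled(const int *a, int n) {
    long long s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4)        /* four elements per loop iteration */
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)                /* handle any leftover elements */
        s += a[i];
    return s;
}

int main(void) {
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    printf("%lld %lld\n", sum_simple(a, 10), sum_unrolled(a, 10));
    return 0;
}
</syntaxhighlight>

Whether unrolling actually helps depends on the processor and compiler, which is one reason such transformations are often left to the compiler.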