== Compiler construction ==
{{more footnotes needed|section|date=December 2019}}

A compiler implements a formal transformation from a high-level source program to a low-level target program. Compiler design can define an end-to-end solution or tackle a defined subset that interfaces with other compilation tools, e.g. preprocessors, assemblers, linkers. Design requirements include rigorously defined interfaces both internally between compiler components and externally between supporting toolsets.

In the early days, the approach taken to compiler design was directly affected by the complexity of the computer language to be processed, the experience of the person(s) designing it, and the resources available. Resource limitations led to the need to pass through the source code more than once.

A compiler for a relatively simple language written by one person might be a single, monolithic piece of software. However, as the source language grows in complexity the design may be split into a number of interdependent phases. Separate phases provide design improvements that focus development on the functions in the compilation process.

=== One-pass versus multi-pass compilers{{anchor|Single-pass}} ===
Classifying compilers by number of passes has its background in the hardware resource limitations of computers. Compiling involves performing a great deal of work, and early computers did not have enough memory to contain one program that did all of this work. As a result, compilers were split up into smaller programs which each made a pass over the source (or some representation of it), performing some of the required analysis and translations.

The ability to compile in a [[one-pass compiler|single pass]] has classically been seen as a benefit because it simplifies the job of writing a compiler, and one-pass compilers generally perform compilations faster than [[multi-pass compiler]]s. Thus, partly driven by the resource limitations of early systems, many early languages were specifically designed so that they could be compiled in a single pass (e.g., [[Pascal (programming language)|Pascal]]).

In some cases, the design of a language feature may require a compiler to perform more than one pass over the source. For instance, consider a declaration appearing on line 20 of the source which affects the translation of a statement appearing on line 10. In this case, the first pass needs to gather information about declarations appearing after statements that they affect, with the actual translation happening during a subsequent pass.

The disadvantage of compiling in a single pass is that it is not possible to perform many of the sophisticated [[compiler optimization|optimizations]] needed to generate high-quality code. It can be difficult to count exactly how many passes an optimizing compiler makes. For instance, different phases of optimization may analyse one expression many times but only analyse another expression once.

Splitting a compiler up into small programs is a technique used by researchers interested in producing provably correct compilers. Proving the correctness of a set of small programs often requires less effort than proving the correctness of a larger, single, equivalent program.
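As a rough illustration of the declaration-ordering issue discussed above, the following C fragment is a minimal sketch (the identifiers <code>limit</code> and <code>total</code> are invented for the example). C requires declarations to precede their uses, which is part of what allows a source file to be translated in a single forward pass; a language that permitted the declaration of <code>limit</code> to appear only after <code>main</code> would push the compiler toward an extra, declaration-collecting pass.

<syntaxhighlight lang="c">
#include <stdio.h>

/* Illustrative only: C's declaration-before-use rule means a one-pass
 * compiler already knows the type of `limit` when it reaches the `if`.
 * If the declaration were allowed to follow main(), a first pass would be
 * needed just to collect declarations before translating statements. */
static const double limit = 100.0;

int main(void) {
    double total = 150.0;
    if (total > limit)            /* resolvable immediately in a single pass */
        printf("over the limit\n");
    return 0;
}
</syntaxhighlight>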
=== Three-stage compiler structure ===
[[File:Compiler design.svg|thumb|center|upright=2.5|Compiler design]]
Regardless of the exact number of phases in the compiler design, the phases can be assigned to one of three stages: a front end, a middle end, and a back end.

* The ''front end'' scans the input and verifies syntax and semantics according to a specific source language. For [[Type system|statically typed languages]] it performs [[type checking]] by collecting type information. If the input program is syntactically incorrect or has a type error, it generates error and/or warning messages, usually identifying the location in the source code where the problem was detected; in some cases the actual error may be (much) earlier in the program. Aspects of the front end include lexical analysis, syntax analysis, and semantic analysis. The front end transforms the input program into an [[intermediate representation]] (IR) for further processing by the middle end. This IR is usually a lower-level representation of the program with respect to the source code.
* The ''middle end'' performs optimizations on the IR that are independent of the CPU architecture being targeted. This source code/machine code independence is intended to enable generic optimizations to be shared between versions of the compiler supporting different languages and target processors. Examples of middle-end optimizations are removal of useless ([[dead-code elimination]]) or unreachable code ([[reachability analysis]]), discovery and propagation of constant values ([[constant propagation]]), relocation of computation to a less frequently executed place (e.g., out of a loop), or specialization of computation based on the context, eventually producing the "optimized" IR that is used by the back end.
* The ''back end'' takes the optimized IR from the middle end. It may perform more analysis, transformations and optimizations that are specific to the target CPU architecture. The back end generates the target-dependent assembly code, performing [[register allocation]] in the process. The back end performs [[instruction scheduling]], which re-orders instructions to keep parallel [[execution unit]]s busy by filling [[delay slot]]s. Although most optimization problems are [[NP-hardness|NP-hard]], [[Heuristic (computer science)|heuristic]] techniques for solving them are well developed and implemented in production-quality compilers. Typically the output of a back end is machine code specialized for a particular processor and operating system.

This front/middle/back-end approach makes it possible to combine front ends for different languages with back ends for different [[Central processing unit|CPUs]] while sharing the optimizations of the middle end.<ref>Cooper and Torczon 2012, p. 8</ref> Practical examples of this approach are the [[GNU Compiler Collection]], [[Clang]] ([[LLVM]]-based C/C++ compiler),<ref name="LattnerBook1st">{{cite book | author = Lattner, Chris |editor = Brown, Amy |editor2=Wilson, Greg | date = 2017 | chapter = LLVM | title = The Architecture of Open Source Applications | chapter-url = http://www.aosabook.org/en/llvm.html | access-date = 28 February 2017 | url-status = live | archive-url = https://web.archive.org/web/20161202070941/http://aosabook.org/en/llvm.html | archive-date = 2 December 2016}}</ref> and the [[Amsterdam Compiler Kit]], which have multiple front ends, shared optimizations, and multiple back ends.
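The division of labour between the stages can be pictured with a minimal structural sketch; everything here (the <code>IrInstr</code> and <code>IrProgram</code> types and the <code>front_end</code>, <code>middle_end</code>, <code>back_end</code> functions) is invented for illustration and is not any real compiler's API. The point is only that the stages communicate through a target-independent IR, so a new source language requires only a new front end and a new CPU only a new back end.

<syntaxhighlight lang="c">
#include <stdio.h>

/* Hypothetical three-address IR shared between the stages. */
typedef struct { char op; int dst, src1, src2; } IrInstr;   /* dst = src1 op src2 */
typedef struct { IrInstr code[32]; int len; } IrProgram;

/* Front end: would lex, parse and type-check the source, then lower it to IR. */
static IrProgram front_end(const char *source) {
    IrProgram p = { .len = 1 };
    (void)source;                               /* parsing elided in this sketch */
    p.code[0] = (IrInstr){ '+', 0, 1, 2 };      /* r0 = r1 + r2 */
    return p;
}

/* Middle end: target-independent rewrites of the IR (constant folding, DCE, ...). */
static void middle_end(IrProgram *p) { (void)p; }

/* Back end: instruction selection, register allocation and emission for one CPU. */
static void back_end(const IrProgram *p) {
    for (int i = 0; i < p->len; i++) {
        const IrInstr *ins = &p->code[i];
        const char *mnemonic = (ins->op == '+') ? "ADD" : "???";  /* instruction selection */
        printf("%s r%d, r%d, r%d\n", mnemonic, ins->dst, ins->src1, ins->src2);
    }
}

int main(void) {
    IrProgram ir = front_end("total = a + b;");
    middle_end(&ir);
    back_end(&ir);
    return 0;
}
</syntaxhighlight>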
==== Front end ====
[[File:Xxx Scanner and parser example for C.gif|thumb|right|400px|[[Lexical analysis|Lexer]] and [[Parsing|parser]] example for [[C (programming language)|C]]. Starting from the sequence of characters "<code>if(net>0.0)total+=net*(1.0+tax/100.0);</code>", the scanner composes a sequence of [[Lexical analysis#token|tokens]] and categorizes each of them, for example as {{color|#600000|identifier}}, {{color|#606000|reserved word}}, {{color|#006000|number literal}}, or {{color|#000060|operator}}. The latter sequence is transformed by the parser into a [[abstract syntax tree|syntax tree]], which is then treated by the remaining compiler phases. The scanner and parser handle the [[regular grammar|regular]] and properly [[context-free grammar|context-free]] parts of the [[C syntax|grammar for C]], respectively.]]

The front end analyzes the source code to build an internal representation of the program, called the [[intermediate representation]] (IR). It also manages the [[symbol table]], a data structure mapping each symbol in the source code to associated information such as location, type and scope.

While the front end can be a single monolithic function or program, as in a [[scannerless parser]], it was traditionally implemented and analyzed as several phases, which may execute sequentially or concurrently. This method is favored due to its modularity and [[separation of concerns]]. Most commonly, the front end is broken into three phases: [[lexical analysis]] (also known as lexing or scanning), [[syntax analysis]] (also known as parsing), and [[Semantic analysis (compilers)|semantic analysis]]. Lexing and parsing comprise the syntactic analysis (word syntax and phrase syntax, respectively), and in simple cases these modules (the lexer and parser) can be automatically generated from a grammar for the language, though in more complex cases they require manual modification. The lexical grammar and phrase grammar are usually [[context-free grammar]]s, which simplifies analysis significantly, with context-sensitivity handled at the semantic analysis phase. The semantic analysis phase is generally more complex and written by hand, but can be partially or fully automated using [[attribute grammar]]s. These phases themselves can be further broken down: lexing as scanning and evaluating, and parsing as building a [[Parse tree|concrete syntax tree]] (CST, parse tree) and then transforming it into an [[abstract syntax tree]] (AST, syntax tree). In some cases additional phases are used, notably ''line reconstruction'' and ''preprocessing'', but these are rare.

The main phases of the front end include the following:

* ''{{visible anchor|Line reconstruction}}'' converts the input character sequence to a canonical form ready for the parser. Languages which [[stropping (syntax)|strop]] their keywords or allow arbitrary spaces within identifiers require this phase. The [[top-down parsing|top-down]], [[recursive descent parser|recursive-descent]], table-driven parsers used in the 1960s typically read the source one character at a time and did not require a separate tokenizing phase. [[Atlas Autocode]] and [[Edinburgh IMP|Imp]] (and some implementations of [[ALGOL]] and [[Coral 66]]) are examples of stropped languages whose compilers would have a ''Line Reconstruction'' phase.
* ''[[Preprocessor|Preprocessing]]'' supports [[Macro (computer science)|macro]] substitution and [[conditional compilation]]. Typically the preprocessing phase occurs before syntactic or semantic analysis; e.g. in the case of C, the preprocessor manipulates lexical tokens rather than syntactic forms. However, some languages such as [[Scheme (programming language)|Scheme]] support macro substitutions based on syntactic forms.
* ''[[Lexical analysis]]'' (also known as ''lexing'' or ''tokenization'') breaks the source code text into a sequence of small pieces called ''lexical tokens''.<ref>Aho, Lam, Sethi, Ullman 2007, p. 5-6, 109-189</ref> This phase can be divided into two stages: the ''scanning'', which segments the input text into syntactic units called ''lexemes'' and assigns them a category; and the ''evaluating'', which converts lexemes into a processed value. A token is a pair consisting of a ''token name'' and an optional ''token value''.<ref>Aho, Lam, Sethi, Ullman 2007, p. 111</ref> Common token categories may include identifiers, keywords, separators, operators, literals and comments, although the set of token categories varies between [[programming language]]s. The lexeme syntax is typically a [[regular language]], so a [[finite-state automaton]] constructed from a [[regular expression]] can be used to recognize it. The software doing lexical analysis is called a [[lexical analyzer]]. This may not be a separate step; it can be combined with the parsing step in [[scannerless parsing]], in which case parsing is done at the character level, not the token level. (A small scanner sketch in C follows this list.)
* ''[[Syntax analysis]]'' (also known as ''parsing'') involves [[parsing]] the token sequence to identify the syntactic structure of the program. This phase typically builds a [[parse tree]], which replaces the linear sequence of tokens with a tree structure built according to the rules of a [[formal grammar]] which define the language's syntax. The parse tree is often analyzed, augmented, and transformed by later phases in the compiler.<ref>Aho, Lam, Sethi, Ullman 2007, p. 8, 191-300</ref>
* ''[[Semantic analysis (compilers)|Semantic analysis]]'' adds semantic information to the [[parse tree]] and builds the [[symbol table]]. This phase performs semantic checks such as [[type checking]] (checking for type errors), [[object binding]] (associating variable and function references with their definitions), or [[definite assignment analysis|definite assignment]] (requiring all local variables to be initialized before use), rejecting incorrect programs or issuing warnings. Semantic analysis usually requires a complete parse tree, meaning that this phase logically follows the [[parsing]] phase and logically precedes the [[code generation (compiler)|code generation]] phase, though it is often possible to fold multiple phases into one pass over the code in a compiler implementation.
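The scanner sketch referenced in the lexical-analysis item above is a minimal hand-written example; the token categories and the hard-coded input come from the figure caption, and real lexers are usually generated from regular expressions rather than written this way. It prints one categorized token per line, mirroring the colour-coded categories in the figure.

<syntaxhighlight lang="c">
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Illustrative token categories; a real C lexer distinguishes many more. */
static void emit(const char *category, const char *start, int len) {
    printf("%-13s %.*s\n", category, len, start);
}

int main(void) {
    const char *src = "if(net>0.0)total+=net*(1.0+tax/100.0);";
    const char *p = src;

    while (*p) {
        if (isalpha((unsigned char)*p)) {            /* identifier or reserved word */
            const char *start = p;
            while (isalnum((unsigned char)*p)) p++;
            int len = (int)(p - start);
            if (len == 2 && strncmp(start, "if", 2) == 0)
                emit("reserved word", start, len);
            else
                emit("identifier", start, len);
        } else if (isdigit((unsigned char)*p)) {     /* number literal, e.g. 100.0 */
            const char *start = p;
            while (isdigit((unsigned char)*p) || *p == '.') p++;
            emit("number", start, (int)(p - start));
        } else {                                     /* operators and punctuation */
            const char *start = p;
            if ((*p == '+' || *p == '>' || *p == '<') && p[1] == '=') p++;
            p++;
            emit("operator", start, (int)(p - start));
        }
    }
    return 0;
}
</syntaxhighlight>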
==== Middle end ====
The middle end, also known as the ''optimizer'', performs optimizations on the intermediate representation in order to improve the performance and the quality of the produced machine code.<ref name="Hjort Blindell, Gabriel">{{Cite book |title=Instruction selection: Principles, methods, and applications |publisher=Springer |last= Blindell |first=Gabriel Hjort |isbn=978-3-31934019-7 |location=Switzerland |oclc=951745657 |date=2016-06-03}}</ref> The middle end contains those optimizations that are independent of the CPU architecture being targeted.

The main phases of the middle end include the following:

* [[Compiler analysis|Analysis]]: the gathering of program information from the intermediate representation derived from the input; [[data-flow analysis]] is used to build [[use-define chain]]s, together with [[dependence analysis]], [[alias analysis]], [[pointer analysis]], [[escape analysis]], etc. Accurate analysis is the basis for any compiler optimization. The [[control-flow graph]] of every compiled function and the [[call graph]] of the program are usually also built during the analysis phase.
* [[Compiler optimization|Optimization]]: the intermediate language representation is transformed into functionally equivalent but faster (or smaller) forms, as in the sketch after this list. Popular optimizations are [[inline expansion]], [[dead-code elimination]], [[constant propagation]], [[loop transformation]] and even [[automatic parallelization]]. Compiler analysis is the prerequisite for any compiler optimization, and the two work tightly together. For example, [[dependence analysis]] is crucial for [[loop transformation]].
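The following source-level sketch illustrates the optimization step referenced above. A real middle end works on the IR rather than on C source, and the function names here are invented for the example: constant propagation determines that the branch condition is always false, after which dead-code elimination removes the branch and the computation feeding it.

<syntaxhighlight lang="c">
#include <stdio.h>

/* Before optimization: `debug` is a known constant, so the branch and the
 * multiplication feeding it are useless work. */
int scaled_before(int x) {
    const int debug = 0;
    int trace = x * 3;       /* value only used under the dead branch */
    if (debug)               /* constant propagation: condition is always 0 */
        return trace;
    return x + 1;
}

/* After constant propagation and dead-code elimination, the middle end can
 * reduce the function to the equivalent of: */
int scaled_after(int x) {
    return x + 1;
}

int main(void) {
    printf("%d %d\n", scaled_before(4), scaled_after(4));   /* both print 5 */
    return 0;
}
</syntaxhighlight>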
The scope of compiler analyses and optimizations varies greatly; they may range from operating within a single [[basic block]] to whole procedures, or even the whole program. There is a trade-off between the granularity of the optimizations and the cost of compilation. For example, [[peephole optimization]]s are fast to perform during compilation but only affect a small local fragment of the code, and can be performed independently of the context in which the code fragment appears. In contrast, [[interprocedural optimization]] requires more compilation time and memory space, but enables optimizations that are only possible by considering the behavior of multiple functions simultaneously.

Interprocedural analysis and optimizations are common in modern commercial compilers from [[Hewlett-Packard|HP]], [[IBM]], [[Silicon Graphics|SGI]], [[Intel]], [[Microsoft]], and [[Sun Microsystems]]. The [[free software]] [[GNU Compiler Collection|GCC]] was criticized for a long time for lacking powerful interprocedural optimizations, but it is changing in this respect. Another open-source compiler with a full analysis and optimization infrastructure is [[Open64]], which is used by many organizations for research and commercial purposes.

Due to the extra time and space needed for compiler analysis and optimizations, some compilers skip them by default. Users have to use compilation options to explicitly tell the compiler which optimizations should be enabled.

==== Back end ====
The back end is responsible for the CPU architecture-specific optimizations and for [[code generation (compiler)|code generation]].<ref name="Hjort Blindell, Gabriel"/>

The main phases of the back end include the following:

* ''Machine-dependent optimizations'': optimizations that depend on the details of the CPU architecture that the compiler targets.<ref>Cooper and Torczon (2012), p. 540</ref> A prominent example is [[peephole optimization]], which rewrites short sequences of assembler instructions into more efficient instructions; a sketch follows this list.
* ''[[Code generation (compiler)|Code generation]]'': the transformed intermediate language is translated into the output language, usually the native [[machine language]] of the system. This involves resource and storage decisions, such as deciding which variables to fit into [[Register allocation|registers]] and memory, and the [[Instruction selection|selection]] and [[Instruction scheduling|scheduling]] of appropriate machine instructions along with their associated [[addressing mode]]s (see also [[Sethi–Ullman algorithm]]). Debug data may also need to be generated to facilitate [[debugging]].
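The peephole pass referenced in the machine-dependent optimizations item can be sketched as a single scan over adjacent instructions. The assembly-like instruction set, its encoding, and the two rewrite rules below are invented purely for illustration.

<syntaxhighlight lang="c">
#include <stdio.h>
#include <string.h>

/* A made-up instruction: mnemonic, destination register, source register, immediate. */
typedef struct { char op[4]; int dst, src, imm; } Instr;

/* Peephole pass: look at one instruction at a time and rewrite wasteful
 * patterns, e.g. multiply-by-two becomes a cheaper shift, and a register
 * moved onto itself is deleted. */
static int peephole(Instr *code, int n) {
    int out = 0;
    for (int i = 0; i < n; i++) {
        Instr ins = code[i];
        if (strcmp(ins.op, "MUL") == 0 && ins.imm == 2) {
            strcpy(ins.op, "SHL");               /* MUL r,r,2  ->  SHL r,r,1 */
            ins.imm = 1;
        } else if (strcmp(ins.op, "MOV") == 0 && ins.dst == ins.src) {
            continue;                            /* MOV r,r is a no-op: drop it */
        }
        code[out++] = ins;
    }
    return out;                                  /* new instruction count */
}

int main(void) {
    Instr code[] = {
        { "MUL", 1, 1, 2 },                      /* r1 = r1 * 2 */
        { "MOV", 2, 2, 0 },                      /* r2 = r2 (useless) */
        { "ADD", 1, 2, 4 },                      /* r1 = r2 + 4 */
    };
    int n = peephole(code, 3);
    for (int i = 0; i < n; i++)
        printf("%s r%d, r%d, %d\n", code[i].op, code[i].dst, code[i].src, code[i].imm);
    return 0;
}
</syntaxhighlight>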
=== Compiler correctness ===
{{Main|Compiler correctness}}
[[Compiler correctness]] is the branch of software engineering that deals with trying to show that a compiler behaves according to its [[programming language|language specification]].<ref>{{Citation |title=S1-A Simple Compiler |date=2012-02-28 |url=http://dx.doi.org/10.1002/9781118112762.ch12 |work=Compiler Construction Using Java, JavaCC, and Yacc |pages=289–329 |access-date=2023-05-17 |place=Hoboken, NJ, US |publisher=John Wiley & Sons, Inc. |doi=10.1002/9781118112762.ch12 |isbn=978-1-118-11276-2|url-access=subscription }}</ref> Techniques include developing the compiler using [[formal methods]] and using rigorous testing (often called compiler validation) on an existing compiler.
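One rough sketch of such testing is differential testing, in which the same program is compiled in two different ways and the observable outputs are compared. The harness below, including the choice of <code>cc</code> at <code>-O0</code> versus <code>-O2</code> and the file names, is an illustrative assumption rather than an established validation procedure.

<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdlib.h>

/* Compile test.c twice and compare the outputs of the two binaries; any
 * difference points to undefined behavior in the test program or to a bug
 * in one of the two compilations. All commands and file names are assumed. */
int main(void) {
    if (system("cc -O0 -o test_O0 test.c") != 0) return 1;
    if (system("cc -O2 -o test_O2 test.c") != 0) return 1;
    if (system("./test_O0 > out_O0.txt") != 0) return 1;
    if (system("./test_O2 > out_O2.txt") != 0) return 1;

    int differ = system("cmp -s out_O0.txt out_O2.txt");
    printf(differ == 0 ? "outputs agree\n" : "outputs differ: possible miscompilation\n");
    return differ == 0 ? 0 : 1;
}
</syntaxhighlight>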