Editing Parsing (section)

== Lookahead ==
{{unreferenced section|date=April 2012}}
[[File:Parsing a C program that needs 2 token lookahead.svg|thumb|300px|[[C (programming language)|C]] program that cannot be parsed with less than 2 token lookahead. ''Top:'' C grammar excerpt.<ref>taken from {{cite book | isbn=0131103628 | author=Brian W. Kernighan and Dennis M. Ritchie | title=The C Programming Language | edition=2nd | location=Englewood Cliffs/NJ | publisher=Prentice Hall | series=Prentice Hall Software Series | date=Apr 1988 | url-access=registration | url=https://archive.org/details/cprogramminglang00bria }} (Appendix A.13 "Grammar", p.193 ff)</ref> ''Bottom:'' a parser has digested the tokens "<syntaxhighlight lang="c" inline>int v;main(){</syntaxhighlight>" and is about to choose a rule to derive ''Stmt''. Looking only at the first lookahead token "<syntaxhighlight lang="c" inline>v</syntaxhighlight>", it cannot decide which of both alternatives for ''Stmt'' to choose; the latter requires peeking at the second token.]]
Lookahead establishes the maximum incoming tokens that a parser can use to decide which rule it should use. Lookahead is especially relevant to [[LL parser|LL]], [[LR parser|LR]], and [[LALR parser]]s, where it is often explicitly indicated by affixing the lookahead to the algorithm name in parentheses, such as LALR(1).

Most [[programming language]]s, the primary target of parsers, are carefully defined in such a way that a parser with limited lookahead, typically one, can parse them, because parsers with limited lookahead are often more efficient. One important change{{Citation needed|date=December 2008}} to this trend came in 1990 when [[Terence Parr]] created [[ANTLR]] for his Ph.D. thesis, a [[parser generator]] for efficient LL(''k'') parsers, where ''k'' is any fixed value.

LR parsers typically have only a few actions after seeing each token. They are shift (add this token to the stack for later reduction), reduce (pop tokens from the stack and form a syntactic construct), end, error (no known rule applies) or conflict (does not know whether to shift or reduce).

Lookahead has two advantages.{{clarify|reason=This paragraph still seems to apply only to LR parsers.|date=April 2019}}
* It helps the parser take the correct action in case of conflicts. For example, parsing the if statement in the case of an else clause.
* It eliminates many duplicate states and eases the burden of an extra stack. A C language non-lookahead parser will have around 10,000 states. A lookahead parser will have around 300 states.

Example: Parsing the Expression  {{nowrap|1 + 2 * 3}}
{{dubious|date=April 2019}}
{| class="toccolours"
| colspan=3 | Set of expression parsing rules (called grammar) is as follows,
|-
| Rule1: || E → E + E  || style="padding-left:1em" | Expression is the sum of two expressions.
|-
| Rule2: || E → E * E  || style="padding-left:1em" |Expression is the product of two expressions.
|-
| Rule3: || E → number || style="padding-left:1em" |Expression is a simple number
|-
| Rule4: || colspan=2 | + has less precedence than *
|}
Most programming languages (except for a few such as APL and Smalltalk) and algebraic formulas give higher precedence to multiplication than addition, in which case the correct interpretation of the example above is {{nowrap|1 + (2 * 3)}}.
Note that Rule4 above is a semantic rule. It is possible to rewrite the grammar to incorporate this into the syntax. However, not all such rules can be translated into syntax.

; Simple non-lookahead parser actions
Initially Input = [1, +, 2, *, 3]
# Shift "1" onto stack from input (in anticipation of rule3). Input = [+, 2, *, 3] Stack = [1]
# Reduces "1" to expression "E" based on rule3. Stack = [E]
# Shift "+" onto stack from input (in anticipation of rule1). Input = [2, *, 3] Stack = [E, +]
# Shift "2" onto stack from input (in anticipation of rule3). Input = [*, 3] Stack = [E, +, 2]
# Reduce stack element "2" to Expression "E" based on rule3. Stack = [E, +, E]
# Reduce stack items [E, +, E] and new input "E" to "E" based on rule1. Stack = [E]
# Shift "*" onto stack from input (in anticipation of rule2). Input = [3] Stack = [E,*]
# Shift "3" onto stack from input (in anticipation of rule3). Input = [] (empty) Stack = [E, *, 3]
# Reduce stack element "3" to expression "E" based on rule3. Stack = [E, *, E]
# Reduce stack items [E, *, E] and new input "E" to "E" based on rule2. Stack = [E]

The parse tree and resulting code from it is not correct according to language semantics.

To correctly parse without lookahead, there are three solutions:
* The user has to enclose expressions within parentheses. This often is not a viable solution.
* The parser needs to have more logic to backtrack and retry whenever a rule is violated or not complete. The similar method is followed in LL parsers.
* Alternatively, the parser or grammar needs to have extra logic to delay reduction and reduce only when it is absolutely sure which rule to reduce first. This method is used in LR parsers. This correctly parses the expression but with many more states and increased stack depth.

; Lookahead parser actions{{clarify|reason=While the previous text is highly dubious, the following paragraph could possibly turned into a sensible explanation about how an LR parser uses lookahead. To this end, a parser table excerpt (implementing the precedence) should be shown, the parsing mechnism should be sketched (when to shift, when to reduce, etc.), and the exaple run should be given in more tabular form, and without magic ('anticipation').|date=April 2019}}
# Shift 1 onto stack on input 1 in anticipation of rule3. It does not reduce immediately.
# Reduce stack item 1 to simple Expression on input + based on rule3. The lookahead is +, so we are on path to E +, so we can reduce the stack to E.
# Shift + onto stack on input + in anticipation of rule1.
# Shift 2 onto stack on input 2 in anticipation of rule3.
# Reduce stack item 2 to Expression on input * based on rule3. The lookahead * expects only E before it.
# Now stack has E + E and still the input is *. It has two choices now, either to shift based on rule2 or reduction based on rule1. Since * has higher precedence than + based on rule4, we shift * onto stack in anticipation of rule2.
# Shift 3 onto stack on input 3 in anticipation of rule3.
# Reduce stack item 3 to Expression after seeing end of input based on rule3.
# Reduce stack items E * E to E based on rule2.
# Reduce stack items E + E to E based on rule1.
The parse tree generated is correct and simply {{clarify span|more efficient|reason=A parse tree and a parser cannot be compared w.r.t. efficiency. Even comparing two parse trees dosn't make sense here; expression efficiency isn't a matter of parsing, but of optimization.|date=April 2019}}{{Citation needed|date=April 2011}} than non-lookahead parsers. This is the strategy followed in [[LALR parser]]s.