Editing LR parser (section)

== Overview ==

=== Bottom-up parse tree for example A * 2 + 1 ===

[[File:Shift-Reduce Parse Steps for A*2+1.svg|thumb|x200px|right|Bottom-up parse tree built in numbered steps]]

An LR parser scans and parses the input text in one forward pass over the text.  The parser builds up the [[parse tree]] incrementally, bottom up, and left to right, without guessing or backtracking.  At every point in this pass, the parser has accumulated a list of subtrees or phrases of the input text that have been already parsed.  Those subtrees are not yet joined together because the parser has not yet reached the right end of the syntax pattern that will combine them.

At step 6 in an example parse, only "A * 2" has been parsed, incompletely.  Only the shaded lower-left corner of the parse tree exists.  None of the parse tree nodes numbered 7 and above exist yet.  Nodes 3, 4, and 6 are the roots of isolated subtrees for variable A, operator *, and number 2, respectively.  These three root nodes are temporarily held in a parse stack.  The remaining unparsed portion of the input stream is "+ 1".

=== Shift and reduce actions ===

As with other shift-reduce parsers, an LR parser works by doing some combination of Shift steps and Reduce steps.  
* A '''Shift''' step advances in the input stream by one symbol.  That shifted symbol becomes a new single-node parse tree.
* A '''Reduce''' step applies a completed grammar rule to some of the recent parse trees, joining them together as one tree with a new root symbol.
If the input has no syntax errors, the parser continues with these steps until all of the input has been consumed and all of the parse trees have been reduced to a single tree representing an entire legal input.

LR parsers differ from other shift-reduce parsers in how they decide when to reduce, and how to pick between rules with similar endings.  But the final decisions and the sequence of shift or reduce steps are the same.

Much of the LR parser's efficiency is from being deterministic. To avoid guessing, the LR parser often looks ahead (rightwards) at the next scanned symbol, before deciding what to do with previously scanned symbols.  The lexical scanner works one or more symbols ahead of the parser.  The '''lookahead''' symbols are the 'right-hand context' for the parsing decision.<ref>
Engineering a Compiler (2nd edition), by Keith Cooper and Linda Torczon, [[Morgan Kaufmann]] 2011.</ref>

=== Bottom-up parse stack ===

[[File:Bottom-Up Parser.svg|thumb|x250px|right|Bottom-up parser at step 6]]

Like other shift-reduce parsers, an LR parser lazily waits until it has scanned and parsed all parts of some construct before committing to what the combined construct is.  The parser then acts immediately on the combination instead of waiting any further.  In the parse tree example, the phrase A gets reduced to Value and then to Products in steps 1-3 as soon as lookahead * is seen, rather than waiting any later to organize those parts of the parse tree.  The decisions for how to handle A are based only on what the parser and scanner have already seen, without considering things that appear much later to the right.

Reductions reorganize the most recently parsed things, immediately to the left of the lookahead symbol.  So the list of already-parsed things acts like a [[stack (abstract data type)|stack]].  This '''parse stack''' grows rightwards.  The base or bottom of the stack is on the left and holds the leftmost, oldest parse fragment.  Every reduction step acts only on the rightmost, newest parse fragments.  (This accumulative parse stack is very unlike the predictive, leftward-growing parse stack used by [[top-down parser]]s.)

=== Bottom-up parse steps for example A * 2 + 1 ===

{| class="wikitable"
|-
! <small>Step</small> !! <small>Parse Stack</small>  !!  <small>Unparsed</small> !! <small>Shift/Reduce</small>
|-
| 0 || ''empty'' || align="right" | A * 2 + 1 || shift
|-
| 1 || ''id'' || align="right" | * 2 + 1 || Value → ''id''  
|-
| 2 || Value || align="right" | * 2 + 1 || Products → Value 
|-
| 3 || Products || align="right" | * 2 + 1 || shift 
|-
| 4 || Products * || align="right" | 2 + 1 || shift
|-
| 5 || Products * ''int'' || align="right" | + 1 || Value → ''int''
|-
| 6 || Products * Value || align="right" | + 1 ||  Products → Products * Value
|-
| 7 || Products || align="right" | + 1 || Sums → Products 
|-
| 8 || Sums || align="right" | + 1 || shift  
|-
| 9 || Sums + || align="right" | 1 || shift 
|- 
| 10 || Sums + ''int'' || ''eof'' || Value → ''int'' 
|-
| 11 || Sums + Value || ''eof'' || Products → Value
|-
| 12 || Sums + Products || ''eof'' ||  Sums → Sums + Products 
|-
| 13 || Sums || ''eof'' || accept 
|-
|}

Step 6 applies a grammar rule with multiple parts:
: Products → Products * Value
This matches the stack top holding the parsed phrases "... Products * Value".  The reduce step replaces this instance of the rule's right hand side, "Products * Value" by the rule's left hand side symbol, here a larger Products.  If the parser builds complete parse trees, the three trees for inner Products, *, and Value are combined by a new tree root for Products.  Otherwise, [[Semantics#Programming languages|semantic]]{{Broken anchor|date=2025-04-28|bot=User:Cewbot/log/20201008/configuration|target_link=Semantics#Programming languages|reason=Anchor "Semantics#Programming languages" links to a specific web page: "Programming languages". The anchor (Programming languages) [[Special:Diff/1208968924|has been deleted]].|diff_id=1208968924}} details from the inner Products and Value are output to some later compiler pass, or are combined and saved in the new Products symbol.<ref>
Crafting and Compiler, by Charles Fischer, Ron Cytron, and Richard LeBlanc, Addison Wesley 2009.</ref>

=== LR parse steps for example A * 2 + 1 ===

In LR parsers, the shift and reduce decisions are potentially based on the entire stack of everything that has been previously parsed, not just on a single, topmost stack symbol. If done in an unclever way, that could lead to very slow parsers that get slower and slower for longer inputs. LR parsers do this with constant speed, by summarizing all the relevant left context information into a single number called the LR(0) '''parser state'''. For each grammar and LR analysis method, there is a fixed (finite) number of such states. Besides holding the already-parsed symbols, the parse stack also remembers the state numbers reached by everything up to those points.

At every parse step, the entire input text is divided into a stack of previously parsed phrases, a current look-ahead symbol, and the remaining unscanned text.  The parser's next action is determined by its current LR(0) {{color|#928|state number}} (rightmost on the stack) and the lookahead symbol. In the steps below, all the black details are exactly the same as in other non-LR shift-reduce parsers. LR parser stacks add the state information in purple, summarizing the black phrases to their left on the stack and what syntax possibilities to expect next.  Users of an LR parser can usually ignore state information.  These states are explained in a later section.

{| class="wikitable"
|-
! <small>{{br}}Step</small> !! <small>Parse Stack{{br}}<sub>{{color|#928|state}}</sub> [Symbol<sub>{{color|#928|state}}</sub>]* </small> !! <small>Look{{br}}Ahead</small> !! <small>{{br}}Unscanned</small> !! <small>Parser{{br}}Action</small> !! <small>{{br}}Grammar Rule</small>|| <small>Next{{br}}State</small>
|-
|align="center"| 0 || <sub>{{color|#928|0}}</sub>
|| ''id'' || align="right" | * 2 + 1 || shift ||  ||align="center"| {{color|#928|9}}
|-
|align="center"| 1 || <sub>{{color|#928|0}}</sub> ''id''<sub>{{color|#928|9}}</sub> 
|| *  ||  align="right" | 2 + 1 || reduce || Value → ''id'' ||align="center"| {{color|#928|7}}
|-
|align="center"| 2 || <sub>{{color|#928|0}}</sub> Value<sub>{{color|#928|7}}</sub>
|| * ||  align="right" | 2 + 1 || reduce || Products → Value ||align="center"| {{color|#928|4}}
|-
|align="center"| 3 || <sub>{{color|#928|0}}</sub> Products<sub>{{color|#928|4}}</sub>
|| * ||  align="right" | 2 + 1 || shift || ||align="center"| {{color|#928|5}}
|-
|align="center"| 4 || <sub>{{color|#928|0}}</sub> Products<sub>{{color|#928|4}}</sub> *<sub>{{color|#928|5}}</sub> 
|| ''int'' ||  align="right" | + 1 || shift ||  ||align="center"| {{color|#928|8}}
|-
|align="center"| 5 || <sub>{{color|#928|0}}</sub> Products<sub>{{color|#928|4}}</sub> *<sub>{{color|#928|5}}</sub> ''int''<sub>{{color|#928|8}}</sub> 
|| + ||  align="right" | 1 || reduce || Value → ''int'' ||align="center"| {{color|#928|6}}
|-
|align="center"| 6 || <sub>{{color|#928|0}}</sub> Products<sub>{{color|#928|4}}</sub> *<sub>{{color|#928|5}}</sub> Value<sub>{{color|#928|6}}</sub>
|| + ||  align="right" | 1 || reduce || Products → Products * Value ||align="center"| {{color|#928|4}}
|-
|align="center"| 7 || <sub>{{color|#928|0}}</sub> Products<sub>{{color|#928|4}}</sub> 
|| + || align="right" | 1 || reduce || Sums → Products ||align="center"| {{color|#928|1}}
|-
|align="center"| 8 || <sub>{{color|#928|0}}</sub> Sums<sub>{{color|#928|1}}</sub> 
|| + || align="right" | 1 || shift ||  ||align="center"| {{color|#928|2}}
|-
|align="center"| 9 || <sub>{{color|#928|0}}</sub> Sums<sub>{{color|#928|1}}</sub> +<sub>{{color|#928|2}}</sub>  
|| ''int'' || align="right" | ''eof'' || shift ||  ||align="center"| {{color|#928|8}}
|- 
|align="center"| 10 || <sub>{{color|#928|0}}</sub> Sums<sub>{{color|#928|1}}</sub> +<sub>{{color|#928|2}}</sub> ''int''<sub>{{color|#928|8}}</sub>
|| ''eof'' ||  || reduce || Value → ''int'' ||align="center"| {{color|#928|7}}
|-
|align="center"| 11 || <sub>{{color|#928|0}}</sub> Sums<sub>{{color|#928|1}}</sub> +<sub>{{color|#928|2}}</sub> Value<sub>{{color|#928|7}}</sub> 
|| ''eof'' ||  || reduce || Products → Value ||align="center"| {{color|#928|3}}
|-
|align="center"| 12 || <sub>{{color|#928|0}}</sub> Sums<sub>{{color|#928|1}}</sub> +<sub>{{color|#928|2}}</sub> Products<sub>{{color|#928|3}}</sub>
|| ''eof'' ||  || reduce || Sums → Sums + Products ||align="center"| {{color|#928|1}}
|-
|align="center"| 13 || <sub>{{color|#928|0}}</sub> Sums<sub>{{color|#928|1}}</sub> 
|| ''eof'' ||  || accept ||  || 
|-
|}

At initial step 0, the input stream "A * 2 + 1" is divided into
* an empty section on the parse stack,
* lookahead text "A" scanned as an ''id'' symbol, and
* the remaining unscanned text "* 2 + 1".
The parse stack begins by holding only initial state 0.  When state 0 sees the lookahead ''id'', it knows to shift that ''id'' onto the stack, and scan the next input symbol '''*''', and advance to state 9.

{{rule}}

At step 4, the total input stream "A * 2 + 1" is currently divided into 
* the parsed section "A&nbsp;*" with 2 stacked phrases Products and '''*''',
* lookahead text "2" scanned as an ''int'' symbol, and
* the remaining unscanned text " + 1".
The states corresponding to the stacked phrases are 0, 4, and 5.  The current, rightmost state on the stack is state 5.  When state 5 sees the lookahead ''int'', it knows to shift that ''int'' onto the stack as its own phrase, and scan the next input symbol '''+''', and advance to state 8.

{{rule}}

At step 12, all of the input stream has been consumed but only partially organized.  The current state is 3.  When state 3 sees the lookahead ''eof'', it knows to apply the completed grammar rule
:: Sums → Sums + Products
by combining the stack's rightmost three phrases for Sums, '''+''', and Products into one thing.  State 3 itself doesn't know what the next state should be.  This is found by going back to state 0, just to the left of the phrase being reduced.  When state 0 sees this new completed instance of a Sums, it advances to state 1 (again).  This consulting of older states is why they are kept on the stack, instead of keeping only the current state.

=== Grammar for the example A * 2 + 1===

LR parsers are constructed from a grammar that formally defines the syntax of the input language as a set of patterns.  The grammar doesn't cover all language rules, such as the size of numbers, or the consistent use of names and their definitions in the context of the whole program.  LR parsers use a [[context-free grammar]] that deals just with local patterns of symbols.

The example grammar used here is a tiny subset of the Java or C language:

::r0: Goal → Sums ''eof''
::r1: Sums → Sums + Products
::r2: Sums → Products
::r3: Products → Products * Value
::r4: Products → Value
::r5: Value → ''int''
::r6: Value → ''id''

The grammar's [[terminal symbol]]s are the multi-character symbols or 'tokens' found in the input stream by a [[lexical analysis|lexical scanner]].  Here these include '''+''' '''*''' and ''int'' for any integer constant, and ''id'' for any identifier name, and ''eof'' for end of input file.  The grammar doesn't care what the ''int'' values or ''id'' spellings are, nor does it care about blanks or line breaks.  The grammar uses these terminal symbols but does not define them.  They are always leaf nodes (at the bottom bushy end) of the parse tree.

The capitalized terms like Sums are [[nonterminal symbol]]s.  These are names for concepts or patterns in the language. They are defined in the grammar and never occur themselves in the input stream.  They are always internal nodes (above the bottom) of the parse tree. They only happen as a result of the parser applying some grammar rule.  Some nonterminals are defined with two or more rules; these are alternative patterns.  Rules can refer back to themselves, which are called ''recursive''.  This grammar uses recursive rules to handle repeated math operators.  Grammars for complete languages use recursive rules to handle lists, parenthesized expressions, and nested statements.

Any given computer language can be described by several different grammars.  An LR(1) parser can handle many but not all common grammars.  It is usually possible to manually modify a grammar so that it fits the limitations of LR(1) parsing and the generator tool.

The grammar for an LR parser must be [[ambiguous grammar|unambiguous]] itself, or must be augmented by tie-breaking precedence rules.  This means there is only one correct way to apply the grammar to a given legal example of the language, resulting in a unique parse tree with just one meaning, and a unique sequence of shift/reduce actions for that example.  LR parsing is not a useful technique for human languages with ambiguous grammars that depend on the interplay of words.  Human languages are better handled by parsers like [[Generalized LR parser]], the [[Earley parser]], or the [[CYK algorithm]] that can simultaneously compute all possible parse trees in one pass.

=== Parse table for the example grammar ===

Most LR parsers are table driven. The parser's program code is a simple generic loop that is the same for all grammars and languages. The knowledge of the grammar and its syntactic implications are encoded into unchanging data tables called '''parse tables''' (or '''parsing tables'''). Entries in a table show whether to shift or reduce (and by which grammar rule), for every legal combination of parser state and lookahead symbol.  The parse tables also tell how to compute the next state, given just a current state and a next symbol.

The parse tables are much larger than the grammar.  LR tables are hard to accurately compute by hand for big grammars.  So they are mechanically derived from the grammar by some [[parser generator]] tool like [[GNU Bison|Bison]].<ref>Flex & Bison: Text Processing Tools, by John Levine, O'Reilly Media 2009.</ref>

Depending on how the states and parsing table are generated, the resulting parser is called either a [[simple LR parser|'''SLR''' (simple LR) parser]], [[LALR parser|'''LALR''' (look-ahead LR) parser]], or [[canonical LR parser]]. LALR parsers handle more grammars than SLR parsers. Canonical LR parsers handle even more grammars, but use many more states and much larger tables.  The example grammar is SLR.

LR parse tables are two-dimensional.  Each current LR(0) parser state has its own row.  Each possible next symbol has its own column.  Some combinations of state and next symbol are not possible for valid input streams.  These blank cells trigger syntax error messages.

The '''Action''' left half of the table has columns for lookahead terminal symbols.  These cells determine whether the next parser action is shift (to state ''n''), or reduce (by grammar rule '''r'''<sub>''n''</sub>).

The '''Goto''' right half of the table has columns for nonterminal symbols.  These cells show which state to advance to, after some reduction's Left Hand Side has created an expected new instance of that symbol.  This is like a shift action but for nonterminals; the lookahead terminal symbol is unchanged.

The table column "Current Rules" documents the meaning and syntax possibilities for each state, as worked out by the parser generator.  It is not included in the actual tables used at parsing time.  The <big>{{color|#f7f|•}}</big> (pink dot) marker shows where the parser is now, within some partially recognized grammar rules.  The things to the left of <big>{{color|#f7f|•}}</big> have been parsed, and the things to the right are expected soon.  A state has several such current rules if the parser has not yet narrowed possibilities down to a single rule.

{| class="wikitable"
|-
! Curr !!  !! colspan="5" | Lookahead !! !! colspan="3" | LHS Goto
|-
! State !! Current Rules !! ''int'' !! ''id'' !! *  &nbsp; !! + &nbsp; !! ''eof'' !! !! Sums !! Products !! Value
|-
| 0 || Goal → <big>{{color|#f7f|•}}</big> Sums ''eof'' || 8 || 9 ||  ||  ||  |||| 1 || 4 || 7 
|-
| 1 || Goal → Sums <big>{{color|#f7f|•}}</big> ''eof'' {{br}} Sums → Sums <big>{{color|#f7f|•}}</big> + Products ||  ||  ||  || {{br}}2 || accept{{br}}&nbsp; ||||  ||  ||  
|-
| 2 || Sums → Sums + <big>{{color|#f7f|•}}</big> Products || 8 || 9 ||  ||  ||  ||||  || 3 || 7 
|-
| 3 || Sums → Sums + Products <big>{{color|#f7f|•}}</big> {{br}} Products → Products <big>{{color|#f7f|•}}</big> * Value ||  ||  || {{br}}5 || r1 {{br}}&nbsp; || r1 {{br}}&nbsp; ||||  ||  ||  
|-
| 4 || Sums → Products <big>{{color|#f7f|•}}</big> {{br}} Products → Products <big>{{color|#f7f|•}}</big> * Value ||  ||  || {{br}}5 || r2 {{br}}&nbsp; || r2 {{br}}&nbsp; ||||  ||  ||  
|-
| 5 || Products → Products * <big>{{color|#f7f|•}}</big> Value || 8 || 9 ||  ||  ||  ||||  ||  || 6 
|-
| 6 || Products → Products * Value <big>{{color|#f7f|•}}</big> ||  ||  || r3 || r3 || r3 ||||  ||  ||  
|-
| 7 || Products → Value <big>{{color|#f7f|•}}</big> ||  ||  || r4 || r4 || r4 ||||  ||  ||  
|-
| 8 || Value → ''int'' <big>{{color|#f7f|•}}</big> ||  ||  || r5 || r5 || r5 ||||  ||  ||  
|-
| 9 || Value → ''id'' <big>{{color|#f7f|•}}</big> ||  ||  || r6 || r6 || r6 ||||  ||  ||  
|-
|}

In state 2 above, the parser has just found and shifted-in the '''+''' of grammar rule
::r1: Sums → Sums + <big>{{color|#f7f|•}}</big> Products
The next expected phrase is Products.  Products begins with terminal symbols ''int'' or ''id''.  If the lookahead is either of those, the parser shifts them in and advances to state 8 or 9, respectively.  When a Products has been found, the parser advances to state 3 to accumulate the complete list of summands and find the end of rule r0.  A Products can also begin with nonterminal  Value.  For any other lookahead or nonterminal, the parser announces a syntax error.

{{rule}}

In state 3, the parser has just found a Products phrase, that could be from two possible grammar rules:
::r1: Sums → Sums + Products <big>{{color|#f7f|•}}</big>
::r3: Products → Products <big>{{color|#f7f|•}}</big> * Value
The choice between r1 and r3 can't be decided just from looking backwards at prior phrases.  The parser has to check the lookahead symbol to tell what to do.  If the lookahead is '''*''', it is in rule 3, so the parser shifts in the '''*''' and advances to state 5.  If the lookahead is ''eof'', it is at the end of rule 1 and rule 0, so the parser is done.

{{rule}}

In state 9 above, all the non-blank, non-error cells are for the same reduction r6.  Some parsers save time and table space by not checking the lookahead symbol in these simple cases.  Syntax errors are then detected somewhat later, after some harmless reductions, but still before the next shift action or parser decision.

Individual table cells must not hold multiple, alternative actions, otherwise the parser would be nondeterministic with guesswork and backtracking.  If the grammar is not LR(1), some cells will have shift/reduce conflicts between a possible shift action and reduce action, or reduce/reduce conflicts between multiple grammar rules.  LR(''k'') parsers resolve these conflicts (where possible) by checking additional lookahead symbols beyond the first.

=== LR parser loop ===

The LR parser begins with a nearly empty parse stack containing just the start state 0, and with the lookahead holding the input stream's first scanned symbol.  The parser then repeats the following loop step until done, or stuck on a syntax error:

The topmost state on the parse stack is some state ''s'', and the current lookahead is some terminal symbol ''t''.  Look up the next parser action from row ''s'' and column ''t'' of the Lookahead Action table.  That action is either Shift, Reduce, Accept, or Error:
* Shift ''n'':
::Shift the matched terminal ''t'' onto the parse stack and scan the next input symbol into the lookahead buffer.
::Push next state ''n'' onto the parse stack as the new current state.
* Reduce r<sub>''m''</sub>:  Apply grammar rule r<sub>''m''</sub>: Lhs → S<sub>1</sub> S<sub>2</sub> ... S<sub>L</sub>
::Remove the matched topmost L symbols (and parse trees and associated state numbers) from the parse stack.
::This exposes a prior state ''p'' that was expecting an instance of the Lhs symbol.
::Join the L parse trees together as one parse tree with new root symbol Lhs.
::Lookup the next state ''n'' from row ''p'' and column ''Lhs'' of the LHS Goto table.
::Push the symbol and tree for Lhs onto the parse stack.
::Push next state ''n'' onto the parse stack as the new current state.
::The lookahead and input stream remain unchanged.
* Accept: Lookahead ''t'' is the ''eof'' marker.  End of parsing. If the state stack contains just the start state report success. Otherwise, report a syntax error.
* Error: Report a syntax error.  The parser ends, or attempts some recovery.

LR parser stack usually stores just the LR(0) automaton states, as the grammar symbols may be derived from them (in the automaton, all input transitions to some state are marked with the same symbol, which is the symbol associated with this state). Moreover, these symbols are almost never needed as the state is all that matters when making the parsing decision.<ref name="Compilers 2006">Compilers: Principles, Techniques, and Tools (2nd Edition), by Alfred Aho, Monica Lam, Ravi Sethi, and Jeffrey Ullman, Prentice Hall 2006.</ref>