== LR generator analysis ==
This section of the article can be skipped by most users of LR parser generators.

=== LR states ===
State 2 in the example parse table is for the partially parsed rule
:: r1: Sums → Sums + <big>{{color|#f7f|•}}</big> Products
This shows how the parser got here, by seeing Sums then '''+''' while looking for a larger Sums. The <big>{{color|#f7f|•}}</big> marker has advanced beyond the beginning of the rule. It also shows how the parser expects to eventually complete the rule, by next finding a complete Products. But more details are needed on how to parse all the parts of that Products.

The partially parsed rules for a state are called its "core LR(0) items". The parser generator adds additional rules or items for all the possible next steps in building up the expected Products:
:: r3: Products → <big>{{color|#f7f|•}}</big> Products * Value
:: r4: Products → <big>{{color|#f7f|•}}</big> Value
:: r5: Value → <big>{{color|#f7f|•}}</big> ''int''
:: r6: Value → <big>{{color|#f7f|•}}</big> ''id''
The <big>{{color|#f7f|•}}</big> marker is at the beginning of each of these added rules; the parser has not yet confirmed and parsed any part of them. These additional items are called the "closure" of the core items. For each nonterminal symbol immediately following a <big>{{color|#f7f|•}}</big>, the generator adds the rules defining that symbol. This adds more <big>{{color|#f7f|•}}</big> markers, and possibly different follower symbols. This closure process continues until all follower symbols have been expanded. The follower nonterminals for state 2 begin with Products. Value is then added by closure. The follower terminals are ''int'' and ''id''.

The core and closure items together show all possible legal ways to proceed from the current state to future states and complete phrases. If a follower symbol appears in only one item, it leads to a next state containing only one core item, with the <big>{{color|#f7f|•}}</big> marker advanced.
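The closure construction just described can be sketched in Python. The grammar encoding, item representation, and function names here are illustrative choices, not taken from any particular parser generator:

```python
# Sketch of LR(0) item closure for the article's example grammar.
# An item is a tuple (lhs, rhs, dot): rule lhs -> rhs, with the dot
# marking how much of rhs has been parsed so far.
GRAMMAR = {
    "Sums":     [["Sums", "+", "Products"], ["Products"]],
    "Products": [["Products", "*", "Value"], ["Value"]],
    "Value":    [["int"], ["id"]],
}

def closure(items):
    """Expand a set of core items with all rules for each nonterminal
    that appears immediately after a dot, until nothing new is added."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(result):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:   # nonterminal after dot
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], tuple(prod), 0)    # new item, dot at start
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

# State 2's single core item:  Sums -> Sums + . Products
core = {("Sums", ("Sums", "+", "Products"), 2)}
for lhs, rhs, dot in sorted(closure(core)):
    print(lhs, "->", " ".join(rhs[:dot]), ".", " ".join(rhs[dot:]))
```

Running this on state 2's core yields the four added items r3–r6 shown above, alongside the core item itself.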
So ''int'' leads to next state 8 with core
:: r5: Value → ''int'' <big>{{color|#f7f|•}}</big>
If the same follower symbol appears in several items, the parser cannot yet tell which rule applies here. So that symbol leads to a next state that shows all remaining possibilities, again with the <big>{{color|#f7f|•}}</big> marker advanced. Products appears in both r1 and r3. So Products leads to next state 3 with core
:: r1: Sums → Sums + Products <big>{{color|#f7f|•}}</big>
:: r3: Products → Products <big>{{color|#f7f|•}}</big> * Value
In words, that means if the parser has seen a single Products, it might be done, or it might still have even more things to multiply together. All the core items have the same symbol preceding the <big>{{color|#f7f|•}}</big> marker; all transitions into this state are always with that same symbol.

Some transitions will be to cores and states that have been enumerated already. Other transitions lead to new states. The generator starts with the grammar's goal rule. From there it keeps exploring known states and transitions until all needed states have been found.

These states are called "LR(0)" states because they use a lookahead of ''k''=0, i.e. no lookahead. The only checking of input symbols occurs when the symbol is shifted in. Checking of lookaheads for reductions is done separately by the parse table, not by the enumerated states themselves.

=== Finite state machine ===
The parse table describes all possible LR(0) states and their transitions. They form a [[finite-state machine]] (FSM). An FSM is a simple engine for parsing simple unnested languages, without using a stack. In this LR application, the FSM's modified "input language" has both terminal and nonterminal symbols, and covers any partially parsed stack snapshot of the full LR parse. Recall step 5 of the Parse Steps Example:
{| class="wikitable"
|-
! <small>{{br}}Step</small> !! <small>Parse Stack{{br}}<sub>{{color|#928|state}}</sub> Symbol <sub>{{color|#928|state}}</sub> ...</small> !! <small>Look{{br}}Ahead</small> !! <small>{{br}}Unscanned</small>
|-
| 5 || <sub>{{color|#928|0}}</sub> Products<sub>{{color|#928|4}}</sub> *<sub>{{color|#928|5}}</sub> ''int''<sub>{{color|#928|8}}</sub> || + || align="right" | 1
|}
The parse stack shows a series of state transitions, from the start state 0, to state 4 and then on to 5 and current state 8. The symbols on the parse stack are the shift or goto symbols for those transitions. Another way to view this is that the finite state machine can scan the stream "Products * ''int'' + 1" (without using yet another stack) and find the leftmost complete phrase that should be reduced next. And that is indeed its job!

How can a mere FSM do this when the original unparsed language has nesting and recursion and definitely requires an analyzer with a stack? The trick is that everything to the left of the stack top has already been fully reduced. This eliminates all the loops and nesting from those phrases. The FSM can ignore all the older beginnings of phrases, and track just the newest phrases that might be completed next. The obscure name for this in LR theory is "viable prefix".

=== Lookahead sets ===
The states and transitions give all the needed information for the parse table's shift actions and goto actions. The generator also needs to calculate the expected lookahead sets for each reduce action.

In '''SLR''' parsers, these lookahead sets are determined directly from the grammar, without considering the individual states and transitions. For each nonterminal S, the SLR generator works out Follow(S), the set of all the terminal symbols which can immediately follow some occurrence of S. In the parse table, each reduction to S uses Follow(S) as its LR(1) lookahead set. Such follow sets are also used by generators for LL top-down parsers.
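The SLR Follow-set computation can be sketched as a small fixed-point iteration over the grammar. The encoding below is illustrative, and the "Goal" rule is a hypothetical augmentation; this grammar has no empty rules and never places one nonterminal directly after another, so the usual FIRST-set bookkeeping collapses to the two simple cases handled here:

```python
# Sketch of the SLR Follow-set computation for the example grammar.
GRAMMAR = [
    ("Goal",     ["Sums", "eof"]),            # hypothetical augmented goal rule
    ("Sums",     ["Sums", "+", "Products"]),  # r1
    ("Sums",     ["Products"]),               # r2
    ("Products", ["Products", "*", "Value"]), # r3
    ("Products", ["Value"]),                  # r4
    ("Value",    ["int"]),                    # r5
    ("Value",    ["id"]),                     # r6
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def follow_sets():
    """Follow(S): every terminal that can appear immediately after S
    in some sentential form, computed to a fixed point."""
    follow = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                if i + 1 < len(rhs):
                    # the symbol right after S (a terminal here) follows S
                    new = {rhs[i + 1]} - follow[sym]
                else:
                    # S ends the rule: whatever follows lhs also follows S
                    new = follow[lhs] - follow[sym]
                if new:
                    follow[sym] |= new
                    changed = True
    return follow

for nt, terminals in sorted(follow_sets().items()):
    print(nt, "->", sorted(terminals))
```

This yields Follow(Sums) = {eof, +} and Follow(Products) = Follow(Value) = {eof, +, *}, matching the reduce lookaheads used in the example parse table.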
A grammar that has no shift/reduce or reduce/reduce conflicts when using Follow sets is called an SLR grammar.

'''LALR''' parsers have the same states as SLR parsers, but use a more complicated, more precise way of working out the minimum necessary reduction lookaheads for each individual state. Depending on the details of the grammar, this may turn out to be the same as the Follow set computed by SLR parser generators, or it may turn out to be a subset of the SLR lookaheads. Some grammars can be handled by LALR parser generators but not by SLR parser generators. This happens when the grammar has spurious shift/reduce or reduce/reduce conflicts using Follow sets, but no conflicts when using the exact sets computed by the LALR generator. The grammar is then called LALR(1) but not SLR.

An SLR or LALR parser avoids having duplicate states. But this minimization is not necessary, and can sometimes create unnecessary lookahead conflicts. '''Canonical LR''' parsers use duplicated (or "split") states to better remember the left and right context of a nonterminal's use. Each occurrence of a symbol S in the grammar can be treated independently with its own lookahead set, to help resolve reduction conflicts. This handles a few more grammars. Unfortunately, this greatly magnifies the size of the parse tables if done for all parts of the grammar. This splitting of states can also be done manually and selectively with any SLR or LALR parser, by making two or more named copies of some nonterminals. A grammar that is conflict-free for a canonical LR generator but has conflicts in an LALR generator is called LR(1) but not LALR(1), and not SLR.

SLR, LALR, and canonical LR parsers make exactly the same shift and reduce decisions when the input stream is in the correct language. When the input has a syntax error, the LALR parser may do some additional (harmless) reductions before detecting the error that the canonical LR parser would not. And the SLR parser may do even more.
This happens because the SLR and LALR parsers are using a generous superset approximation to the true, minimal lookahead symbols for that particular state.

=== Syntax error recovery ===
LR parsers can generate somewhat helpful error messages for the first syntax error in a program, by simply enumerating all the terminal symbols that could have appeared next instead of the unexpected bad lookahead symbol. But this does not help the parser work out how to parse the remainder of the input program to look for further, independent errors. If the parser recovers badly from the first error, it is very likely to mis-parse everything else and produce a cascade of unhelpful spurious error messages.

In the [[yacc]] and bison parser generators, the parser has an ad hoc mechanism to abandon the current statement, discard some parsed phrases and lookahead tokens surrounding the error, and resynchronize the parse at some reliable statement-level delimiter like semicolons or braces. This often works well for allowing the parser and compiler to look over the rest of the program.

Many syntactic coding errors are simple typos or omissions of a trivial symbol. Some LR parsers attempt to detect and automatically repair these common cases. The parser enumerates every possible single-symbol insertion, deletion, or substitution at the error point. The compiler does a trial parse with each change to see if it succeeds. (This requires backtracking to snapshots of the parse stack and input stream, normally unneeded by the parser.) Some best repair is picked. This gives a very helpful error message and resynchronizes the parse well. However, the repair is not trustworthy enough to permanently modify the input file. Repair of syntax errors is easiest to do consistently in parsers (like LR) that have parse tables and an explicit data stack.

=== Variants of LR parsers ===
The LR parser generator decides what should happen for each combination of parser state and lookahead symbol.
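The single-symbol repair enumeration can be sketched as below. Everything here is a simplification for illustration: `TERMINALS`, the "first accepted candidate" policy, and especially `try_parse`, which is a toy stand-in (it just accepts alternating value/operator token streams) rather than a real trial LR parse against stack snapshots:

```python
# Illustrative sketch of single-symbol error repair: enumerate every
# insertion, deletion, or substitution of one token at the error point,
# and trial-parse each candidate.
TERMINALS = ["+", "*", "int", "id"]

def repair_candidates(tokens, pos):
    """Yield each token list differing from `tokens` by one edit at `pos`."""
    for t in TERMINALS:                               # insertions
        yield tokens[:pos] + [t] + tokens[pos:]
    if pos < len(tokens):
        yield tokens[:pos] + tokens[pos + 1:]         # deletion
        for t in TERMINALS:                           # substitutions
            if t != tokens[pos]:
                yield tokens[:pos] + [t] + tokens[pos + 1:]

def try_parse(toks):
    """Toy stand-in for a trial LR parse: accept value (op value)* streams."""
    if len(toks) % 2 == 0:
        return False
    values_ok = all(toks[i] in ("int", "id") for i in range(0, len(toks), 2))
    ops_ok = all(toks[i] in ("+", "*") for i in range(1, len(toks), 2))
    return values_ok and ops_ok

def best_repair(tokens, pos):
    """Return the first repair candidate that trial-parses, or None."""
    for candidate in repair_candidates(tokens, pos):
        if try_parse(candidate):
            return candidate
    return None

# "int + + int" has an unexpected second "+" at position 2
print(best_repair(["int", "+", "+", "int"], 2))
```

A real generator would rank the candidates (for example, preferring deletions over insertions) rather than take the first success, and would report the chosen repair in the error message.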
These decisions are usually turned into read-only data tables that drive a generic parser loop that is grammar- and state-independent. But there are also other ways to turn those decisions into an active parser.

Some LR parser generators create separate tailored program code for each state, rather than a parse table. These parsers can run several times faster than the generic parser loop in table-driven parsers. The fastest parsers use generated assembler code.

In the [[recursive ascent parser]] variation, the explicit parse stack structure is also replaced by the implicit stack used by subroutine calls. Reductions terminate several levels of subroutine calls, which is clumsy in most languages. So recursive ascent parsers are generally slower, less obvious, and harder to hand-modify than [[recursive descent parser]]s.

Another variation replaces the parse table by pattern-matching rules in non-procedural languages such as [[Prolog]].

'''GLR''' [[Generalized LR parser]]s use LR bottom-up techniques to find all possible parses of input text, not just one correct parse. This is essential for ambiguous grammars such as those used for human languages. The multiple valid parse trees are computed simultaneously, without backtracking. GLR is sometimes helpful for computer languages that are not easily described by a conflict-free LALR(1) grammar.

'''LC''' [[Left corner parser]]s use LR bottom-up techniques for recognizing the left end of alternative grammar rules. When the alternatives have been narrowed down to a single possible rule, the parser then switches to top-down LL(1) techniques for parsing the rest of that rule. LC parsers have smaller parse tables than LALR parsers and better error diagnostics. There are no widely used generators for deterministic LC parsers. Multiple-parse LC parsers are helpful with human languages with very large grammars.
=== Theory ===
LR parsers were invented by [[Donald Knuth]] in 1965 as an efficient generalization of [[simple precedence parser|precedence parser]]s. Knuth proved that LR parsers were the most general-purpose parsers possible that would still be efficient in the worst cases.{{citation needed|date=September 2019}}
: "LR(''k'') grammars can be efficiently parsed with an execution time essentially proportional to the length of the string."<ref>Knuth (1965), p.638</ref>
: For every {{nowrap|''k'' ≥ 1}}, "a language can be generated by an LR(''k'') grammar if and only if it is deterministic [and context-free], if and only if it can be generated by an LR(1) grammar."<ref>Knuth (1965), p.635. Knuth didn't mention the restriction {{nowrap|''k'' ≥ 1}} there, but it is required by the theorems he referred to, viz. on p. 629 and p. 630. Similarly, the restriction to [[context-free language]]s is tacitly understood from the context.</ref>

In other words, if a language was reasonable enough to allow an efficient one-pass parser, it could be described by an LR(''k'') grammar. And that grammar could always be mechanically transformed into an equivalent (but larger) LR(1) grammar. So an LR(1) parsing method was, in theory, powerful enough to handle any reasonable language. In practice, the natural grammars for many programming languages are close to being LR(1).{{citation needed|date=June 2012}}

The canonical LR parsers described by Knuth had too many states and very big parse tables that were impractically large for the limited memory of computers of that era. LR parsing became practical when [[Frank DeRemer]] invented [[Simple LR parser|SLR]] and [[LALR]] parsers with many fewer states.<ref>Practical Translators for LR(''k'') Languages, by Frank DeRemer, MIT PhD dissertation 1969.</ref><ref>Simple LR(''k'') Grammars, by Frank DeRemer, Comm. ACM 14:7 1971.</ref>

For full details on LR theory and how LR parsers are derived from grammars, see ''The Theory of Parsing, Translation, and Compiling, Volume 1'' (Aho and Ullman).<ref name="Compilers 2006"/><ref name="AhoUllman 1972" />

[[Earley parser]]s apply the techniques and <big>{{color|#f7f|•}}</big> notation of LR parsers to the task of generating all possible parses for ambiguous grammars such as those for human languages.

While LR(''k'') grammars have equal generative power for all ''k'' ≥ 1, the case of LR(0) grammars is slightly different. A language ''L'' is said to have the ''prefix property'' if no word in ''L'' is a [[Prefix (formal languages)|proper prefix]] of another word in ''L''.<ref>{{cite book |last1=Hopcroft |first1=John E. |last2=Ullman |first2=Jeffrey D. |date=1979 |title=Introduction to Automata Theory, Languages, and Computation |publisher=Addison-Wesley |isbn=0-201-02988-X |url=https://archive.org/details/introductiontoau00hopc }} Here: Exercise 5.8, p.121.</ref> A language ''L'' has an LR(0) grammar if and only if ''L'' is a [[deterministic context-free language]] with the prefix property.<ref>Hopcroft, Ullman (1979), Theorem 10.12, p.260</ref> As a consequence, a language ''L'' is deterministic context-free if and only if [[Concatenation#Concatenation of sets of strings|''L''$]] has an LR(0) grammar, where "$" is not a symbol of ''L''{{'}}s [[Alphabet (formal languages)|alphabet]].<ref>Hopcroft, Ullman (1979), Corollary p.260</ref>