Editing Canonical LR parser (section)

== Constructing LR(1) parsing tables ==
LR(1) parsing tables are constructed in the same way as [[LR parser#Constructing LR(0) parsing tables|LR(0) parsing tables]] with the modification that each [[LR parser#Items|Item]] contains a lookahead [[Terminal symbol|terminal]]. This means, contrary to LR(0) parsers, a different action may be executed, if the item to process is followed by a different terminal.

=== Parser items ===
Starting from the [[Formal grammar#The syntax of grammars|production rules]] of a language, at first the item sets for this language have to be determined. In plain words, an item set is the list of production rules, which the currently processed symbol might be part of. An item set has a one-to-one correspondence to a parser state, while the items within the set, together with the next symbol, are used to decide which state transitions and parser action are to be applied. Each item contains a marker, to note at which point the currently processed symbol appears in the rule the item represents. For LR(1) parsers, each item is specific to a lookahead terminal, thus the lookahead terminal has also been noted inside each item.

For example, assume a language consisting of the terminal symbols 'n', '+', '(', ')', the nonterminals 'E', 'T', the starting rule 'S' and the following production rules:
: S → E
: E → T
: E → ( E )
: T → n
: T → + T
: T → T + n
Items sets will be generated by analog to the procedure for LR(0) parsers. The item set 0 which represents the initial state will be created from the starting rule:
: [S → • E, $]
The dot '•' denotes the marker of the current parsing position within this rule. The expected lookahead terminal to apply this rule is noted after the comma. The '$' sign is used to denote 'end of input' is expected, as is the case for the starting rule.

This is not the complete item set 0, though. Each item set must be 'closed', which means all production rules for each nonterminal following a '•' have to be recursively included into the item set until all of those nonterminals are dealt with. The resulting item set is called the closure of the item set we began with.

For LR(1) for each production rule an item has to be included for each possible lookahead terminal following the rule. For more complex languages this usually results in very large item sets, which is the reason for the large memory requirements of LR(1) parsers.

In our example, the starting symbol requires the nonterminal 'E' which in turn requires 'T', thus all production rules will appear in item set 0. At first, we ignore the problem of finding the lookaheads and just look at the case of an LR(0), whose items do not contain lookahead terminals. So the item set 0 (without lookaheads) will look like this:
: [S → • E]
: [E → • T]
: [E → • ( E )]
: [T → • n]
: [T → • + T]
: [T → • T + n]

=== FIRST and FOLLOW sets ===
To determine lookahead terminals, so-called FIRST and FOLLOW sets are used.
FIRST(A) is the set of terminals which can appear as the first element of any chain of rules  matching nonterminal A. FOLLOW(I) of an Item I [A → α • B β, x] is the set of terminals that can appear immediately after [[nonterminal]] B, where α, β are  arbitrary symbol strings, and x is an arbitrary lookahead terminal. FOLLOW(k,B) of an item set k and a nonterminal B is the union of the follow sets of all items in k where '•' is followed by B. The FIRST sets can be determined directly from the closures of all nonterminals in the language, while the FOLLOW sets are determined from the items under usage of the FIRST sets.

In our example, as one can verify from the full list of item sets below, the first sets are:
: FIRST(S) = { n, '+', '(' }
: FIRST(E) = { n, '+', '(' }
: FIRST(T) = { n, '+' }

=== Determining lookahead terminals ===
Within item set 0 the follow sets can be found to be:
: FOLLOW(0,S) = { $ }
: FOLLOW(0,E) = { $, ')'}
: FOLLOW(0,T) = { $, '+', ')' }
From this the full item set 0 for an LR(1) parser can be created, by creating for each item in the LR(0) item set one copy for each terminal in the follow set of the LHS nonterminal. Each element of the follow set may be a valid lookahead terminal:
: [S → • E, $]
: [E → • T, $]
: [E → • T, )]
: [E → • ( E ), $]
: [E → • ( E ), )]
: [T → • n, $]
: [T → • n, +]
: [T → • n, )]
: [T → • + T, $]
: [T → • + T, +]
: [T → • + T, )]
: [T → • T + n, $]
: [T → • T + n, +]
: [T → • T + n, )]

=== Creating new item sets ===
The rest of the item sets can be created by the following algorithm
: 1. For each terminal and nonterminal symbol A appearing after a '•' in each already existing item set k, create a new item set m by adding to m all the rules of k where '•' is followed by A, but only if m will not be the same as an already existing item set after step 3.
: 2. shift all the '•'s for each rule in the new item set one symbol to the right
: 3. create the closure of the new item set
: 4. Repeat from step 1 for all newly created item sets, until no more new sets appear

In the example we get 5 more sets from item set 0, item set 1 for nonterminal E, item set 2 for nonterminal T, item set 3 for terminal n, item set 4 for terminal '+' and item set 5 for '('.

Item set 1 (E):

: [S → E •, $]

Item set 2 (T):

: [E → T •, $]
: [T → T • + n, $]
: [T → T • + n, +]
: ·
: ·
: ·

Item set 3 (n):

: [T → n •, $]
: [T → n •, +]
: [T → n •, )]

Item set 4 ('+'):

: [T → + • T, $]
: [T → + • T, +]
: [T → • n, $]
: [T → • n, +]
: [T → • + T, $]
: [T → • + T, +]
: [T → • T + n, $]
: [T → • T + n, +]
: ·
: ·
: ·

Item set 5 ('('):

: [E → ( • E ), $]
: [E → • T, )]
: [E → • ( E ), )]
: [T → • n, )]
: [T → • n, +]
: [T → • + T, )]
: [T → • + T, +]
: [T → • T + n, )]
: [T → • T + n, +]
: ·
: ·
: ·

From item sets 2, 4 and 5 several more item sets will be produced. The complete list is quite long and thus will not be stated here. Detailed LR(k) treatment of this grammar can e.g. be found in [http://david.tribble.com/text/lrk_parsing.html].

=== Goto ===
The lookahead of an LR(1) item is used directly only when considering reduce actions (i.e., when the • marker is at the right end).

The '''core''' of an LR(1) item [S → a A • B e, c] is the LR(0) item S → a A • B e. Different LR(1) items may share the same core.

For example, in item set 2

: [E → T •, $]
: [T → T • + n, $]
: [T → T • + n, +]

the parser is required to perform the reduction [E → T] if the next symbol is '$', but to do a shift if the next symbol is '+'. Note that a LR(0) parser would not be able to make this decision, as it only considers the core of the items, and would thus report a shift/reduce conflict.

A state containing [A → α • X β, a] will move to a state containing [A → α X • β, a] with label X.

Every state has transitions according to Goto.

=== Shift actions ===
If [A → α • b β, a] is in state I<sub>k</sub> and I<sub>k</sub> moves to state I<sub>m</sub> with label b, then we add the action
: action[I<sub>k</sub>, b] = "shift m"

=== Reduce actions ===
If [A→α •, a] is in state I<sub>k</sub>, then we add the action
: action[I<sub>k</sub>, a] = "reduce A → α"