Editing Earley parser (section)

== Constructing the parse forest ==
Earley's dissertation<ref name=Earley3>{{cite book
 | last=Earley
 | first=Jay
 | title=An Efficient Context-Free Parsing Algorithm
 | year=1968
 | publisher=Carnegie-Mellon Dissertation
 | page=106
 | url=http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf
 | access-date=2012-09-12
 | archive-date=2017-09-22
 | archive-url=https://web.archive.org/web/20170922004954/http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf
 | url-status=dead
 }}</ref> briefly describes an algorithm for constructing parse trees by adding a set of pointers from each non-terminal in an Earley item back to the items that caused it to be recognized.  But [[Masaru Tomita|Tomita]] noticed<ref>{{cite book|last1=Tomita|first1=Masaru|title=Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems|date=April 17, 2013|publisher=Springer Science and Business Media|isbn=978-1475718850|page=74|url=https://books.google.com/books?id=DAjkBwAAQBAJ&q=Tomita%20Efficient%20Parsing%20for%20natural%20Language&pg=PA74|access-date=16 September 2015}}</ref> that this does not take into account the relations between symbols, so if we consider the grammar S → SS | b and the string bbb, it only notes that each S can match one or two b's, and thus produces spurious derivations for bb and bbbb as well as the two correct derivations for bbb.

Another method<ref>{{cite journal|last1=Scott|first1=Elizabeth|title=SPPF-Style Parsing From Earley Recognizers|journal=Electronic Notes in Theoretical Computer Science|date=April 1, 2008|volume=203|issue=2|pages=53–67|doi=10.1016/j.entcs.2008.03.044|doi-access=free}}</ref> is to build the parse forest as you go, augmenting each Earley item with a pointer to a shared packed parse forest (SPPF) node labelled with a triple (s, i, j) where s is a symbol or an LR(0) item (production rule with dot), and i and j give the section of the input string derived by this node. A node's contents are either a pair of child pointers giving a single derivation, or a list of "packed" nodes each containing a pair of pointers and representing one derivation.  SPPF nodes are unique (there is only one with a given label), but may contain more than one derivation for [[syntactic ambiguity|ambiguous]] parses.  So even if an operation does not add an Earley item (because it already exists), it may still add a derivation to the item's parse forest.

* Predicted items have a null SPPF pointer.
* The scanner creates an SPPF node representing the non-terminal it is scanning.
* Then when the scanner or completer advance an item, they add a derivation whose children are the node from the item whose dot was advanced, and the one for the new symbol that was advanced over (the non-terminal or completed item).

SPPF nodes are never labeled with a completed LR(0) item: instead they are labelled with the symbol that is produced so that all derivations are combined under one node regardless of which alternative production they come from.