Packrat parser

Template:Short description Template:Infobox algorithm

The Packrat parser is a type of parser that shares similarities with the recursive descent parser in its construction. However, it differs because it takes parsing expression grammars (PEGs) as input rather than LL grammars.<ref name=":3">Template:Cite arXiv</ref>

In 1970, Alexander Birman laid the groundwork for packrat parsing by introducing the "TMG recognition scheme" (TS), and "generalized TS" (gTS). TS was based upon Robert M. McClure's TMG compiler-compiler, and gTS was based upon Dewey Val Schorre's META compiler-compiler. Birman's work was later refined by Aho and Ullman; and renamed as Top-Down Parsing Language (TDPL), and Generalized TDPL (GTDPL), respectively. These algorithms were the first of their kind to employ deterministic top-down parsing with backtracking.<ref name=":1">Template:Cite book</ref><ref name=":0">{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

Bryan Ford developed PEGs as an expansion of GTDPL and TS. Unlike CFGs, PEGs are unambiguous and can match well with machine-oriented languages. PEGs, similar to GTDPL and TS, can also express all LL(k) and LR(k). Bryan also introduced Packrat as a parser that uses memoization techniques on top of a simple PEG parser. This was done because PEGs have an unlimited lookahead capability resulting in a parser with exponential time performance in the worst case.<ref name=":1" /><ref name=":0" />

Packrat keeps track of the intermediate results for all mutually recursive parsing functions. Each parsing function is only called once at a specific input position. In some instances of packrat implementation, if there is insufficient memory, certain parsing functions may need to be called multiple times at the same input position, causing the parser to take longer than linear time.<ref>Template:Cite book</ref>

SyntaxEdit

Template:See also

The packrat parser takes in input the same syntax as PEGs: a simple PEG is composed of terminal and nonterminal symbols, possibly interleaved with operators that compose one or several derivation rules.<ref name=":1" />

SymbolsEdit

Nonterminal symbols are indicated with capital letters (e.g., <math>\{S, E, F, D\}</math>)
Terminal symbols are indicated with lowercase (e.g., <math>\{a,b,z,e,g \}</math>)
Expressions are indicated with lower-case Greek letter (e.g., <math>\{\alpha,\beta,\gamma,\omega,\tau\}</math>)
- Expressions can be a mix of terminal symbols, nonterminal symbols and operators

OperatorsEdit

Syntax Rules
Operator	Semantics
Sequence <math>\alpha\beta</math>	Success: If <math>\alpha</math> and <math>\beta</math> are recognized Failure: If <math>\alpha</math> or <math>\beta</math> are not recognized Consumed: <math>\alpha</math> and <math>\beta</math> in case of success
Ordered choice <math>\alpha/\beta/\gamma</math>	Success: If any of <math>\{\alpha,\beta,\gamma\}</math> is recognized starting from the left Failure: All of <math>\{\alpha,\beta,\gamma\}</math> do not match Consumed: The atomic expression that has generated a success so if multiple succeed the first one is always returned
And predicate <math>\&\alpha</math>	Success: If <math>\alpha</math> is recognized Failure: If <math>\alpha</math> is not recognized Consumed: No input is consumed
Not predicate <math>!\alpha</math>	Success: If <math>\alpha</math> is not recognized Failure: If <math>\alpha</math> is recognized Consumed: No input is consumed
One or more <math>\alpha +</math>	Success: Try to recognize <math>\alpha</math> one or multiple time Failure: If <math>\alpha</math> is not recognized Consumed: The maximum number that <math>\alpha </math> is recognized
Zero or more <math>\alpha *</math>	Success: Try to recognize <math>\alpha</math> zero or multiple time Failure: Cannot fail Consumed: The maximum number that <math>\alpha </math> is recognized
Zero or one <math>\alpha ?</math>	Success: Try to recognize <math>\alpha</math> zero or once Failure: Cannot fail Consumed: <math>\alpha</math> if it is recognized
Terminal range [<math>a-b</math>]	Success: Recognize any terminal <math>c</math> that are inside the range <math>[a-b]</math>. In the case of <math> [\textbf{'} h \textbf{'} - \textbf{'} z \textbf{'}] </math>, <math>c</math> can be any letter from h to z Failure: If no terminal inside of <math>[a-b]</math> can be recognized Consumed: <math>c</math> if it is recognized
Any character <math> . </math>	Success: Recognize any character in the input Failure: If no character in the input Consumed: any character in the input

RulesEdit

A derivation rule is composed by a nonterminal symbol and an expression <math>S \rightarrow \alpha</math>.

A special expression <math>\alpha_s</math> is the starting point of the grammar.<ref name=":1" /> In case no <math>\alpha_s</math> is specified, the first expression of the first rule is used.

An input string is considered accepted by the parser if the <math> \alpha_s </math> is recognized. As a side-effect, a string <math> x </math> can be recognized by the parser even if it was not fully consumed.<ref name=":1" />

An extreme case of this rule is that the grammar <math> S \rightarrow x* </math> matches any string.

This can be avoided by rewriting the grammar as <math> S \rightarrow x*!. </math>

ExampleEdit

<math>\begin{cases}

   S \rightarrow A/B/D \\
   A \rightarrow \texttt{'a'}\  S \ \texttt{'a'} \\
   B \rightarrow \texttt{'b'}\  S \ \texttt{'b'} \\
   D \rightarrow (\texttt{'0'}-\texttt{'9'})?

\end{cases}</math>

This grammar recognizes a palindrome over the alphabet <math> \{ a,b \} </math>, with an optional digit in the middle.

Example strings accepted by the grammar include: <math> \texttt{'aa'} </math> and <math> \texttt{'aba3aba'} </math>.

Left recursionEdit

Left recursion happens when a grammar production refers to itself as its left-most element, either directly or indirectly. Since Packrat is a recursive descent parser, it cannot handle left recursion directly.<ref name=":2">Template:Cite book</ref> During the early stages of development, it was found that a production that is left-recursive can be transformed into a right-recursive production.<ref>Template:Cite book</ref> This modification significantly simplifies the task of a Packrat parser. Nonetheless, if there is an indirect left recursion involved, the process of rewriting can be quite complex and challenging. If the time complexity requirements are loosened from linear to superlinear, it is possible to modify the memoization table of a Packrat parser to permit left recursion, without altering the input grammar.<ref name=":2" />

Iterative combinatorEdit

The iterative combinators <math>\alpha +</math> and <math>\alpha *</math> need special attention when used in a Packrat parser: these combinators introduce a secret recursion that does not record intermediate results in the outcome matrix, which can lead to the parser operating with a superlinear behaviour. This problem can be resolved by applying the following transformation:<ref name=":3" />


Original	Translated
<math>S \rightarrow \alpha +</math>	<math>S \rightarrow \alpha S / \alpha </math>
<math>S \rightarrow \alpha *</math>	<math>S \rightarrow \alpha S / \epsilon</math>

With this transformation, the intermediate results can be properly memoized.

Memoization techniqueEdit

Memoization is an optimization technique in computing that aims to speed up programs by storing the results of expensive function calls. This technique essentially works by caching the results so that when the same inputs occur again, the cached result is simply returned, thus avoiding the time-consuming process of re-computing.<ref>Template:Cite journal</ref> When using packrat parsing and memoization, it's noteworthy that the parsing function for each nonterminal is solely based on the input string. It does not depend on any information gathered during the parsing process. Essentially, memoization table entries do not affect or rely on the parser's specific state at any given time.<ref>Template:Cite book</ref>

Packrat parsing stores results in a matrix or similar data structure that allows for quick look-ups and insertions. When a production is encountered, the matrix is checked to see if it has already occurred. If it has, the result is retrieved from the matrix. If not, the production is evaluated, the result is inserted into the matrix, and then returned.<ref name=":4">Template:Cite journal</ref> When evaluating the entire <math>m*n</math> matrix in a tabular approach, it would require <math>\Theta(mn)</math> space.<ref name=":4" /> Here, <math>m</math> represents the number of nonterminals, and <math>n</math> represents the input string size.

In a naïve implementation, the entire table can be derived from the input string starting from the end of the string.

The Packrat parser can be improved to update only the necessary cells in the matrix through a depth-first visit of each subexpression tree. Consequently, using a matrix with dimensions of <math>m*n</math> is often wasteful, as most entries would remain empty.<ref name=":2" /> These cells are linked to the input string, not to the nonterminals of the grammar. This means that increasing the input string size would always increase memory consumption, while the number of parsing rules changes only the worst space complexity.<ref name=":3" />

Cut operatorEdit

Another operator called cut has been introduced to Packrat to reduce its average space complexity even further. This operator utilizes the formal structures of many programming languages to eliminate impossible derivations. For instance, control statements parsing in a standard programming language is mutually exclusive from the first recognized token, e.g.,<math>\{\mathtt{if, do, while, switch}\}</math>.<ref name=":5">Template:Cite book</ref>

Operator	Semantics
Cut <math>\begin{array}{l} \alpha \uparrow \beta / \gamma \\ (\alpha \uparrow \beta)* \end{array} </math>	if <math>\alpha </math> is recognized but <math>\beta </math> is not, skip the evaluation of the alternative. In the first case don't evaluate <math>\gamma </math> if <math>\alpha </math> was recognized The second rule is can be rewritten as <math>N \rightarrow \alpha \uparrow \beta N / \epsilon </math> and the same rules can be applied.

When a Packrat parser uses cut operators, it effectively clears its backtracking stack. This is because a cut operator reduces the number of possible alternatives in an ordered choice. By adding cut operators in the right places in a grammar's definition, the resulting Packrat parser only needs a nearly constant amount of space for memoization.<ref name=":5" />

The algorithmEdit

Sketch of an implementation of a Packrat algorithm in a Lua-like pseudocode.<ref name=":2" /> <syntaxhighlight lang="lua" line highlight="1,3,11" style="font-size:95%"> INPUT(n) -- return the character at position n

RULE(R : Rule, P : Position )

   entry = GET_MEMO(R,P) -- return the number of elements previously matched in rule R at position P
   if entry == nil then
       return EVAL(R, P);
   end
   return entry;

EVAL(R : Rule, P : Position )

   start = P;   
   for choice in R.choices -- Return a list of choice
       acc=0;
       for symbol in choice then -- Return each element of a rule, terminal and nonterminal
           if symbol.is_terminal then
               if INPUT(start+acc) == symbol.terminal then
                   acc = acc + 1; --Found correct terminal skip pass it
               else
                   break;                
               end
           else 
               res = RULE(symbol.nonterminal , start+acc ); -- try to recognize a nonterminal in position start+acc
               SET_MEMO(symbol.nonterminal , start+acc, res ); -- we memoize also the failure with special value fail
               if res == fail then  
                   break; 
               end
               acc = acc + res;
           end
           if symbol == choice.last -- check if we have matched the last symbol in a choice if so return
               return acc;
       end
   end
   return fail; --if no choice match return fail

</syntaxhighlight>

ExampleEdit

Given the following context, a free grammar that recognizes simple arithmetic expressions composed of single digits interleaved by sum, multiplication, and parenthesis.

<math>\begin{cases}

   S \rightarrow A \\
   A \rightarrow M\  \texttt{'+'}\  A \ / \  M \\
   M \rightarrow P\  \texttt{'*'}\  M \ / \ P \\
   P \rightarrow \texttt{'('}\ A\ \texttt{')'}\  / \ D \\
   D \rightarrow (\texttt{'0'}-\texttt{'9'})

\end{cases}</math>

Denoted with ⊣ the line terminator we can apply the packrat algorithm

Derivation of Template:Kbd

Syntax tree

Action

Packrat Table

File:Derivation of a context free grammar with packrat.svg

Derivation Rules	Input shifted
<math> \begin{array}{l} S \rightarrow A \\ A \rightarrow M\ \texttt{'+'}\ A \\ M \rightarrow P\ \texttt{'*'}\ M \\ P \rightarrow \texttt{'('}\ A\ \texttt{')'} \end{array} </math>	ɛ
Notes	Input left
Input doesn't match the first element in the derivation. Backtrack to the first grammar rule with unexplored alternative <math display="inline"> P \rightarrow \texttt{'('}\ A\ \texttt{')'}\ / \ \underline{D} </math>	Template:Nowrap


	Index
	1	2	3	4	5	6	7
S
A
M
P
D
	2	*	(	3	+	4	)

No update because no terminal was recognized

File:Second step in Parsing a CFG with packrat.svg

Derivation Rules	Input shifted
<math> P \rightarrow D </math> <math> D \rightarrow 2 </math>	Template:Kbd
Notes	Input left
Shift input by one after deriving terminal Template:Mono	Template:Kbd

	Index
	1	2	3	4	5	6	7
S
A
M
P	1
D	1
	2	*	(	3	+	4	)

Update:

D(1) = 1;

P(1) = 1;

File:Third step of recognizing CFG with packrat.svg

Derivation Rules	Input shifted
<math> M \rightarrow P\ \texttt{'*'}\ M </math> <math>P \rightarrow \texttt{'('}\ A\ \texttt{')'}</math>	Template:Kbd
Notes	Input left
Shift input by two terminal <math> \{\texttt{*}, \texttt{(}\} </math>	Template:Kbd

	Index
	1	2	3	4	5	6	7
S
A
M
P	1
D	1
	2	*	(	3	+	4	)

No update because no nonterminal was fully recognized

File:Fourth step in recognizing CFG grammar with Packrat.svg

Derivation Rules	Input shifted
<math>A \rightarrow M\ \texttt{'+'}\ A</math> <math>M \rightarrow P\ \texttt{'*'}\ M</math> <math>P \rightarrow \texttt{'('}\ A\ \texttt{')'}</math>	Template:Kbd
Notes	Input left
Input doesn't match the first element in the derivation. Backtrack to the first grammar rule with unexplored alternative <math display="inline"> P \rightarrow \texttt{'('}\ A\ \texttt{')'}\ / \ \underline{D} </math>	Template:Kbd

	Index
	1	2	3	4	5	6	7
S
A
M
P	1
D	1
	2	*	(	3	+	4	)

No update because no terminal was recognized

File:5th step of recognizing CFG with packrat.svg

Derivation Rules	Input shifted
<math>P \rightarrow D </math> <math>D \rightarrow 3</math>	Template:Kbd
Notes	Input left
Shift input by one after deriving terminal Template:Mono but the new input will not match inside <math>M \rightarrow P\ \texttt{''}\ M</math> so an unroll is necessary to <math> M \rightarrow P\ \texttt{''}\ M \ / \ \underline P </math>	Template:Kbd

	Index
	1	2	3	4	5	6	7
S
A
M
P	1			1
D	1			1
	2	*	(	3	+	4	)

Update:

D(4) = 1;

P(4) = 1;

File:Sixth step of recognizing CFG with packrat.svg

Derivation Rules	Input shifted
<math>M \rightarrow P </math>	Template:Nowrap
Notes	Input left
Roll Back to <math> M \rightarrow P\ \texttt{'*'}\ M \ / \ \underline P </math> And we don't expand it has we have an hit in the memoization table P(4) ≠ 0 so shift the input by P(4). Shift also the <math>+</math> from <math>A \rightarrow M\ \texttt{'+'}\ A</math>	Template:Kbd

	Index
	1	2	3	4	5	6	7
S
A
M				1
P	1			1
D	1			1
	2	*	(	3	+	4	)

Hit on P(4)

Update M(4) = 1 as M was recognized

File:Seventh step of recognizing CFG with packrat.svg

Derivation Rules	Input shifted
<math>A \rightarrow M\ \texttt{'+'}\ A</math> <math>M \rightarrow P\ \texttt{'*'}\ M</math> <math>P \rightarrow \texttt{'('}\ A\ \texttt{')'}</math>	Template:Nowrap
Notes	Input left
Input doesn't match the first element in the derivation. Backtrack to the first grammar rule with unexplored alternative <math display="inline"> P \rightarrow \texttt{'('}\ A\ \texttt{')'}\ / \ \underline{D} </math>	Template:Kbd

	Index
	1	2	3	4	5	6	7
S
A
M				1
P	1			1
D	1			1
	2	*	(	3	+	4	)

No update because no terminal was recognized

File:Eighth step of recognizing CFG with packrat.svg

Derivation Rules	Input shifted
<math>P \rightarrow D </math> <math>D \rightarrow 4</math>	Template:Kbd
Notes	Input left
Shift input by one after deriving terminal Template:Mono but the new input will not match inside <math>M \rightarrow P\ \texttt{'*'}\ M</math> so an unroll is necessary	Template:Kbd

	Index
	1	2	3	4	5	6	7
S
A
M				1
P	1			1		1
D	1			1		1
	2	*	(	3	+	4	)

Update:

D(6) = 1;

P(6) = 1;

File:Ninth step of recognizing CFG with packrat.svg

Derivation Rules	Input shifted
<math>M \rightarrow P </math>	Template:Nowrap
Notes	Input left
Roll Back to <math> M \rightarrow P\ \texttt{'*'}\ M \ / \ \underline P </math> And we don't expand it has we have an hit in the memoization table P(6) ≠ 0 so shift the input by P(6). but the new input will not match <math>+</math> inside <math>A \rightarrow M\ \texttt{'+'}\ A </math> so an unroll is necessary	Template:Kbd

	Index
	1	2	3	4	5	6	7
S
A
M				1		1
P	1			1		1
D	1			1		1
	2	*	(	3	+	4	)

Hit on P(6)

Update M(6) = 1 as M was recognized

File:Tenth step of recognizing CFG with packrat.svg

Derivation Rules	Input shifted
<math>A \rightarrow M </math>	Template:Nowrap
Notes	Input left
Roll Back to <math>A \rightarrow M\ \texttt{'+'}\ A \ / \ \underline{M} </math> And we don't expand it has we have an hit in the memoization table M(6) ≠ 0 so shift the input by M(6). Also shift <math>)</math> from <math>P \rightarrow \texttt{'('}\ A\ \texttt{')'} </math>	Template:Kbd

	Index
	1	2	3	4	5	6	7
S
A				3
M				1		1
P	1		5	1		1
D	1			1		1
	2	*	(	3	+	4	)

Hit on M(6)

Update A(4) = 3 as A was recognized

Update P(3)=5 as P was recognized

File:Eleventh step of recognizing CFG with packrat.svg

Derivation Rules	Input shifted
	Template:Kbd
Notes	Input left
Roll Back to <math> M \rightarrow P\ \texttt{''}\ M \ / \ \underline P </math> as terminal <math> \neq \dashv</math>	Template:Nowrap

	Index
	1	2	3	4	5	6	7
S
A				3
M				1		1
P	1		5	1		1
D	1			1		1
	2	*	(	3	+	4	)

No update because no terminal was recognized

File:Twelfth step of recognizing CFG with packrat.svg

Derivation Rules	Input shifted
<math>M \rightarrow P </math>	Template:Nowrap
Notes	Input left
we don't expand it as we have a hit in the memoization table P(3) ≠ 0, so shift the input by P(3).	Template:Kbd

	Index
	1	2	3	4	5	6	7
S
A				3
M	7			1		1
P	1		5	1		1
D	1			1		1
	2	*	(	3	+	4	)

Hit on P(3)

Update M(1)=7 as M was recognized

File:13th step of recognizing CFG with packrat.svg

Derivation Rules	Input shifted

Notes	Input left
Roll Back to <math>A \rightarrow M\ \texttt{'+'}\ A \ / \ \underline{M}</math> as terminal <math>+ \neq \dashv</math>	Template:Nowrap

	Index
	1	2	3	4	5	6	7
S
A				3
M	7			1		1
P	1		5	1		1
D	1			1		1
	2	*	(	3	+	4	)

No update because no terminal was recognized

File:14th step of recognizing CFG with packrat.svg

Derivation Rules	Input shifted
<math>A \rightarrow M</math>	Template:Nowrap
Notes	Input left
We don't expand it as we have a hit in the memoization table M(1) ≠ 0, so shift the input by M(1). S was totally reduced, so the input string is recognized.

	Index
	1	2	3	4	5	6	7
S	7
A	7			3
M	7			1		1
P	1		5	1		1
D	1			1		1
	2	*	(	3	+	4	)

Hit on M(1)

Update A(1)=7 as A was recognized

Update S(1)=7 as S was recognized

ImplementationEdit

Template:See also

Name	Parsing algorithm	Output languages	Grammar, code	Development platform	License
AustenX	Packrat (modified)	Java	Template:D-P	Template:Yes	Template:Free, BSD
Aurochs	Packrat	C, OCaml, Java	Template:D-A	Template:Yes	Template:Free, GNU GPL
Canopy	Packrat	Java, JavaScript, Python, Ruby	Template:D-P	Template:Yes	Template:Free, GNU GPL
CL-peg	Packrat	Common Lisp	Template:D-A	Template:Yes	Template:Free, MIT
Drat!	Packrat	D	Template:D-A	Template:Yes	Template:Free, GNU GPL
Frisby	Packrat	Haskell	Template:D-A	Template:Yes	Template:Free, BSD
grammar::peg	Packrat	Tcl	Template:D-A	Template:Yes	Template:Free, BSD
IronMeta	Packrat	C#	Template:D-A	Template:Some	Template:Free, BSD
PEGParser	Packrat (supporting left-recursion and grammar ambiguity)	C++	Identical	Template:Yes	Template:Free, BSD
Narwhal	Packrat	C	Template:D-A	Template:Some	Template:Free, BSD
neotoma	Packrat	Erlang	Template:D-P	Template:Yes	Template:Free, MIT
OMeta	Packrat (modified, partial memoization)	JavaScript, Squeak, Python	Template:D-A	Template:Yes	Template:Free, MIT
PackCC	Packrat (modified, left-recursion support)	C	Template:D-A	Template:Yes	Template:Free, MIT
Packrat	Packrat	Scheme	Template:D-A	Template:Yes	Template:Free, MIT
Pappy	Packrat	Haskell	Template:D-A	Template:Yes	Template:Free, BSD
Parsnip	Packrat	C++	Template:D-A	Template:Some	Template:Free, GNU GPL
PEG.js	Packrat (partial memoization)	JavaScript	Template:D-A	Template:Yes	Template:Free, MIT
Peggy<ref>Maintained fork of PEG.js</ref>	Packrat (partial memoization)	JavaScript	Template:D-A	Template:Yes	Template:Free, MIT
Pegasus	Recursive descent, Packrat (selectively)	C#	Template:D-A	Template:Some	Template:Free, MIT
PetitParser	Packrat	Smalltalk, Java, Dart	Template:D-A	Template:Yes	Template:Free, MIT
PyPy rlib	Packrat	Python	Template:D-A	Template:Yes	Template:Free, MIT
Rats!	Packrat	Java	Template:D-A	Template:Some	Template:Free, GNU LGPL
go-packrat	Packrat	Go	Identical	All	Template:Free, GPLv3

ReferencesEdit

Template:Reflist

External linksEdit

Template:Parsers

Packrat parser

Contents

SyntaxEdit

SymbolsEdit

OperatorsEdit

RulesEdit

ExampleEdit

Left recursionEdit

Iterative combinatorEdit

Memoization techniqueEdit

Cut operatorEdit

The algorithmEdit

ExampleEdit

ImplementationEdit

See alsoEdit

ReferencesEdit

External linksEdit