Editing Parsing expression grammar (section)

== Definition ==
{{No footnotes|section|date=December 2022}}
A '''parsing expression''' is a kind of pattern that each string may either '''match''' or '''not match'''. In case of a match, there is a unique prefix of the string (which may be the whole string, the empty string, or something in between) which has been ''consumed'' by the parsing expression; this prefix is what one would usually think of as having matched the expression. However, whether a string matches a parsing expression ''may'' (because of look-ahead predicates) depend on parts of it which come after the consumed part. A '''parsing expression language''' is a set of all strings that match some specific parsing expression.<ref name="For04" />{{rp|Sec.3.4}}

A '''parsing expression grammar''' is a collection of named parsing expressions, which may reference each other. The effect of one such reference in a parsing expression is as if the whole referenced parsing expression was given in place of the reference. A parsing expression grammar also has a designated '''starting expression'''; a string matches the grammar if it matches its starting expression.

An element of a string matched is called a ''[[terminal symbol]]'', or '''terminal''' for short. Likewise the names assigned to parsing expressions are called ''[[nonterminal symbol]]s'', or '''nonterminals''' for short. These terms would be descriptive for [[Chomsky hierarchy|generative grammars]], but in the case of parsing expression grammars they are merely terminology, kept mostly because of being near ubiquitous in discussions of [[parsing]] algorithms.

=== Syntax ===
Both ''abstract'' and ''concrete'' syntaxes of parsing expressions are seen in the literature, and in this article. The abstract syntax is essentially a [[expression (mathematics)|mathematical formula]] and primarily used in theoretical contexts, whereas concrete syntax parsing expressions could be used directly to control a [[parser]]. The primary concrete syntax is that defined by Ford,<ref name="For04"/>{{rp|Fig.1}} although many tools have their own dialect of this. Other tools<ref>{{cite web |last1=Sirthias |first1=Mathias |title=Parboiled: Rule Construction in Java |website=[[GitHub]] |url=https://github.com/sirthias/parboiled/wiki/Rule-Construction-in-Java |access-date=13 January 2024}}</ref> can be closer to using a programming-language native encoding of abstract syntax parsing expressions as their concrete syntax.

==== Atomic parsing expressions ====
The two main kinds of parsing expressions not containing another parsing expression are individual terminal symbols and nonterminal symbols. In concrete syntax, terminals are placed inside quotes (single or double), whereas identifiers not in quotes denote nonterminals:
<syntaxhighlight lang="peg">
  "terminal" Nonterminal 'another terminal'
</syntaxhighlight>
In the abstract syntax there is no formalised distinction, instead each symbol is supposedly defined as either terminal or nonterminal, but a common convention is to use upper case for nonterminals and lower case for terminals.

The concrete syntax also has a number of forms for classes of terminals:
* A <code>.</code> (period) is a parsing expression matching any single terminal.
* Brackets around a list of characters <code>[abcde]</code> form a parsing expression matching one of the numerated characters. As in [[regular expression]]s, these classes may also include ranges <code>[0-9A-Za-z]</code> written as a hyphen with the range endpoints before and after it. (Unlike the case in regular expressions, bracket character classes do not have <code>^</code> for negation; that end can instead be had via not-predicates.)
* Some dialects have further notation for predefined classes of characters, such as letters, digits, punctuation marks, or spaces; this is again similar to the situation in regular expressions.
In abstract syntax, such forms are usually formalised as nonterminals whose exact definition is elided for brevity; in Unicode, there are tens of thousands of characters that are letters. Conversely, theoretical discussions sometimes introduce atomic abstract syntax for concepts that can alternatively be expressed using composite parsing expressions. Examples of this include:
* the empty string ε (as a parsing expression, it matches every string and consumes no characters),
* end of input ''E'' (concrete syntax equivalent is <code>!.</code>), and
* failure <math>\bot</math> (matches nothing).

In the concrete syntax, quoted and bracketed terminals have backslash escapes, so that "[[line feed]] or [[carriage return]]" may be written <code>[\n\r]</code>. The abstract syntax counterpart of a quoted terminal of length greater than one would be the sequence of those terminals; <code>"bar"</code> is the same as <code>"b" "a" "r"</code>. The primary concrete syntax assigns no distinct meaning to terminals depending on whether they use single or double quotes, but some dialects treat one as case-sensitive and the other as case-insensitive.

==== Composite parsing expressions ====
Given any existing parsing expressions ''e'', ''e''<sub>1</sub>, and ''e''<sub>2</sub>, a new parsing expression can be constructed using the following operators:
* ''Sequence'': ''e''<sub>1</sub> ''e''<sub>2</sub>
* ''Ordered choice'': ''e''<sub>1</sub> / ''e''<sub>2</sub>
* ''Zero-or-more'': ''e''*
* ''One-or-more'': ''e''+
* ''Optional'': ''e''?
* ''And-predicate'': &''e''
* ''Not-predicate'': !''e''
* ''Group'': (''e'')
Operator priorities are as follows, based on Table 1 in:<ref name="For04" />
   {| class="wikitable"
! Operator !! Priority
|-
| (''e'') || 5
|-
| ''e''*, ''e''+, ''e''? || 4
|-
| &''e'', !''e'' || 3
|-
| ''e''<sub>1</sub> ''e''<sub>2</sub> || 2
|-
| ''e''<sub>1</sub> / ''e''<sub>2</sub> || 1
|}

==== Grammars ====
In the concrete syntax, a parsing expression grammar is simply a sequence of nonterminal definitions, each of which has the form
<syntaxhighlight lang="peg">
  Identifier LEFTARROW Expression
</syntaxhighlight>
The <code>Identifier</code> is the nonterminal being defined, and the <code>Expression</code> is the parsing expression it is defined as referencing. The <code>LEFTARROW</code> varies a bit between dialects, but is generally some left-pointing arrow or assignment symbol, such as <code><-</code>, <code>←</code>, <code>:=</code>, or <code>=</code>. One way to understand it is precisely as making an assignment or definition of the nonterminal. Another way to understand it is as a contrast to the right-pointing arrow → used in the rules of a [[context-free grammar]]; with parsing expressions the flow of information goes from expression to nonterminal, not nonterminal to expression.

As a mathematical object, a parsing expression grammar is a tuple <math>(N,\Sigma,P,e_S)</math>, where <math>N</math> is the set of nonterminal symbols, <math>\Sigma</math> is the set of terminal symbols, <math>P</math> is a [[Function (mathematics)|function]] from <math>N</math> to the set of parsing expressions on <math>N \cup \Sigma</math>, and <math>e_S</math> is the starting parsing expression. Some concrete syntax dialects give the starting expression explicitly,<ref name="ptKupries">{{cite web |last1=Kupries |first1=Andreas |title=pt::peg_language - PEG Language Tutorial |url=https://core.tcl-lang.org/tcllib/doc/tcllib-1-21/embedded/md/tcllib/files/modules/pt/pt_peg_language.md |website=Tcl Library Source Code |access-date=14 January 2024}}</ref> but the primary concrete syntax instead has the implicit rule that the first nonterminal defined is the starting expression.

It is worth noticing that the primary dialect of concrete syntax parsing expression grammars does not have an explicit definition terminator or separator between definitions, although it is customary to begin a new definition on a new line; the <code>LEFTARROW</code> of the next definition is sufficient for finding the boundary, if one adds the constraint that a nonterminal in an <code>Expression</code> must not be followed by a <code>LEFTARROW</code>. However, some dialects may allow an explicit terminator, or outright require<ref name="ptKupries"/> it.

=== Example ===
This is a PEG that recognizes mathematical formulas that apply the basic five operations to non-negative integers.
<syntaxhighlight lang="peg">
Expr    ← Sum
Sum     ← Product (('+' / '-') Product)*
Product ← Power (('*' / '/') Power)*
Power   ← Value ('^' Power)?
Value   ← [0-9]+ / '(' Expr ')'
</syntaxhighlight>
In the above example, the terminal symbols are characters of text, represented by characters in single quotes, such as <code>'('</code> and <code>')'</code>. The range <code>[0-9]</code> is a shortcut for the ten characters from <code>'0'</code> to <code>'9'</code>. (This range syntax is the same as the syntax used by [[regular expression]]s.) The nonterminal symbols are the ones that expand to other rules: ''Value'', ''Power'', ''Product'', ''Sum'', and ''Expr''. Note that rules ''Sum'' and ''Product'' don't lead to desired left-associativity of these operations (they don't deal with associativity at all, and it has to be handled in post-processing step after parsing), and the ''Power'' rule (by referring to itself on the right) results in desired right-associativity of exponent. Also note that a rule like {{code|Sum ← Sum (('+' / '-') Product)?|peg}} (with intention to achieve left-associativity) would cause infinite recursion, so it cannot be used in practice even though it can be expressed in the grammar.

=== Semantics ===

The fundamental difference between [[context-free grammars]] and parsing expression grammars is that the PEG's choice operator is ''ordered''. If the first alternative succeeds, the second alternative is ignored. Thus ordered choice is not [[commutativity|commutative]], unlike unordered choice as in context-free grammars. Ordered choice is analogous to [[cut (logic programming)|soft cut]] operators available in some [[logic programming]] languages.

The consequence is that if a CFG is transliterated directly to a PEG, any ambiguity in the former is resolved by deterministically picking one parse tree from the possible parses. By carefully choosing the order in which the grammar alternatives are specified, a programmer has a great deal of control over which parse tree is selected.

Parsing expression grammars also add the and- and not- [[syntactic predicate]]s. Because they can use an arbitrarily complex sub-expression to "look ahead" into the input string without actually consuming it, they provide a powerful syntactic [[Parsing#Lookahead|lookahead]] and disambiguation facility, in particular when reordering the alternatives cannot specify the exact parse tree desired.

=== Operational interpretation of parsing expressions ===
Each nonterminal in a parsing expression grammar essentially represents a parsing [[function (mathematics)|function]] in a [[recursive descent parser]], and the corresponding parsing expression represents the "code" comprising the function. Each parsing function conceptually takes an input string as its argument, and yields one of the following results:
* ''success'', in which the function may optionally move forward or ''consume'' one or more characters of the input string supplied to it, or
* ''failure'', in which case no input is consumed.

An atomic parsing expression consisting of a single '''terminal''' (i.e. literal) succeeds if the first character of the input string matches that terminal, and in that case consumes the input character; otherwise the expression yields a failure result. An atomic parsing expression consisting of the '''empty''' string always trivially succeeds without consuming any input.
 
An atomic parsing expression consisting of a '''nonterminal''' ''A'' represents a [[recursion|recursive]] call to the nonterminal-function ''A''. A nonterminal may succeed without actually consuming any input, and this is considered an outcome distinct from failure.

The '''sequence''' operator ''e''<sub>1</sub> ''e''<sub>2</sub> first invokes ''e''<sub>1</sub>, and if ''e''<sub>1</sub> succeeds, subsequently invokes ''e''<sub>2</sub> on the remainder of the input string left unconsumed by ''e''<sub>1</sub>, and returns the result. If either ''e''<sub>1</sub> or ''e''<sub>2</sub> fails, then the sequence expression ''e''<sub>1</sub> ''e''<sub>2</sub> fails (consuming no input).

The '''choice''' operator ''e''<sub>1</sub> / ''e''<sub>2</sub> first invokes ''e''<sub>1</sub>, and if ''e''<sub>1</sub> succeeds, returns its result immediately. Otherwise, if ''e''<sub>1</sub> fails, then the choice operator [[backtracking|backtracks]] to the original input position at which it invoked ''e''<sub>1</sub>, but then calls ''e''<sub>2</sub> instead, returning ''e''<sub>2</sub>'s result.

The '''zero-or-more''', '''one-or-more''', and '''optional''' operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression ''e'', respectively. Unlike in [[context-free grammar]]s and [[regular expression]]s, however, these operators ''always'' behave [[greedy algorithm|greedily]], consuming as much input as possible and never backtracking. (Regular expression matchers may start by matching greedily, but will then backtrack and try shorter matches if they fail to match.) For example, the expression a* will always consume as many a's as are consecutively available in the input string, and the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match.

The '''and-predicate''' expression &''e'' invokes the sub-expression ''e'', and then succeeds if ''e'' succeeds and fails if ''e'' fails, but in either case ''never consumes any input''.

The '''not-predicate''' expression !''e'' succeeds if ''e'' fails and fails if ''e'' succeeds, again consuming no input in either case.

=== More examples ===
The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a [[context-free grammar]], this construct yields the classic [[dangling else|dangling else ambiguity]].)
<syntaxhighlight lang="peg">
S ← 'if' C 'then' S 'else' S / 'if' C 'then' S
</syntaxhighlight>

The following recursive rule matches Pascal-style nested comment syntax, {{code|(* which can (* nest *) like this *)}}. Recall that {{code|.|peg}} matches any single character.
<syntaxhighlight lang="peg">
C     ← Begin N* End
Begin ← '(*'
End   ← '*)'
N     ← C / (!Begin !End .)
</syntaxhighlight>

The parsing expression {{code|2=peg|foo &(bar)}} matches and consumes the text "foo" but only if it is followed by the text "bar". The parsing expression {{code|2=peg|foo !(bar)}} matches the text "foo" but only if it is ''not'' followed by the text "bar".  The expression {{code|2=peg|!(a+ b) a}} matches a single "a" but only if it is not part of an arbitrarily long sequence of a's followed by a b.

The parsing expression {{code|2=peg|('a'/'b')*}} matches and consumes an arbitrary-length sequence of a's and b's. The [[Formal grammar#The syntax of grammars|production rule]] {{code|2=peg|S ← 'a' ''S''? 'b'}} describes the simple [[context-free language|context-free]] "matching language" <math> \{ a^n b^n : n \ge 1 \} </math>.
The following parsing expression grammar describes the classic non-context-free language <math> \{ a^n b^n c^n : n \ge 1 \} </math>:
<syntaxhighlight lang="peg">
S ← &(A 'c') 'a'+ B !.
A ← 'a' A? 'b'
B ← 'b' B? 'c'
</syntaxhighlight>