Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Parsing expression grammar
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== Advantages == === No compilation required === Many parsing algorithms require a preprocessing step where the grammar is first compiled into an opaque executable form, often some sort of automaton. Parsing expressions can be executed directly (even if it is typically still advisable to transform the human-readable PEGs shown in this article into a more native format, such as [[Lisp (programming language)#Symbolic_expressions_(S-expressions)|S-expressions]], before evaluating them). === Compared to regular expressions === Compared to pure [[regular expressions]] (i.e., describing a language recognisable using a [[finite automaton]]), PEGs are vastly more powerful. In particular they can handle unbounded recursion, and so match parentheses down to an arbitrary nesting depth; regular expressions can at best keep track of nesting down to some fixed depth, because a finite automaton (having a finite set of internal states) can only distinguish finitely many different nesting depths. In more theoretical terms, <math> \{a^n b^n\}_{n \geqslant 0} </math> (the language of all strings of zero or more <math>a</math>'s, followed by an ''equal number'' of <math>b</math>s) is not a regular language, but it is easily seen to be a parsing expression language, matched by the grammar <syntaxhighlight lang="peg"> start β AB !. AB β ('a' AB 'b')? </syntaxhighlight> Here <code>AB !.</code> is the starting expression. The <code>!.</code> part enforces that the input ends after the <code>AB</code>, by saying βthere is no next characterβ; unlike regular expressions, which have magic constraints <code>$</code> or <code>\Z</code> for this, parsing expressions can express the end of input using only the basic primitives. The <code>*</code>, <code>+</code>, and <code>?</code> of parsing expressions are similar to those in regular expressions, but a difference is that these operate strictly in a greedy mode. This is ultimately due to <code>/</code> being an ordered choice. A consequence is that something can match as a regular expression which does not match as parsing expression: : <code>[ab]?[bc][cd]</code> is both a valid regular expression and a valid parsing expression. As regular expression, it matches <code>bc</code>, but as parsing expression it does not match, because the <code>[ab]?</code> will match the <code>b</code>, then <code>[bc]</code> will match the <code>c</code>, leaving nothing for the <code>[cd]</code>, so at that point matching the sequence fails. "Trying again" with having <code>[ab]?</code> match the empty string is explicitly against the semantics of parsing expressions; this is not an edge case of a particular matching algorithm, instead it is the sought behaviour. Even regular expressions that depend on nondeterminism ''can'' be compiled into a parsing expression grammar, by having a separate nonterminal for every state of the corresponding [[deterministic finite automaton|DFA]] and encoding its transition function into the definitions of these nonterminals β <syntaxhighlight lang="peg"> A β 'x' B / 'y' C </syntaxhighlight> is effectively saying "from state A transition to state B if the next character is x, but to state C if the next character is y" β but this works because nondeterminism can be eliminated within the realm of regular languages. It would not make use of the parsing expression variants of the repetition operations. === Compared to context-free grammars === PEGs can comfortably be given in terms of characters, whereas [[context-free grammar]]s (CFGs) are usually given in terms of tokens, thus requiring an extra step of tokenisation in front of parsing proper.<ref>CFGs ''can'' be used to describe the syntax of common programming languages down to the character level, but doing so is rather cumbersome, because the standard tokenisation rule that a token consists of the longest consecutive sequence of characters of the same kind does not mesh well with the nondeterministic side of CFGs. To formalise that whitespace between two adjacent tokens is mandatory if the characters on both sides of the token boundary are letters, but optional if they are non-letters, a CFG needs multiple variants of most nonterminals, to keep track of what kind of character has to be at the boundary. If there are <math>3</math> different kinds of non-whitespace characters, that adds up to <math>3^2 = 9</math> possible variants per nonterminal β significantly bloating the grammar.</ref> An advantage of not having a separate tokeniser is that different parts of the language (for example embedded [[Domain-specific language|mini-language]]s) can easily have different tokenisation rules. In the strict formal sense, PEGs are likely incomparable to CFGs, but practically there are many things that PEGs can do which pure CFGs cannot, whereas it is difficult to come up with examples of the contrary. In particular PEGs can be crafted to natively resolve ambiguities, such as the "[[dangling else]]" problem in C, C++, and Java, whereas CFG-based parsing often needs a rule outside of the grammar to resolve them. Moreover any PEG can be parsed in linear time by using a packrat parser, as described above, whereas parsing according to a general CFG is asymptotically equivalent<ref>{{cite journal |last1=Lee |first1=Lillian |title=Fast Context-free Grammar Parsing Requires Fast Boolean Matrix Multiplication |journal=J. ACM |date=January 2002 |volume=49 |issue=1 |page=1β15 |doi=10.1145/505241.505242|arxiv=cs/0112018 }}</ref> to [[logical matrix|boolean matrix]] multiplication (thus likely between quadratic and cubic time). One classical example of a formal language which is provably not context-free is the language <math> \{a^n b^n c^n\}_{n \geqslant 0} </math>: an arbitrary number of <math>a</math>s are followed by an equal number of <math>b</math>s, which in turn are followed by an equal number of <math>c</math>s. This, too, is a parsing expression language, matched by the grammar <syntaxhighlight lang="peg"> start β AB 'c'* AB β 'a' AB 'b' / &(BC !.) BC β ('b' BC 'c')? </syntaxhighlight> For <code>AB</code> to match, the first stretch of <math>a</math>s must be followed by an equal number of <math>b</math>s, and in addition <code>BC</code> has to match where the <math>a</math>s switch to <math>b</math>s, which means those <math>b</math>s are followed by an equal number of <math>c</math>s.
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)