Editing Parsing expression grammar (section)

=== Compared to context-free grammars ===

PEGs can comfortably be given in terms of characters, whereas [[context-free grammar]]s (CFGs) are usually given in terms of tokens, thus requiring an extra step of tokenisation in front of parsing proper.<ref>CFGs ''can'' be used to describe the syntax of common programming languages down to the character level, but doing so is rather cumbersome, because the standard tokenisation rule that a token consists of the longest consecutive sequence of characters of the same kind does not mesh well with the nondeterministic side of CFGs. To formalise that whitespace between two adjacent tokens is mandatory if the characters on both sides of the token boundary are letters, but optional if they are non-letters, a CFG needs multiple variants of most nonterminals, to keep track of what kind of character has to be at the boundary. If there are <math>3</math> different kinds of non-whitespace characters, that adds up to <math>3^2 = 9</math> possible variants per nonterminal — significantly bloating the grammar.</ref> An advantage of not having a separate tokeniser is that different parts of the language (for example embedded [[Domain-specific language|mini-language]]s) can easily have different tokenisation rules.

In the strict formal sense, PEGs are likely incomparable to CFGs, but practically there are many things that PEGs can do which pure CFGs cannot, whereas it is difficult to come up with examples of the contrary. In particular PEGs can be crafted to natively resolve ambiguities, such as the "[[dangling else]]" problem in C, C++, and Java, whereas CFG-based parsing often needs a rule outside of the grammar to resolve them. Moreover any PEG can be parsed in linear time by using a packrat parser, as described above, whereas parsing according to a general CFG is asymptotically equivalent<ref>{{cite journal |last1=Lee |first1=Lillian |title=Fast Context-free Grammar Parsing Requires Fast Boolean Matrix Multiplication |journal=J. ACM |date=January 2002 |volume=49 |issue=1 |page=1–15 |doi=10.1145/505241.505242|arxiv=cs/0112018 }}</ref> to [[logical matrix|boolean matrix]] multiplication (thus likely between quadratic and cubic time).

One classical example of a formal language which is provably not context-free is the language <math> \{a^n b^n c^n\}_{n \geqslant 0} </math>: an arbitrary number of <math>a</math>s are followed by an equal number of <math>b</math>s, which in turn are followed by an equal number of <math>c</math>s. This, too, is a parsing expression language, matched by the grammar

<syntaxhighlight lang="peg">
start ← AB 'c'*
AB    ← 'a' AB 'b' / &(BC !.)
BC    ← ('b' BC 'c')?
</syntaxhighlight>

For <code>AB</code> to match, the first stretch of <math>a</math>s must be followed by an equal number of <math>b</math>s, and in addition <code>BC</code> has to match where the <math>a</math>s switch to <math>b</math>s, which means those <math>b</math>s are followed by an equal number of <math>c</math>s.