Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Parsing expression grammar
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== Unexpected behaviour == A common first impression of PEGs is that they look like CFGs with certain convenience features β repetition operators <code>*+?</code> as in regular expressions and lookahead predicates <code>&!</code> β plus ordered choice for disambiguation. This understanding can be sufficient when one's goal is to create a parser for a language, but it is not sufficient for more theoretical discussions of the computational power of parsing expressions. In particular the [[Nondeterministic Turing machine|nondeterminism]] inherent in the unordered choice <code>|</code> of context-free grammars makes it very different from the deterministic ordered choice <code>/</code>. === The midpoint problem === PEG packrat parsers cannot recognize some unambiguous nondeterministic CFG rules, such as the following:<ref name="pegs"/> <syntaxhighlight lang="peg"> S β 'x' S 'x' | 'x' </syntaxhighlight> Neither [[LL(k)]] nor LR(k) parsing algorithms are capable of recognizing this example. However, this grammar can be used by a general CFG parser like the [[CYK algorithm]]. However, the ''language'' in question can be recognised by all these types of parser, since it is in fact a regular language (that of strings of an odd number of x's). It is instructive to work out exactly what a PEG parser does when attempting to match <syntaxhighlight lang="peg"> S β 'x' S 'x' / 'x' </syntaxhighlight> against the string <code>xxxxxq</code>. As expected, it recursively tries to match the nonterminal <code>S</code> at increasing positions in this string, until failing the match against the <code>q</code>, and after that begins to backtrack. This goes as follows: Position: 123456 String: xxxxxq Results: β Pos.6: Neither branch of S matches β Pos.5: First branch of S fails, second branch succeeds, yielding match of length 1. β Pos.4: First branch of S fails, second branch succeeds, yielding match of length 1. β Pos.3: First branch of S succeeds, yielding match of length 3. β Pos.2: First branch of S fails, because after the S match at 3 comes a q. Second branch succeeds, yielding match of length 1. β Pos.1: First branch of S succeeds, yielding match of length 3. Matching against a parsing expression is [[greedy algorithm|greedy]], in the sense that the first success encountered is the only one considered. Even if locally the choices are ordered longest first, there is no guarantee that this greedy matching will find the globally longest match. === Ambiguity detection and influence of rule order on language that is matched === LL(k) and LR(k) parser generators will fail to complete when the input grammar is ambiguous. This is a feature in the common case that the grammar is intended to be unambiguous but is defective. A PEG parser generator will resolve unintended ambiguities earliest-match-first, which may be arbitrary and lead to surprising parses. The ordering of productions in a PEG grammar affects not only the resolution of ambiguity, but also the ''language matched''. For example, consider the first PEG example in Ford's paper<ref name="For04" /> (example rewritten in pegjs.org/online notation, and labelled {{tmath|G_1}} and {{tmath|G_2}}): * {{tmath|G_1}}: {{code|code= A = "a" "b" / "a"}} * {{tmath|G_2}}: {{code|code= A = "a" / "a" "b"}} Ford notes that ''The second alternative in the latter PEG rule will never succeed because the first choice is always taken if the input string ... begins with 'a'.''.<ref name="For04" /> Specifically, {{tmath|L(G_1)}} (i.e., the language matched by {{tmath|G_1}}) includes the input "ab", but {{tmath|L(G_2)}} does not. Thus, adding a new option to a PEG grammar can ''remove'' strings from the language matched, e.g. {{tmath|G_2}} is the addition of a rule to the single-production grammar {{code|code= A = "a" "b"}}, which contains a string not matched by {{tmath|G_2}}. Furthermore, constructing a grammar to match {{tmath|L(G_1) \cup L(G_2)}} from PEG grammars {{tmath|G_1}} and {{tmath|G_2}} is not always a trivial task. This is in stark contrast to CFG's, in which the addition of a new production cannot remove strings (though, it can introduce problems in the form of ambiguity), and a (potentially ambiguous) grammar for {{tmath|L(G_1) \cup L(G_2)}} can be constructed <syntaxhighlight lang="apl"> S β start(G1) | start(G2) </syntaxhighlight>
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)