=== Syntax ===
Both ''abstract'' and ''concrete'' syntaxes of parsing expressions are seen in the literature, and in this article. The abstract syntax is essentially a [[expression (mathematics)|mathematical formula]] and is primarily used in theoretical contexts, whereas concrete syntax parsing expressions can be used directly to control a [[parser]]. The primary concrete syntax is that defined by Ford,<ref name="For04"/>{{rp|Fig.1}} although many tools have their own dialect of it. Other tools<ref>{{cite web |last1=Sirthias |first1=Mathias |title=Parboiled: Rule Construction in Java |website=[[GitHub]] |url=https://github.com/sirthias/parboiled/wiki/Rule-Construction-in-Java |access-date=13 January 2024}}</ref> come closer to using a programming-language-native encoding of abstract syntax parsing expressions as their concrete syntax.

==== Atomic parsing expressions ====
The two main kinds of parsing expressions that do not contain another parsing expression are individual terminal symbols and nonterminal symbols. In concrete syntax, terminals are placed inside quotes (single or double), whereas identifiers not in quotes denote nonterminals:

<syntaxhighlight lang="peg">
"terminal" Nonterminal 'another terminal'
</syntaxhighlight>

In the abstract syntax there is no formalised distinction; instead, each symbol is simply assumed to be defined as either a terminal or a nonterminal, and a common convention is to use upper case for nonterminals and lower case for terminals.

The concrete syntax also has a number of forms for classes of terminals:
* A <code>.</code> (period) is a parsing expression matching any single terminal.
* Brackets around a list of characters <code>[abcde]</code> form a parsing expression matching one of the enumerated characters. As in [[regular expression]]s, these classes may also include ranges <code>[0-9A-Za-z]</code> written as a hyphen with the range endpoints before and after it. (Unlike in regular expressions, bracket character classes do not have <code>^</code> for negation; the same effect can instead be achieved with not-predicates.)
* Some dialects have further notation for predefined classes of characters, such as letters, digits, punctuation marks, or spaces; this is again similar to the situation in regular expressions.
In abstract syntax, such forms are usually formalised as nonterminals whose exact definition is elided for brevity; in Unicode, for instance, there are tens of thousands of characters that are letters. Conversely, theoretical discussions sometimes introduce atomic abstract syntax for concepts that can alternatively be expressed using composite parsing expressions. Examples of this include:
* the empty string ε (as a parsing expression, it matches every string and consumes no characters),
* end of input ''E'' (concrete syntax equivalent is <code>!.</code>), and
* failure <math>\bot</math> (matches nothing).

In the concrete syntax, quoted and bracketed terminals have backslash escapes, so that "[[line feed]] or [[carriage return]]" may be written <code>[\n\r]</code>. The abstract syntax counterpart of a quoted terminal of length greater than one is the sequence of those terminals; <code>"bar"</code> is the same as <code>"b" "a" "r"</code>. The primary concrete syntax assigns no distinct meaning to terminals depending on whether they use single or double quotes, but some dialects treat one as case-sensitive and the other as case-insensitive.

==== Composite parsing expressions ====
Given any existing parsing expressions ''e'', ''e''<sub>1</sub>, and ''e''<sub>2</sub>, a new parsing expression can be constructed using the following operators:
* ''Sequence'': ''e''<sub>1</sub> ''e''<sub>2</sub>
* ''Ordered choice'': ''e''<sub>1</sub> / ''e''<sub>2</sub>
* ''Zero-or-more'': ''e''*
* ''One-or-more'': ''e''+
* ''Optional'': ''e''?
* ''And-predicate'': &''e''
* ''Not-predicate'': !''e''
* ''Group'': (''e'')

Operator priorities, based on Table 1 of Ford,<ref name="For04" /> are as follows (a higher number binds more tightly):
{| class="wikitable"
! Operator !! Priority
|-
| (''e'') || 5
|-
| ''e''*, ''e''+, ''e''? || 4
|-
| &''e'', !''e'' || 3
|-
| ''e''<sub>1</sub> ''e''<sub>2</sub> || 2
|-
| ''e''<sub>1</sub> / ''e''<sub>2</sub> || 1
|}

==== Grammars ====
In the concrete syntax, a parsing expression grammar is simply a sequence of nonterminal definitions, each of which has the form

<syntaxhighlight lang="peg">
Identifier LEFTARROW Expression
</syntaxhighlight>

The <code>Identifier</code> is the nonterminal being defined, and the <code>Expression</code> is the parsing expression that it is defined to denote. The <code>LEFTARROW</code> varies a bit between dialects, but is generally some left-pointing arrow or assignment symbol, such as <code><-</code>, <code>←</code>, <code>:=</code>, or <code>=</code>. One way to understand it is precisely as making an assignment or definition of the nonterminal. Another way to understand it is as a contrast to the right-pointing arrow → used in the rules of a [[context-free grammar]]; with parsing expressions the flow of information goes from expression to nonterminal, not from nonterminal to expression.

As a mathematical object, a parsing expression grammar is a tuple <math>(N,\Sigma,P,e_S)</math>, where <math>N</math> is the set of nonterminal symbols, <math>\Sigma</math> is the set of terminal symbols, <math>P</math> is a [[Function (mathematics)|function]] from <math>N</math> to the set of parsing expressions on <math>N \cup \Sigma</math>, and <math>e_S</math> is the starting parsing expression.
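As a small illustration of this tuple, consider the following two-rule grammar in concrete syntax. This is a sketch only: the grammar and the nonterminal names <code>Sum</code> and <code>Digit</code> are invented for the example, and <code><-</code> is assumed as the <code>LEFTARROW</code>.

<syntaxhighlight lang="peg">
Sum   <- Digit ("+" Digit)*
Digit <- [0-9]
</syntaxhighlight>

Read as a mathematical object, and taking <code>Sum</code> as the starting expression, this corresponds to <math>N = \{\text{Sum}, \text{Digit}\}</math>, <math>P(\text{Sum}) = \text{Digit}\;({+}\;\text{Digit})^*</math>, <math>P(\text{Digit}) = 0 / 1 / \cdots / 9</math>, and <math>e_S = \text{Sum}</math>, with <math>\Sigma</math> being any character set that contains the ten digits and the plus sign.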
Some concrete syntax dialects give the starting expression explicitly,<ref name="ptKupries">{{cite web |last1=Kupries |first1=Andreas |title=pt::peg_language - PEG Language Tutorial |url=https://core.tcl-lang.org/tcllib/doc/tcllib-1-21/embedded/md/tcllib/files/modules/pt/pt_peg_language.md |website=Tcl Library Source Code |access-date=14 January 2024}}</ref> but the primary concrete syntax instead has the implicit rule that the first nonterminal defined is the starting expression.

The primary dialect of concrete syntax parsing expression grammars has no explicit terminator or separator between definitions, although it is customary to begin each new definition on a new line. The <code>LEFTARROW</code> of the next definition is sufficient for finding the boundary, provided one adds the constraint that a nonterminal in an <code>Expression</code> must not be followed by a <code>LEFTARROW</code>. However, some dialects allow an explicit terminator, or outright require one.<ref name="ptKupries"/>
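
The conventions described in this section can be illustrated with a short grammar in the primary concrete syntax. This is a sketch only: the nonterminal names <code>List</code> and <code>Item</code> are hypothetical, and <code><-</code> is assumed as the <code>LEFTARROW</code>.

<syntaxhighlight lang="peg">
List <- Item ("," Item)* !.
Item <- [a-z]+ / '"' (!'"' .)* '"'
</syntaxhighlight>

Since <code>List</code> is defined first, it is the starting expression under the implicit rule. No terminator separates the two definitions: the expression for <code>List</code> ends before the second <code>Item</code> identifier, because that identifier is followed by <code><-</code>. Under the same constraint, the grammar would in principle be read identically if both definitions were written on a single line. By the operator priorities, the body of <code>Item</code> is an ordered choice between <code>[a-z]+</code> and a three-element sequence, and <code>!'"' .</code> is a sequence of a not-predicate followed by <code>.</code>, to which the parentheses allow <code>*</code> to be applied as a whole.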