== Computer languages ==
{{unreferenced section|date=February 2013}}

=== Parser ===
{{Redirect|Parser|the scripting language named Parser|Parser (programming language)}}

A '''parser''' is a software component that takes input data (typically text) and builds a [[data structure]] – often some kind of [[parse tree]], [[abstract syntax tree]] or other hierarchical structure, giving a structural representation of the input while checking for correct syntax. The parsing may be preceded or followed by other steps, or these may be combined into a single step. The parser is often preceded by a separate [[Lexical analysis|lexical analyser]], which creates tokens from the sequence of input characters; alternatively, these can be combined in [[scannerless parsing]]. Parsers may be programmed by hand or may be automatically or semi-automatically generated by a [[parser generator]].

Parsing is complementary to [[templating language|templating]], which produces formatted ''output''. These may be applied to different domains, but often appear together, such as the [[scanf]]/[[printf]] pair, or the input (front end parsing) and output (back end code generation) stages of a [[compiler]].

The input to a parser is typically text in some [[computer language]], but may also be text in a natural language or less structured textual data, in which case generally only certain parts of the text are extracted, rather than a parse tree being constructed. Parsers range from very simple functions such as [[scanf]] to complex programs such as the frontend of a [[C++ compiler]] or the [[HTML]] parser of a [[web browser]].

An important class of simple parsing is done using [[regular expression]]s, in which a group of regular expressions defines a [[regular language]] and a regular expression engine automatically generates a parser for that language, allowing [[pattern matching]] and extraction of text.
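As a minimal illustrative sketch (not part of the original article), Python's standard <code>re</code> module can act as such a regular expression engine: a single regular expression defines a small regular language (here, dates of the form YYYY-MM-DD), and the engine matches and extracts text in that language.

<syntaxhighlight lang="python">
import re

# A regular expression defining a simple regular language:
# dates of the form YYYY-MM-DD.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

text = "Released on 2013-02-14, updated 2014-06-01."

# The engine extracts each match as a (year, month, day) tuple,
# rather than building a full parse tree.
matches = date_pattern.findall(text)
print(matches)  # [('2013', '02', '14'), ('2014', '06', '01')]
</syntaxhighlight>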
In other contexts regular expressions are instead used prior to parsing, as the lexing step whose output is then used by the parser.

The use of parsers varies by input. In the case of data languages, a parser is often found as the file reading facility of a program, such as reading in HTML or [[XML]] text; these examples are [[markup language]]s. In the case of [[programming language]]s, a parser is a component of a [[compiler]] or [[Interpreter (computing)|interpreter]], which parses the [[source code]] of a [[computer programming language]] to create some form of internal representation; the parser is a key step in the [[compiler frontend]].

Programming languages tend to be specified in terms of a [[deterministic context-free grammar]] because fast and efficient parsers can be written for them. For compilers, the parsing itself can be done in one pass or multiple passes – see [[one-pass compiler]] and [[multi-pass compiler]].

The implied disadvantages of a one-pass compiler can largely be overcome by adding [[Relocation (computing)|fix-ups]], where provision is made for code relocation during the forward pass, and the fix-ups are applied backwards when the current program segment has been recognized as having been completed. An example where such a fix-up mechanism would be useful would be a forward GOTO statement, where the target of the GOTO is unknown until the program segment is completed. In this case, the application of the fix-up would be delayed until the target of the GOTO was recognized. Conversely, a backward GOTO does not require a fix-up, as the location will already be known.

Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a language is limited.
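A minimal sketch (an illustrative example, not from the original article) of a parser serving as a program's file reading facility: Python's standard library XML parser turns markup text into an internal tree representation, which the program then traverses.

<syntaxhighlight lang="python">
import xml.etree.ElementTree as ET

# A parser acting as the reading facility for a data language (XML):
# the text is turned into a tree of element nodes.
xml_text = "<catalog><book id='1'>Parsing</book><book id='2'>Compilers</book></catalog>"
root = ET.fromstring(xml_text)

# The program works with the internal tree, not the raw text.
titles = [book.text for book in root.findall("book")]
print(titles)  # ['Parsing', 'Compilers']
</syntaxhighlight>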
The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out at the [[Semantic analysis (compilers)|semantic analysis]] (contextual analysis) step.

For example, in [[Python (programming language)|Python]] the following is syntactically valid code:

<syntaxhighlight lang="python">
x = 1
print(x)
</syntaxhighlight>

The following code, however, is syntactically valid in terms of the context-free grammar, yielding a syntax tree with the same structure as the previous, but violates the semantic rule requiring variables to be initialized before use:

<syntaxhighlight lang="python">
x = 1
print(y)
</syntaxhighlight>

=== Overview of process ===
[[File:Parser Flow.gif|right|Flow of data in a typical parser]]
The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.

The first stage is the token generation, or [[lexical analysis]], by which the input character stream is split into meaningful symbols defined by a grammar of [[regular expression]]s. For example, a calculator program would look at an input such as "<code>12 * (3 + 4)^2</code>" and split it into the tokens <code>12</code>, <code>*</code>, <code>(</code>, <code>3</code>, <code>+</code>, <code>4</code>, <code>)</code>, <code>^</code>, <code>2</code>, each of which is a meaningful symbol in the context of an arithmetic expression.
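The token-generation stage can be sketched as a small hand-written lexer driven by regular expressions; the following illustrative Python code (not part of the original article) splits the calculator input above into exactly those tokens.

<syntaxhighlight lang="python">
import re

# Token grammar for arithmetic expressions, given as regular expressions:
# either a number or a single operator/parenthesis, with leading whitespace skipped.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([*+^()]))")

def tokenize(text):
    """Split an input character stream into meaningful tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        # Whichever alternative matched (number or operator) is the token.
        tokens.append(m.group(1) or m.group(2))
        pos = m.end()
    return tokens

print(tokenize("12 * (3 + 4)^2"))
# ['12', '*', '(', '3', '+', '4', ')', '^', '2']
</syntaxhighlight>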
The lexer would contain rules to tell it that the characters <code>*</code>, <code>+</code>, <code>^</code>, <code>(</code> and <code>)</code> mark the start of a new token, so meaningless tokens like "<code>12*</code>" or "<code>(3</code>" will not be generated.

The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression. This is usually done with reference to a [[context-free grammar]] which recursively defines components that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with [[attribute grammar]]s.

The final phase is [[Semantic analysis (computer science)|semantic parsing]] or analysis, which is working out the implications of the expression just validated and taking the appropriate action.<ref>Berant, Jonathan, and Percy Liang. "[https://www.aclweb.org/anthology/P14-1133.pdf Semantic parsing via paraphrasing]." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2014.</ref> In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code. Attribute grammars can also be used to define these actions.
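The two later stages can be illustrated together with a hand-written [[recursive descent parser]], one function per grammar rule: syntactic analysis checks that the tokens form an allowable expression, while the semantic action attached to each rule evaluates it, as a calculator would. The code below is an illustrative sketch (not part of the original article); the grammar it assumes gives <code>^</code> the highest precedence and makes it right-associative.

<syntaxhighlight lang="python">
# Grammar (one rule per precedence level):
#   expr   -> term ('+' term)*
#   term   -> factor ('*' factor)*
#   factor -> base ('^' factor)?      right-associative exponentiation
#   base   -> NUMBER | '(' expr ')'
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def expr():
        value = term()
        while peek() == "+":       # expr -> term ('+' term)*
            eat("+")
            value += term()
        return value

    def term():
        value = factor()
        while peek() == "*":       # term -> factor ('*' factor)*
            eat("*")
            value *= factor()
        return value

    def factor():
        value = base()
        if peek() == "^":          # factor -> base ('^' factor)?
            eat("^")
            value **= factor()
        return value

    def base():
        if peek() == "(":          # base -> '(' expr ')'
            eat("(")
            value = expr()
            eat(")")
            return value
        return int(eat())          # base -> NUMBER

    result = expr()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return result

# Tokens of "12 * (3 + 4)^2", as produced by the lexical stage.
print(parse(["12", "*", "(", "3", "+", "4", ")", "^", "2"]))  # 588
</syntaxhighlight>

Here the semantic action is direct evaluation, as in an interpreter; a compiler would instead emit code (or build an abstract syntax tree) at each rule.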