== Details ==

=== Scanner ===
The first stage, the ''scanner'', is usually based on a [[finite-state machine]] (FSM). It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are termed [[#Lexeme|lexemes]]). For example, an ''integer'' lexeme may contain any sequence of [[numerical digit]] characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows; subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is termed the ''[[maximal munch]]'', or ''longest match'', rule). In some languages, the lexeme creation rules are more complex and may involve [[backtracking]] over previously read characters. For example, in C, one 'L' character is not enough to distinguish between an identifier that begins with 'L' and a wide-character string literal.

=== Evaluator ===
A [[#Lexeme|lexeme]], however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the ''evaluator'', which goes over the characters of the lexeme to produce a ''value''. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. Some tokens, such as parentheses, do not really have values, and so the evaluator function for these can return nothing: only the type is needed. Similarly, evaluators can sometimes suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments. The evaluators for identifiers are usually simple (literally representing the identifier), but may include some [[stropping (syntax)|unstropping]].
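The maximal-munch loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the article: the token names (INTEGER, IDENTIFIER, SYMBOL) and the `scan` function are hypothetical, and a real scanner would typically be driven by an FSM table rather than hand-written branches.

```python
# Minimal maximal-munch scanner sketch (illustrative names, not a real lexer):
# the first character decides the token kind, then characters are consumed
# one at a time until one falls outside the acceptable set for that kind.

def scan(source):
    """Yield (kind, lexeme) pairs using the longest-match rule."""
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():
            i += 1                                      # whitespace separates tokens
        elif ch.isdigit():
            start = i
            while i < n and source[i].isdigit():        # maximal munch: keep consuming digits
                i += 1
            yield ("INTEGER", source[start:i])
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (source[i].isalnum() or source[i] == "_"):
                i += 1
            yield ("IDENTIFIER", source[start:i])
        else:
            yield ("SYMBOL", ch)                        # single-character token
            i += 1

print(list(scan("count2 = 42")))
# → [('IDENTIFIER', 'count2'), ('SYMBOL', '='), ('INTEGER', '42')]
```

Note how `count2` is scanned as one identifier rather than an identifier followed by an integer: the loop keeps consuming as long as the next character is still acceptable, which is exactly the longest-match rule.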
The evaluators for [[integer literal]]s may pass the string on (deferring evaluation to the semantic analysis phase), or may perform evaluation themselves, which can be involved for different bases or floating-point numbers. For a simple quoted string literal, the evaluator needs to remove only the quotes, but the evaluator for an [[String literal#Escape sequences|escaped string literal]] incorporates a lexer, which unescapes the escape sequences.

For example, in the source code of a computer program, the string

: {{code|2=c|1=net_worth_future = (assets - liabilities);}}

might be converted into the following lexical token stream; whitespace is suppressed and special characters have no value:

 IDENTIFIER net_worth_future
 EQUALS
 OPEN_PARENTHESIS
 IDENTIFIER assets
 MINUS
 IDENTIFIER liabilities
 CLOSE_PARENTHESIS
 SEMICOLON

Lexers may be written by hand. This is practical if the list of tokens is small, but lexers generated by automated tooling as part of a [[compiler-compiler]] [[toolchain]] are more practical for a larger number of potential tokens. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a [[Formal grammar#The syntax of grammars|production rule]] in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed, or they may construct a [[state transition table]] for a [[finite-state machine]] (which is plugged into template code for compiling and executing).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an [[English language|English]]-based language, an IDENTIFIER token might be any English alphabetic character or an underscore, followed by any number of instances of ASCII alphanumeric characters and/or underscores. This could be represented compactly by the string {{code|[a-zA-Z_][a-zA-Z_0-9]*}}.
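The table of regular expressions that such tools accept can be imitated directly with Python's {{code|re}} module. The sketch below is illustrative only (the `TOKEN_SPEC` table and `tokenize` function are not from any real lexer generator); it reproduces the token stream shown above, suppressing whitespace and giving special characters no value:

```python
import re

# Hypothetical (type, regex) table in the style a lexer generator would accept.
TOKEN_SPEC = [
    ("IDENTIFIER",        r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("EQUALS",            r"="),
    ("OPEN_PARENTHESIS",  r"\("),
    ("CLOSE_PARENTHESIS", r"\)"),
    ("MINUS",             r"-"),
    ("SEMICOLON",         r";"),
    ("WHITESPACE",        r"\s+"),
]
# Combine the rules into one master regex with one named group per token type.
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    tokens = []
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind == "WHITESPACE":
            continue                    # suppressed: never reaches the parser
        # Special characters carry no value; identifiers keep their lexeme.
        value = m.group() if kind == "IDENTIFIER" else None
        tokens.append((kind, value))
    return tokens

for token in tokenize("net_worth_future = (assets - liabilities);"):
    print(token)
```

A production lexer generator would instead compile the same table into a single finite-state machine (or a state transition table), but the matching behavior is equivalent for these rules.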
This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".

Regular expressions and the finite-state machines they generate are not powerful enough to handle recursive patterns, such as "''n'' opening parentheses, followed by a statement, followed by ''n'' closing parentheses". They are unable to keep count and verify that ''n'' is the same on both sides, unless a finite set of permissible values exists for ''n''. It takes a full parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end (see example<ref>{{cite web|url=http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-31.html#%25_sec_5.1.4|title=Structure and Interpretation of Computer Programs|website=mitpress.mit.edu|access-date=2009-03-07|archive-url=https://web.archive.org/web/20121030233934/http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-31.html#%25_sec_5.1.4|archive-date=2012-10-30|url-status=dead}}</ref> in the ''[[Structure and Interpretation of Computer Programs]]'' book).
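The stack technique mentioned above can be sketched in a few lines (the `balanced` function is illustrative, not from the cited book): push each opening parenthesis, pop on each closing one, and require an empty stack at the end — exactly the counting that a finite-state machine cannot do for unbounded ''n''.

```python
# Sketch of the stack check described in the text: a parser-level technique
# that counts nesting depth, which no finite-state machine can do in general.
def balanced(text):
    stack = []
    for ch in text:
        if ch == "(":
            stack.append(ch)            # remember an unmatched open paren
        elif ch == ")":
            if not stack:
                return False            # close with no matching open
            stack.pop()
    return not stack                    # empty stack: every '(' was closed

print(balanced("((a)(b))"))   # True
print(balanced("((a)"))       # False: one '(' left unmatched
```

Equivalently, the stack here is just a counter; the point is that the counter is unbounded, whereas a finite-state machine has only finitely many states and so can only track nesting up to a fixed depth.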