== Details ==

=== Scanner ===
The first stage, the ''scanner'', is usually based on a [[finite-state machine]] (FSM). It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are termed [[#Lexeme|lexemes]]). For example, an ''integer'' lexeme may contain any sequence of [[numerical digit]] characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows; subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is termed the ''[[maximal munch]]'', or ''longest match'', rule). In some languages, the lexeme creation rules are more complex and may involve [[backtracking]] over previously read characters. For example, in C, one 'L' character is not enough to distinguish between an identifier that begins with 'L' and a wide-character string literal.

=== Evaluator ===
A [[#Lexeme|lexeme]], however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the ''evaluator'', which goes over the characters of the lexeme to produce a ''value''. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. Some tokens, such as parentheses, do not really have values, and so the evaluator function for these can return nothing: only the type is needed. Similarly, evaluators can sometimes suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments. The evaluators for identifiers are usually simple (literally representing the identifier), but may include some [[stropping (syntax)|unstropping]].
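The maximal-munch loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the article: the token names (INTEGER, IDENTIFIER, SYMBOL) and the `scan` function are hypothetical, and a real scanner would typically be driven by an FSM table rather than hand-written branches.

```python
# Minimal maximal-munch scanner sketch (illustrative names, not a real lexer):
# the first character decides the token kind, then characters are consumed
# one at a time until one falls outside the acceptable set for that kind.

def scan(source):
    """Yield (kind, lexeme) pairs using the longest-match rule."""
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():
            i += 1                                      # whitespace separates tokens
        elif ch.isdigit():
            start = i
            while i < n and source[i].isdigit():        # maximal munch: keep consuming digits
                i += 1
            yield ("INTEGER", source[start:i])
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (source[i].isalnum() or source[i] == "_"):
                i += 1
            yield ("IDENTIFIER", source[start:i])
        else:
            yield ("SYMBOL", ch)                        # single-character token
            i += 1

print(list(scan("count2 = 42")))
# → [('IDENTIFIER', 'count2'), ('SYMBOL', '='), ('INTEGER', '42')]
```

Note how `count2` is scanned as one identifier rather than an identifier followed by an integer: the loop keeps consuming as long as the next character is still acceptable, which is exactly the longest-match rule.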
The evaluators for [[integer literal]]s may pass the string on (deferring evaluation to the semantic analysis phase), or may perform evaluation themselves, which can be involved for different bases or floating-point numbers. For a simple quoted string literal, the evaluator needs to remove only the quotes, but the evaluator for an [[String literal#Escape sequences|escaped string literal]] incorporates a lexer, which unescapes the escape sequences.

For example, in the source code of a computer program, the string

: {{code|2=c|1=net_worth_future = (assets - liabilities);}}

might be converted into the following lexical token stream; whitespace is suppressed and special characters have no value:

 IDENTIFIER net_worth_future
 EQUALS
 OPEN_PARENTHESIS
 IDENTIFIER assets
 MINUS
 IDENTIFIER liabilities
 CLOSE_PARENTHESIS
 SEMICOLON

Lexers may be written by hand. This is practical if the list of tokens is small, but lexers generated by automated tooling as part of a [[compiler-compiler]] [[toolchain]] are more practical for a larger number of potential tokens. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a [[Formal grammar#The syntax of grammars|production rule]] in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed, or they may construct a [[state transition table]] for a [[finite-state machine]] (which is plugged into template code for compiling and executing).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an [[English language|English]]-based language, an IDENTIFIER token might be any English alphabetic character or an underscore, followed by any number of instances of ASCII alphanumeric characters and/or underscores. This could be represented compactly by the string {{code|[a-zA-Z_][a-zA-Z_0-9]*}}.
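The table of regular expressions that such tools accept can be imitated directly with Python's {{code|re}} module. The sketch below is illustrative only (the `TOKEN_SPEC` table and `tokenize` function are not from any real lexer generator); it reproduces the token stream shown above, suppressing whitespace and giving special characters no value:

```python
import re

# Hypothetical (type, regex) table in the style a lexer generator would accept.
TOKEN_SPEC = [
    ("IDENTIFIER",        r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("EQUALS",            r"="),
    ("OPEN_PARENTHESIS",  r"\("),
    ("CLOSE_PARENTHESIS", r"\)"),
    ("MINUS",             r"-"),
    ("SEMICOLON",         r";"),
    ("WHITESPACE",        r"\s+"),
]
# Combine the rules into one master regex with one named group per token type.
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    tokens = []
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind == "WHITESPACE":
            continue                    # suppressed: never reaches the parser
        # Special characters carry no value; identifiers keep their lexeme.
        value = m.group() if kind == "IDENTIFIER" else None
        tokens.append((kind, value))
    return tokens

for token in tokenize("net_worth_future = (assets - liabilities);"):
    print(token)
```

A production lexer generator would instead compile the same table into a single finite-state machine (or a state transition table), but the matching behavior is equivalent for these rules.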
This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".

Regular expressions and the finite-state machines they generate are not powerful enough to handle recursive patterns, such as "''n'' opening parentheses, followed by a statement, followed by ''n'' closing parentheses". They are unable to keep count and verify that ''n'' is the same on both sides, unless a finite set of permissible values exists for ''n''. It takes a full parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end (see example<ref>{{cite web|url=http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-31.html#%25_sec_5.1.4|title=Structure and Interpretation of Computer Programs|website=mitpress.mit.edu|access-date=2009-03-07|archive-url=https://web.archive.org/web/20121030233934/http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-31.html#%25_sec_5.1.4|archive-date=2012-10-30|url-status=dead}}</ref> in the ''[[Structure and Interpretation of Computer Programs]]'' book).
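The stack technique mentioned above can be sketched in a few lines (the `balanced` function is illustrative, not from the cited book): push each opening parenthesis, pop on each closing one, and require an empty stack at the end — exactly the counting that a finite-state machine cannot do for unbounded ''n''.

```python
# Sketch of the stack check described in the text: a parser-level technique
# that counts nesting depth, which no finite-state machine can do in general.
def balanced(text):
    stack = []
    for ch in text:
        if ch == "(":
            stack.append(ch)            # remember an unmatched open paren
        elif ch == ")":
            if not stack:
                return False            # close with no matching open
            stack.pop()
    return not stack                    # empty stack: every '(' was closed

print(balanced("((a)(b))"))   # True
print(balanced("((a)"))       # False: one '(' left unmatched
```

Equivalently, the stack here is just a counter; the point is that the counter is unbounded, whereas a finite-state machine has only finitely many states and so can only track nesting up to a fixed depth.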