== <span class="anchor" id="Tokenization"></span><span class="anchor" id="Token"></span>Lexical token and lexical tokenization==
{{Distinguish|Large language model#Tokenization|tokenization (data security)}}
<!--[[Lexical token]] and [[Token (parser)]] and [[Tokenize]] and [[Tokenizing]] redirect here ([[MOS:HEAD]])-->

A ''lexical token'' is a [[String (computer science)|string]] with an assigned and thus identified meaning, in contrast to the probabilistic token used in [[large language model]]s. A lexical token consists of a ''token name'' and an optional ''token value''. The token name is a category of a rule-based lexical unit.<ref name="auto">page 111, "Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, as quoted in https://stackoverflow.com/questions/14954721/what-is-the-difference-between-token-and-lexeme</ref>

{| class="wikitable"
|+ Examples of common tokens
! Token name (lexical category) !! Explanation !! Sample token values
|-
| [[Identifier (computer languages)|identifier]] || Names assigned by the programmer. || {{code|x}}, {{code|color}}, {{code|UP}}
|-
| [[Reserved word|keyword]] || Reserved words of the language. || {{code|2=c|if}}, {{code|2=c|while}}, {{code|2=c|return}}
|-
| [[delimiter|separator/punctuator]] || Punctuation characters and paired delimiters. || <code>}</code>, <code>(</code>, <code>;</code>
|-
| [[Operator (computer programming)|operator]] || Symbols that operate on arguments and produce results. || {{code|2=c|1=+}}, {{code|2=c|1=<}}, {{code|2=c|1==}}
|-
| [[Literal (computer programming)|literal]] || Numeric, logical, textual, and reference literals. || {{code|2=c|true}}, {{code|2=c|6.02e23}}, {{code|2=c|"music"}}
|-
| [[Comment (computer programming)|comment]] || Line or block comments. Usually discarded. || {{code|2=c|/* Retrieves user data */}}, {{code|2=c|// must be negative}}
|-
| [[Whitespace character|whitespace]] || Groups of non-printable characters. Usually discarded. || β
|}

Consider this expression in the [[C (programming language)|C]] programming language:

: {{code|2=c|1=x = a + b * 2;}}

The lexical analysis of this expression yields the following sequence of tokens:

: <code>[(identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b), (operator, *), (literal, 2), (separator, ;)]</code>

A token name is what might be termed a [[part of speech]] in linguistics. ''Lexical tokenization'' is the conversion of a raw text into (semantically or syntactically) meaningful lexical tokens, belonging to categories defined by a "lexer" program, such as identifiers, operators, grouping symbols, and data types. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of [[parsing]] input.

For example, in the text [[String (computer science)|string]]:

: <code>The quick brown fox jumps over the lazy dog</code>

the string is not implicitly segmented on spaces, as a [[natural language]] speaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string <code>" "</code> or [[regular expression]] <code>/\s{1}/</code>).

When a token class represents more than one possible lexeme, the lexer often saves enough information to reproduce the original lexeme, so that it can be used in [[Semantic analysis (compilers)|semantic analysis]]. The parser typically retrieves this information from the lexer and stores it in the [[abstract syntax tree]]. This is necessary to avoid information loss in cases where numbers may also be valid identifiers.

Tokens are identified based on the specific rules of the lexer. Some methods used to identify tokens include [[regular expression]]s, specific sequences of characters termed a [[Flag (computing)|flag]], specific separating characters called [[delimiter]]s, and explicit definition by a dictionary.
Special characters, including punctuation characters, are commonly used by lexers to identify tokens because of their natural use in written and programming languages.

A lexical analyzer generally does nothing with combinations of tokens, a task left for a [[parser]]. For example, a typical lexical analyzer recognizes parentheses as tokens but does nothing to ensure that each "(" is matched with a ")".

When a lexer feeds tokens to the parser, the representation used is typically an [[enumerated type]], in which each token category is represented by a number. For example, "identifier" can be represented with 0, "assignment operator" with 1, "addition operator" with 2, and so on.

Tokens are often defined by [[regular expression]]s, which are understood by a lexical analyzer generator such as [[lex (software)|lex]], or by handcoded equivalent [[finite-state automata]]. The lexical analyzer (generated automatically by a tool like lex or hand-crafted) reads in a stream of characters, identifies the [[#Lexeme|lexemes]] in the stream, and categorizes them into tokens. This is termed ''tokenizing''. If the lexer finds an invalid token, it reports an error.

Following tokenizing is [[parsing]]. From there, the interpreted data may be loaded into data structures for general use, interpretation, or [[compiling]].