{{redirect|Lexer|people with this name|Lexer (surname)}}
{{short description|Conversion of character sequences into token sequences in computer science}}

'''Lexical tokenization''' is the conversion of a text into (semantically or syntactically) meaningful ''lexical tokens'' belonging to categories defined by a "lexer" program. In the case of a natural language, those categories include nouns, verbs, adjectives, punctuation, etc. In the case of a programming language, the categories include [[Identifier (computer languages)|identifiers]], [[Operator (computer programming)|operators]], [[Symbols of grouping|grouping symbols]], [[data type]]s and language keywords. Lexical tokenization is related to the type of tokenization used in [[large language model]]s (LLMs), but differs from it in two ways. First, lexical tokenization is usually based on a [[lexical grammar]], whereas LLM tokenizers are usually [[probability]]-based. Second, LLM tokenizers perform a second step that converts the tokens into numerical values.
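
As an illustration of grammar-based tokenization for a programming language, the following is a minimal sketch in Python of a lexer driven by regular expressions. The category names, the keyword set, and the names <code>Token</code>, <code>TOKEN_SPEC</code> and <code>tokenize</code> are assumptions chosen for this example and do not correspond to any particular language or lexer generator.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    category: str  # e.g. "IDENTIFIER", "OPERATOR"
    text: str      # the matched substring (the lexeme)


# Hypothetical keyword set for a toy language.
KEYWORDS = {"if", "else", "while", "return"}

# A toy lexical grammar: one regular expression per token category.
# Order matters, because earlier alternatives are tried first.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR", r"[+\-*/=<>!]=?"),
    ("GROUPING", r"[()\[\]{}]"),
    ("SKIP", r"\s+"),      # whitespace separates tokens but is not one
    ("MISMATCH", r"."),    # any other character is an error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    """Yield the tokens of `source`, each tagged with its category."""
    for match in MASTER.finditer(source):
        category, text = match.lastgroup, match.group()
        if category == "SKIP":
            continue
        if category == "MISMATCH":
            raise SyntaxError(f"unexpected character {text!r} at offset {match.start()}")
        if category == "IDENTIFIER" and text in KEYWORDS:
            category = "KEYWORD"  # keywords are reserved identifiers
        yield Token(category, text)


if __name__ == "__main__":
    for token in tokenize("if (x1 >= 42) return y"):
        print(token.category, token.text)
```

Under these assumptions, the input <code>if (x1 >= 42) return y</code> is split into eight tokens: two keywords, two grouping symbols, two identifiers, one operator and one number. Unlike an LLM tokenizer, this sketch stops at categorized tokens and does not map them to numerical values.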