== Obstacles ==
Typically, lexical tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often, a tokenizer relies on simple heuristics, for example:
* Punctuation and whitespace may or may not be included in the resulting list of tokens.
* All contiguous strings of alphabetic characters are part of one token; likewise with numbers.
* Tokens are separated by [[whitespace character|whitespace]] characters, such as a space or line break, or by punctuation characters.
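
The heuristics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name <code>simple_tokenize</code>, the <code>keep_punctuation</code> parameter, and the regular expression are assumptions made for the example. Whitespace is always discarded here, and punctuation is kept or dropped according to the flag.

```python
import re

# Assumed pattern for this sketch: a run of letters, a run of digits,
# or a single punctuation character (anything that is neither a word
# character nor whitespace, plus the underscore).
_TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def simple_tokenize(text, keep_punctuation=True):
    """Split text into letter runs, digit runs, and punctuation marks."""
    # findall skips whatever matches none of the alternatives, which
    # here is whitespace only, so whitespace acts as a separator.
    tokens = _TOKEN_RE.findall(text)
    if not keep_punctuation:
        # Keep only tokens that start with a letter or a digit.
        tokens = [t for t in tokens if t[0].isalnum()]
    return tokens
```

For instance, <code>simple_tokenize("Hello, world! 42abc")</code> yields <code>Hello</code>, <code>,</code>, <code>world</code>, <code>!</code>, <code>42</code> and <code>abc</code>: the letters and digits in <code>42abc</code> end up in separate tokens, as the second heuristic prescribes.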

In languages that use inter-word spaces (such as most that use the Latin alphabet, and most programming languages), this approach is fairly straightforward. However, even here there are many edge cases such as [[Poetic contraction|contractions]], [[hyphen]]ated words, [[emoticon]]s, and larger constructs such as [[URI]]s (which for some purposes may count as single tokens). A classic example is "New York-based", which a naive tokenizer may break at the space even though the better break is (arguably) at the hyphen.
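
The edge cases mentioned above are easy to reproduce. The following sketch applies a deliberately naive rule (split on whitespace and treat each punctuation character as its own token) to a few inputs; the helper name <code>split_on_space_and_punct</code> and the sample strings are assumptions chosen for illustration.

```python
import re


def split_on_space_and_punct(text):
    """Naive rule: word-character runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical sample inputs covering a compound, a contraction and a URI.
samples = ["New York-based", "don't", "http://example.com/a"]

for sample in samples:
    print(sample, "->", split_on_space_and_punct(sample))
```

Under this rule "New York-based" becomes four tokens (<code>New</code>, <code>York</code>, <code>-</code>, <code>based</code>), the contraction is split around its apostrophe, and the URI is broken into nine pieces, none of which matches the unit a reader would expect.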

Tokenization is particularly difficult for languages such as [[Ancient Greek]], [[Chinese language|Chinese]],<ref>Huang, C., Simon, P., Hsieh, S., & Prevot, L. (2007) [http://www.aclweb.org/anthology/P/P07/P07-2018.pdf Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Word break Identification]</ref> and [[Thai language|Thai]], which are written in [[scriptio continua]] and therefore mark no word boundaries. [[Agglutinative language]]s, such as Korean, also complicate tokenization.

Some ways to address the more difficult problems include developing more complex heuristics, querying a table of common special cases, or fitting the tokens to a [[language model]] that identifies collocations in a later processing step.
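
One possible way to realise the "table of common special cases" is to try the table entries before falling back to the generic rule. The sketch below does this with a regular-expression alternation; the table contents, the function names and the fallback pattern are all assumptions made for the example, and a practical table would be far larger and usually language-specific.

```python
import re

# Hypothetical table of special cases, for illustration only.
SPECIAL_CASES = ["New York", "don't", "e.g."]

# Assumed generic fallback: word-character runs or single punctuation marks.
_GENERIC = r"\w+|[^\w\s]"


def build_pattern(special_cases):
    """Compile a pattern that tries table entries before the generic rule."""
    if not special_cases:
        return re.compile(_GENERIC)
    # Longest entries first, so that the longest special case wins.
    ordered = sorted(special_cases, key=len, reverse=True)
    protected = "|".join(re.escape(entry) for entry in ordered)
    return re.compile(rf"(?:{protected})|{_GENERIC}")


def tokenize_with_table(text, special_cases=SPECIAL_CASES):
    """Tokenize text, keeping every table entry as a single token."""
    return build_pattern(special_cases).findall(text)
```

With this table, <code>tokenize_with_table("a New York-based firm")</code> keeps <code>New York</code> as one token and breaks at the hyphen instead of the space. The approach only covers cases listed in advance, which is why it is usually combined with the other techniques mentioned above.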