==Automation==
When information is stored in digital systems, collation may become an automated process. It is then necessary to implement an appropriate collation [[algorithm]] that allows the information to be sorted in a satisfactory manner for the application in question. Often the aim will be to achieve an alphabetical or numerical ordering that follows the standard criteria as described in the preceding sections. However, not all of these criteria are easy to automate.<ref name="Walters">[https://books.google.com/books?id=5Pd_iFM4eLsC&dq=%22collation+algorithms%22&pg=PA278 ''M Programming: A Comprehensive Guide''], Richard F. Walters, Digital Press, 1997</ref>

The simplest kind of automated collation is based on the numerical codes of the symbols in a [[character set]], such as [[ASCII]] coding (or any of its [[superset]]s such as [[Unicode]]). The symbols are ordered in increasing numerical order of their codes, and this ordering is extended to strings in accordance with the basic principles of alphabetical ordering (mathematically speaking, [[lexicographical order]]ing). So a computer program might treat the characters ''a'', ''b'', ''C'', ''d'', and ''$'' as being ordered ''$'', ''C'', ''a'', ''b'', ''d'' (the corresponding ASCII codes are ''$'' = 36, ''C'' = 67, ''a'' = 97, ''b'' = 98, and ''d'' = 100). Therefore, strings beginning with ''C'', ''M'', or ''Z'' would be sorted before strings beginning with lower-case ''a'', ''b'', etc. This is sometimes called ''[[ASCIIbetical order]]''. It deviates from the standard alphabetical order, particularly because all capital letters are ordered before all lower-case ones (and possibly because of the treatment of spaces and other non-letter characters). It is therefore often applied with certain alterations, the most obvious being case conversion (often to uppercase, for historical reasons<ref group="note">Historically, computers only handled text in uppercase (this dates back to [[telegraph]] conventions).</ref>) before comparison of ASCII values.
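As a minimal illustration of the two behaviours described above, the following Python sketch sorts the same characters first by raw code point and then after case conversion. The variable names and the sample word list are chosen here for illustration only.

```python
chars = ["a", "b", "C", "d", "$"]

# Plain code-point comparison:
# '$' (36) < 'C' (67) < 'a' (97) < 'b' (98) < 'd' (100).
by_code = sorted(chars)

# Case conversion before comparison. Only the comparison key is
# upper-cased; the original strings are returned unchanged.
by_upper = sorted(chars, key=str.upper)

# The same effect on whole strings (sample data, not from any source).
words = ["apple", "Zebra", "banana", "Cherry"]
asciibetical = sorted(words)                 # capitals before lower case
case_folded = sorted(words, key=str.upper)   # case ignored

print(by_code)
print(by_upper)
print(asciibetical)
print(case_folded)
```

Case conversion only removes the upper-case/lower-case split; it does not address accented letters or language-specific rules.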

In many collation algorithms, the comparison is based not on the numerical codes of the characters, but on a '''collating sequence''' – a sequence in which the characters are assumed to come for the purpose of collation – as well as other ordering rules appropriate to the given application. This can serve to apply the correct conventions used for alphabetical ordering in the language in question, dealing properly with differently cased letters, [[modified letter]]s, [[digraph (orthography)|digraphs]], particular abbreviations, and so on, as mentioned above under [[#Alphabetical order|Alphabetical order]], and in detail in the [[Alphabetical order]] article. Such algorithms are potentially quite complex, possibly requiring several passes through the text.<ref name="Walters"/>

Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in [[German (language)|German]] dictionaries the word ''ökonomisch'' comes between ''offenbar'' and ''olfaktorisch'', while [[Turkish language|Turkish]] dictionaries treat ''o'' and ''ö'' as different letters, placing ''oyun'' before ''öbür''.
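The conflict can be made concrete with a small Python sketch that ranks characters by their position in an explicit collating sequence. Both tables below are simplified assumptions made for this example (lower-case letters only, no secondary distinctions), and the names <code>make_key</code>, <code>GERMAN</code>, and <code>TURKISH</code> are hypothetical; real implementations use far richer data.

```python
def make_key(collating_sequence):
    """Build a sort-key function from a collating sequence.

    Each element of collating_sequence is a string of characters that
    share one rank; earlier elements sort first.
    """
    rank = {}
    for position, group in enumerate(collating_sequence):
        for ch in group:
            rank[ch] = position
    unknown = len(collating_sequence)  # unlisted characters sort last

    def key(word):
        return [rank.get(ch, unknown) for ch in word]

    return key


# Simplified German table: umlauts share the rank of their base letter.
GERMAN = ["aä", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
          "n", "oö", "p", "q", "r", "s", "t", "uü", "v", "w", "x", "y", "z"]

# Simplified Turkish table: every letter, including ö, has its own rank.
TURKISH = list("abcçdefgğhıijklmnoöprsştuüvyz")

german_key = make_key(GERMAN)
turkish_key = make_key(TURKISH)

german_words = ["olfaktorisch", "ökonomisch", "offenbar"]
turkish_words = ["öbür", "oyun"]

print(sorted(german_words, key=german_key))
print(sorted(turkish_words, key=turkish_key))
print(sorted(turkish_words, key=german_key))  # same words, other table
```

With the German table the Turkish pair comes out in the opposite order, so no single table of this kind can serve both languages. Because the German table gives ''o'' and ''ö'' the same rank, words that differ only in the umlaut compare as equal here; a fuller algorithm would need a further tie-breaking level.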

A standard algorithm for collating any collection of strings composed of any standard [[Unicode]] symbols is the [[Unicode Collation Algorithm]]. This can be adapted to use the appropriate collating sequence for a given language by tailoring its default collation table. Several such tailorings are collected in the [[Common Locale Data Repository]].

===Sort keys===
In some applications, the strings by which items are collated may differ from the identifiers that are displayed. For example, ''The Shining'' might be [[sorting|sorted]] as ''Shining, The'' (see [[#Alphabetical order|Alphabetical order]] above), but it may still be desired to display it as ''The Shining''. In this case two sets of strings can be stored, one for display purposes, and another for collation purposes. Strings used for collation in this way are called ''sort keys''.
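A minimal Python sketch of this arrangement stores a sort key next to each display string and orders the records by the key alone. The function name <code>make_sort_key</code>, the list of leading articles, and the sample titles other than ''The Shining'' are assumptions made for this example; real cataloguing rules are more involved.

```python
LEADING_ARTICLES = ("the ", "an ", "a ")  # assumed English-only list


def make_sort_key(title):
    """Move a leading article to the end: 'The Shining' -> 'Shining, The'."""
    lowered = title.lower()
    for article in LEADING_ARTICLES:
        if lowered.startswith(article):
            head = title[:len(article)].strip()
            rest = title[len(article):]
            return f"{rest}, {head}"
    return title


titles = ["The Shining", "Carrie", "A Clockwork Orange", "Misery"]

# Each record keeps both strings: (sort key, display string).
records = [(make_sort_key(t), t) for t in titles]

# Collate on the sort key (case-insensitively), display the original.
records.sort(key=lambda record: record[0].lower())
display_order = [display for _, display in records]

print(display_order)
```

Storing the key, rather than recomputing it on every comparison, also allows it to be corrected by hand for titles that the simple rule handles badly.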

===Issues with numbers===
Sometimes, it is desired to order text with embedded numbers using proper numerical order. For example, "Figure 7b" goes before "Figure 11a", even though '7' comes after '1' in [[Unicode]]. [[Microsoft Windows]], for example, does this when sorting [[file name]]s. This can be extended to [[Roman numeral]]s. This behavior is not particularly difficult to produce as long as only integers are to be sorted, although it can slow down sorting significantly.
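One common way to obtain this ordering for integers is to split each string into alternating text and digit runs and to compare the digit runs as numbers. The Python sketch below shows the idea; the function name <code>natural_key</code> is chosen for this example, and the sketch handles neither Roman numerals nor signs or decimals.

```python
import re


def natural_key(text):
    """Split text into alternating non-digit and digit runs.

    re.split with a capturing group returns the runs in order, with the
    captured digit runs at the odd positions, so strings are always
    compared with strings and integers with integers.
    """
    parts = re.split(r"(\d+)", text)
    return [int(part) if index % 2 else part
            for index, part in enumerate(parts)]


labels = ["Figure 11a", "Figure 7b", "Figure 2", "Figure 10"]

plain_order = sorted(labels)                     # character by character
natural_order = sorted(labels, key=natural_key)  # digit runs as integers

print(plain_order)
print(natural_order)
```

The extra parsing for every string is one source of the slowdown mentioned above.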

Sorting decimals properly is a bit more difficult, because different locales use different symbols for a [[decimal separator|decimal point]], and sometimes the same character that is used as a decimal point is also used to separate the parts of a multi-level number, for example "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application dependent.
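The ambiguity can be seen by sorting the same dotted strings under two different readings. In the Python sketch below, the function names <code>as_section</code> and <code>as_decimal</code> and the sample labels are assumptions made for this example, and the full stop is assumed to be the only separator in use.

```python
from decimal import Decimal


def as_section(label):
    """Read '3.2.5' as a hierarchy of integers: (3, 2, 5)."""
    return tuple(int(part) for part in label.split("."))


def as_decimal(label):
    """Read '3.10' as the decimal number 3.10 (only one dot allowed)."""
    return Decimal(label)


labels = ["3.10", "3.9", "3.2"]

section_order = sorted(labels, key=as_section)  # 10 > 9 as integers
decimal_order = sorted(labels, key=as_decimal)  # 3.10 < 3.2 as decimals

print(section_order)
print(decimal_order)
```

Both results are correct under their own reading, which is why the choice has to be made by the application.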