==Automation==
When information is stored in digital systems, collation may become an automated process. It is then necessary to implement an appropriate collation [[algorithm]] that allows the information to be sorted in a satisfactory manner for the application in question. Often the aim will be to achieve an alphabetical or numerical ordering that follows the standard criteria as described in the preceding sections. However, not all of these criteria are easy to automate.<ref name="Walters">[https://books.google.com/books?id=5Pd_iFM4eLsC&dq=%22collation+algorithms%22&pg=PA278 ''M Programming: A Comprehensive Guide''], Richard F. Walters, Digital Press, 1997</ref>

The simplest kind of automated collation is based on the numerical codes of the symbols in a [[character set]], such as [[ASCII]] (or any of its [[superset]]s, such as [[Unicode]]): the symbols are ordered by increasing code value, and this ordering is extended to strings in accordance with the basic principles of alphabetical ordering (mathematically speaking, [[lexicographical order]]ing). So a computer program might treat the characters ''a'', ''b'', ''C'', ''d'', and ''$'' as being ordered ''$'', ''C'', ''a'', ''b'', ''d'' (the corresponding ASCII codes are ''$'' = 36, ''C'' = 67, ''a'' = 97, ''b'' = 98, and ''d'' = 100). Therefore, strings beginning with ''C'', ''M'', or ''Z'' would be sorted before strings beginning with lower-case ''a'', ''b'', etc. This is sometimes called ''[[ASCIIbetical order]]''. It deviates from the standard alphabetical order, particularly because all capital letters are ordered before all lower-case ones (and possibly also in the treatment of spaces and other non-letter characters).
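
The ordering by character code described above can be reproduced in a few lines. The following is a minimal sketch in Python, not a reference implementation: it relies on the fact that Python compares strings by Unicode code point, which coincides with ASCII for these characters. The word list is hypothetical sample data, and <code>str.upper</code> is used as one possible case conversion applied before comparison.

```python
# The characters discussed in the text.
chars = ["a", "b", "C", "d", "$"]

# Python compares str values by Unicode code point, which matches ASCII
# for these characters, so sorted() yields the "ASCIIbetical" order.
by_code = sorted(chars)

# The numeric codes that drive that ordering.
codes = {c: ord(c) for c in by_code}

# Hypothetical sample words: every capitalised word sorts before every
# lower-case word when raw character codes are compared.
words = ["apple", "Zebra", "Mango", "cherry"]
plain = sorted(words)

# One common alteration: convert case before comparing. The original
# strings are returned unchanged; only the comparison uses the key.
folded = sorted(words, key=str.upper)

print(by_code)  # ['$', 'C', 'a', 'b', 'd']
print(codes)    # {'$': 36, 'C': 67, 'a': 97, 'b': 98, 'd': 100}
print(plain)    # ['Mango', 'Zebra', 'apple', 'cherry']
print(folded)   # ['apple', 'cherry', 'Mango', 'Zebra']
```

Case conversion alone only addresses the upper-case/lower-case split; it does nothing for accented letters, digraphs, or language-specific conventions.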
Ordering by character code is therefore often applied with certain alterations, the most obvious being case conversion (often to uppercase, for historical reasons<ref group="note">Historically, early computers often handled text only in uppercase (this dates back to [[telegraph]] conventions).</ref>) before comparison of ASCII values.

In many collation algorithms, the comparison is based not on the numerical codes of the characters, but on a '''collating sequence''' (a sequence in which the characters are assumed to come for the purpose of collation) together with other ordering rules appropriate to the given application. This can serve to apply the correct conventions used for alphabetical ordering in the language in question, dealing properly with differently cased letters, [[modified letter]]s, [[digraph (orthography)|digraphs]], particular abbreviations, and so on, as mentioned above under [[#Alphabetical order|Alphabetical order]], and in detail in the [[Alphabetical order]] article. Such algorithms are potentially quite complex, possibly requiring several passes through the text.<ref name="Walters"/>

Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in [[German (language)|German]] dictionaries the word ''ökonomisch'' comes between ''offenbar'' and ''olfaktorisch'', while [[Turkish language|Turkish]] dictionaries treat ''o'' and ''ö'' as different letters, placing ''oyun'' before ''öbür''.

A standard algorithm for collating any collection of strings composed of any standard [[Unicode]] symbols is the [[Unicode Collation Algorithm]]. This can be adapted to use the appropriate collation sequence for a given language by tailoring its default collation table. Several such tailorings are collected in the [[Common Locale Data Repository]].

===Sort keys===
In some applications, the strings by which items are collated may differ from the identifiers that are displayed.
For example, ''The Shining'' might be [[sorting|sorted]] as ''Shining, The'' (see [[#Alphabetical order|Alphabetical order]] above), but it may still be desired to display it as ''The Shining''. In this case two sets of strings can be stored, one for display purposes, and another for collation purposes. Strings used for collation in this way are called ''sort keys''.

===Issues with numbers===
Sometimes, it is desired to order text with embedded numbers using proper numerical order. For example, "Figure 7b" goes before "Figure 11a", even though '7' comes after '1' in [[Unicode]]. This can be extended to [[Roman numeral]]s. This behavior is not particularly difficult to produce as long as only integers are to be sorted, although it can slow down sorting significantly. For example, [[Microsoft Windows]] does this when sorting [[file name]]s.

Sorting decimals properly is a bit more difficult, because different locales use different symbols for a [[decimal separator|decimal point]], and sometimes the same character used as a [[Decimal mark|decimal point]] is also used as a separator, for example "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application dependent.
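
Both sort keys and numeric ordering of embedded integers can be expressed as functions that derive a comparison key from the displayed string. The following Python sketch is illustrative only: the function names, the list of articles, and the sample data are assumptions made for this example, and it handles only unsigned integers written with ASCII digits (not decimals, separators, or Roman numerals).

```python
import re

# Hypothetical list of leading articles; a real application would
# choose these per language.
ARTICLES = ("the", "a", "an")


def title_sort_key(title):
    """Return a sort key with a leading article moved to the end."""
    first, _, rest = title.partition(" ")
    if rest and first.lower() in ARTICLES:
        return f"{rest}, {first}".casefold()
    return title.casefold()


def natural_sort_key(text):
    """Split text into alternating text and digit runs.

    re.split with a capturing group always returns text runs at even
    indexes and the captured digit runs at odd indexes, so keys built
    this way compare str with str and int with int.
    """
    parts = re.split(r"([0-9]+)", text)
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]


titles = ["Solaris", "The Shining", "Misery"]
# The displayed strings are kept as they are; only the ordering uses the key.
titles_sorted = sorted(titles, key=title_sort_key)

figures = ["Figure 11a", "Figure 7b", "Figure 7a"]
figures_plain = sorted(figures)
figures_sorted = sorted(figures, key=natural_sort_key)

print(titles_sorted)   # ['Misery', 'The Shining', 'Solaris']
print(figures_plain)   # ['Figure 11a', 'Figure 7a', 'Figure 7b']
print(figures_sorted)  # ['Figure 7a', 'Figure 7b', 'Figure 11a']
```

Computing the key once per item and storing it alongside the display string avoids recomputing it on every comparison, which is one way to limit the slowdown mentioned above.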