===Additional passes===
The previous example is a two-pass sort: first sort, then merge. The sort ends with a single ''k''-way merge, rather than a series of two-way merge passes as in a typical in-memory merge sort. This is because each merge pass reads and writes ''every value'' from and to disk, so reducing the number of passes more than compensates for the additional cost of a ''k''-way merge.

The limitation of single-pass merging is that as the number of chunks increases, memory is divided into more buffers, so each buffer is smaller. Eventually, the reads become so small that more time is spent on [[disk seek]]s than on data transfer. A typical magnetic [[hard disk drive]] might have a 10 ms access time and a 100 MB/s data transfer rate, so each seek takes as much time as transferring 1 MB of data. Thus, for sorting, say, 50 GB in 100 MB of RAM, a single 500-way merge pass is inefficient: we can only read 100 MB / 501 ≈ 200 KB from each chunk at once, so 5/6 of the disk's time is spent seeking. Using two merge passes solves the problem. The sorting process might then look like this:
# Run the initial chunk-sorting pass as before to create 500×100 MB sorted chunks.
# Run a first merge pass combining 25×100 MB chunks at a time, resulting in 20×2.5 GB sorted chunks.
# Run a second merge pass to merge the 20×2.5 GB sorted chunks into a single 50 GB sorted result.

Although this requires an additional pass over the data, each read is now 4 MB long, so only 1/5 of the disk's time is spent seeking. The improvement in data transfer efficiency during the merge passes (from 16.6% to 80%, almost a 5× improvement) more than makes up for the doubled number of merge passes.

Variations include using an intermediate medium such as a [[solid-state disk]] for some stages; the fast temporary storage need not be big enough to hold the whole dataset, just substantially larger than available main memory. Repeating the example above with 1 GB of temporary SSD storage, the first pass could merge 10×100 MB sorted chunks read from that temporary space to write 50×1 GB sorted chunks to HDD. The high bandwidth and random-read throughput of SSDs help speed the first pass, and the HDD reads for the second pass can then be 2 MB, large enough that seeks will not take up most of the read time. SSDs can also be used as read buffers in a merge phase, allowing fewer, larger reads (20 MB reads in this example) from HDD storage. Given the lower cost of SSD capacity relative to RAM, SSDs can be an economical tool for sorting large inputs with very limited memory.

Like in-memory sorts, efficient external sorts require [[Big O notation|O]](''n'' log ''n'') time: exponentially growing datasets require linearly increasing numbers of passes that each take O(''n'') time.<ref>One way to see this is that given a fixed amount of memory (say, 1 GB) and a minimum read size (say, 2 MB), each merge pass can merge a certain number of runs (such as 500) into one, creating a divide-and-conquer situation similar to in-memory merge sort. The size of each main-memory sort and the number of ways in each merge have a constant upper bound, so they do not contribute to the big-O.</ref>
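As a rough illustration (not taken from the cited sources), the multi-pass scheme described above can be sketched in Python. The record format (newline-delimited text sorted lexicographically), the helper names <code>create_sorted_runs</code> and <code>merge_in_passes</code>, and the default fan-in of 25 are illustrative assumptions chosen to mirror the example, not tuned values.

<syntaxhighlight lang="python">
import heapq
import itertools
import os
import shutil
import tempfile


def create_sorted_runs(input_path, max_records_in_memory):
    """Sorting pass: read batches that fit in memory, sort each, write it as a run file."""
    run_paths = []
    with open(input_path) as infile:
        while True:
            batch = list(itertools.islice(infile, max_records_in_memory))
            if not batch:
                break
            batch.sort()  # in-memory sort of one chunk
            fd, run_path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run_file:
                run_file.writelines(batch)
            run_paths.append(run_path)
    return run_paths


def merge_in_passes(run_paths, output_path, fan_in=25):
    """Merge passes: combine at most fan_in runs at a time until a single run remains."""
    if not run_paths:  # empty input: just produce an empty output file
        open(output_path, "w").close()
        return
    while len(run_paths) > 1:
        next_runs = []
        for i in range(0, len(run_paths), fan_in):
            group = run_paths[i:i + fan_in]
            files = [open(p) for p in group]
            fd, merged_path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as out:
                # heapq.merge performs a streaming k-way merge of the sorted runs,
                # holding only one record per run in Python's own memory.
                out.writelines(heapq.merge(*files))
            for f in files:
                f.close()
            for p in group:
                os.remove(p)
            next_runs.append(merged_path)
        run_paths = next_runs
    shutil.move(run_paths[0], output_path)


def external_sort(input_path, output_path, max_records_in_memory=1_000_000, fan_in=25):
    """One sorting pass plus one or more merge passes, depending on the number of runs."""
    runs = create_sorted_runs(input_path, max_records_in_memory)
    merge_in_passes(runs, output_path, fan_in)
</syntaxhighlight>

With 500 initial runs and a fan-in of 25, this performs exactly the two merge passes of the example above; raising the fan-in to 500 or more collapses it to a single merge pass.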
Under reasonable assumptions, at least 500 GB of data stored on a hard drive can be sorted using 1 GB of main memory before a third pass becomes advantageous, and many times that much data can be sorted before a fourth pass becomes useful.<ref>For an example, assume 500 GB of data to sort, 1 GB of buffer memory, and a single disk with a 200 MB/s transfer rate and a 20 ms seek time. A single 500-way merging phase would use buffers of 2 MB each and need to do 250,000 seeks while reading and then writing 500 GB, spending 5,000 seconds seeking and 5,000 seconds transferring. Doing two merge passes as described above would nearly eliminate the seek time but add another 5,000 seconds of data transfer time, so this is approximately the break-even point between a two-pass and a three-pass sort.</ref>

Main memory size is important. Doubling the memory dedicated to sorting halves the number of chunks ''and'' the number of reads per chunk, reducing the number of seeks required by about three-quarters. The ratio of RAM to disk storage on servers often makes it convenient to do huge sorts on a cluster of machines<ref>Chris Nyberg, Mehul Shah, [http://sortbenchmark.org/ Sort Benchmark Home Page] (links to examples of parallel sorts)</ref> rather than on one machine with multiple passes. Media with high random-read performance, such as [[solid-state drive]]s (SSDs), also increase the amount that can be sorted before additional passes improve performance.
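The break-even estimate in the footnote can be reproduced with a short calculation. The model below is a simplification under the footnote's assumptions: only buffered reads incur seeks, every merge pass reads and writes the full dataset once, and decimal megabytes and gigabytes are used; the function name and run counts are illustrative.

<syntaxhighlight lang="python">
def merge_pass_seconds(data_bytes, memory_bytes, runs, seek_s, bytes_per_s):
    """Estimated time for one k-way merge pass: seek time for buffered reads
    plus transfer time for reading and writing every byte once."""
    buffer_bytes = memory_bytes / runs   # memory split into one buffer per run
    seeks = data_bytes / buffer_bytes    # one seek per buffered read
    return seeks * seek_s + 2 * data_bytes / bytes_per_s


GB, MB = 1_000_000_000, 1_000_000
data, memory = 500 * GB, 1 * GB          # dataset and sort memory from the footnote
seek, rate = 0.020, 200 * MB             # 20 ms per seek, 200 MB/s transfer

# One 500-way merge pass: 2 MB buffers, 250,000 seeks -> 5,000 s seeking + 5,000 s transferring.
one_pass = merge_pass_seconds(data, memory, runs=500, seek_s=seek, bytes_per_s=rate)

# Two merge passes (23-way then 22-way, enough for 500 initial runs since 23 * 22 >= 500):
# buffers of roughly 45 MB make seek time small, but every byte is read and written twice,
# costing about 10,000 s of transfer alone.
two_pass = (merge_pass_seconds(data, memory, runs=23, seek_s=seek, bytes_per_s=rate)
            + merge_pass_seconds(data, memory, runs=22, seek_s=seek, bytes_per_s=rate))

print(round(one_pass), round(two_pass))  # 10000 10450 -- roughly the break-even point
</syntaxhighlight>

In the same model, doubling the memory quarters the number of seeks in a single-pass merge (half as many runs, each read through a buffer four times as large), consistent with the effect of memory size described above.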