==B-tree usage in databases==
{{Tone|section|date=May 2022}}
===Sorted file search time===
Sorting and searching algorithms can be characterized, using [[Big O notation|order notation]], by the number of comparison operations they must perform. A [[binary search]] of a sorted table with {{mvar|N}} records, for example, can be done in roughly {{math|⌈ log<sub>2</sub> ''N'' ⌉}} comparisons. If the table had 1,000,000 records, then a specific record could be located with at most 20 comparisons: {{math|1=⌈ log<sub>2</sub> (1,000,000) ⌉ = 20}}.
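The comparison bound can be checked with a short sketch. The function name <code>binary_search</code> and the sample keys are illustrative assumptions; each probe of the table is counted as one (three-way) comparison.

```python
import math

def binary_search(table, key):
    """Return (index, probes); index is None when the key is absent."""
    lo, hi = 0, len(table) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1  # one probe = one three-way comparison with table[mid]
        if table[mid] == key:
            return mid, probes
        if table[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes

N = 1_000_000
table = range(N)  # sorted keys 0..N-1; a range supports indexing like a list
bound = math.ceil(math.log2(N))  # 20 for one million records

# Sample some keys and record the largest number of probes seen.
worst_seen = max(binary_search(table, k)[1] for k in range(0, N, 9973))
```

For every sampled key, the probe count stays at or below <code>bound</code>.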

Large databases have historically been kept on disk drives. Because of [[seek time]] and rotational delay, the time to read a record from a disk drive far exceeds the time needed to compare keys once the record is available. The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages about half the rotation period. For a 7200 RPM drive, the rotation period is 8.33 milliseconds. For a drive such as the Seagate ST3500320NS, the track-to-track seek time is 0.8 milliseconds and the average reading seek time is 8.5 milliseconds.<ref>{{cite book |publisher=Seagate Technology LLC |title=Product Manual: Barracuda ES.2 Serial ATA, Rev. F., publication 100468393 |date=2008 |url=http://www.seagate.com/staticfiles/support/disc/manuals/NL35%20Series%20&%20BC%20ES%20Series/Barracuda%20ES.2%20Series/100468393f.pdf |page=6}}</ref> For simplicity, assume reading from disk takes about 10 milliseconds.
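The rotational figures follow directly from the spindle speed. A small sketch of the arithmetic, with illustrative variable names:

```python
RPM = 7200

# One full rotation, in milliseconds: 60,000 ms per minute / rotations per minute.
rotation_period_ms = 60_000 / RPM

# On average the desired sector is half a rotation away from the head.
avg_rotational_delay_ms = rotation_period_ms / 2
```

Rounded to two decimals, these come to 8.33 and 4.17 milliseconds respectively.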

Locating one record out of a million in the example above would therefore take 20 disk reads at 10 milliseconds per read, or 0.2 seconds.

The actual search time is lower, because individual records are grouped together in a disk '''block'''. A disk block might be 16 kilobytes. If each record is 160 bytes, then 100 records could be stored in each block. The disk read time above was actually for an entire block. Once the disk head is in position, one or more disk blocks can be read with little delay. With 100 records per block, the last 6 or so comparisons do not need any disk reads, because they all fall within the last disk block read.

To speed up the search further, the time to do the first 13 to 14 comparisons (each of which required a disk access) must be reduced.
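The figure of about 14 disk reads can be reproduced by treating the file as a sorted sequence of blocks and binary searching at block granularity. This is a rough estimate under the example's assumptions (100 records per block, 10 milliseconds per read); the variable names are illustrative.

```python
import math

RECORDS = 1_000_000
RECORDS_PER_BLOCK = 100  # 16-kilobyte block, 160-byte records (as in the text)
READ_MS = 10             # simplified cost of one disk read

blocks = RECORDS // RECORDS_PER_BLOCK              # 10,000 blocks
disk_reads = math.ceil(math.log2(blocks))          # reads needed to isolate one block
total_comparisons = math.ceil(math.log2(RECORDS))  # comparisons over all records
in_block_comparisons = total_comparisons - disk_reads  # done inside the last block read
search_ms = disk_reads * READ_MS
```

Under these assumptions the search costs 14 disk reads (140 milliseconds), with the remaining 6 comparisons done in memory.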

===Index performance===
A B-tree [[Database index|index]] can be used to improve performance. A B-tree index creates a multi-level tree structure that breaks a database down into fixed-size blocks or pages. Each page has an address, so one page (known as a node, or internal page) can refer to another, with leaf pages at the lowest level. One page, the "root", is typically the starting point of the tree. The search for a particular key begins there and follows a path that terminates in a leaf. Most pages in this structure are leaf pages, which refer to specific table rows.

Because each node (or internal page) can have more than two children, a B-tree index will usually have a shorter height (the distance from the root to the farthest leaf) than a binary search tree. In the example above, each initial disk read narrowed the search range by a factor of two. That can be improved by creating an auxiliary index that contains the first record in each disk block (sometimes called a [[Database index#Sparse index|sparse index]]). This auxiliary index would be 1% of the size of the original database, but it can be searched quickly. Finding an entry in the auxiliary index identifies which block of the main database to search; after searching the auxiliary index, only that one block of the main database has to be searched, at a cost of one more disk read.

In the above example, the index would hold 10,000 entries and would take at most 14 comparisons to return a result. As with the main database, the last 6 or so comparisons in the auxiliary index would be on the same disk block. The index could be searched in about 8 disk reads, and the desired record could be accessed in 9 disk reads.
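A sparse index lookup of this kind can be sketched in a few lines. The data (one million even integers as keys), the in-memory lists standing in for disk blocks, and the function name <code>lookup</code> are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

RECORDS_PER_BLOCK = 100

# Hypothetical main file: 1,000,000 sorted keys, cut into blocks of 100.
keys = list(range(0, 2_000_000, 2))
blocks = [keys[i:i + RECORDS_PER_BLOCK]
          for i in range(0, len(keys), RECORDS_PER_BLOCK)]

# Sparse index: the first key of each block (10,000 entries).
sparse_index = [block[0] for block in blocks]

def lookup(key):
    """Return (block_no, slot) for key, or None if the key is absent."""
    # Last block whose first key is <= key.
    block_no = bisect_right(sparse_index, key) - 1
    if block_no < 0:
        return None  # key is smaller than every key in the file
    block = blocks[block_no]  # stands in for the single read of the main file
    slot = bisect_left(block, key)
    if slot < len(block) and block[slot] == key:
        return block_no, slot
    return None
```

Searching the index picks out exactly one block of the main file, so only that block has to be fetched and searched.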

The process can be repeated to build an auxiliary index to the auxiliary index. This aux-aux index would need only 100 entries and would fit in one disk block.

Instead of reading 14 disk blocks to find the desired record, we only need to read 3 blocks. This blocking is the core idea behind the B-tree, where the disk blocks fill out a hierarchy of levels to make up the index. Reading and searching the first (and only) block of the aux-aux index, which is the root of the tree, identifies the relevant block of the aux index in the level below. Reading and searching that aux-index block identifies the relevant block to read next, and so on until the final level, known as the leaf level, identifies a record in the main database. Instead of about 140 milliseconds (14 reads at 10 milliseconds each), we need only 30 milliseconds to get the record.

The auxiliary indices have turned the search problem from a binary search requiring roughly {{math|log<sub>2</sub> ''N''}} disk reads to one requiring only {{math|log<sub>''b''</sub> ''N''}} disk reads where {{mvar|b}} is the blocking factor (the number of entries per block: {{math|1=''b'' = 100}} entries per block in our example; {{math|1=log<sub>100</sub> 1,000,000 = 3}} reads).
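The number of levels can be derived by repeatedly dividing by the blocking factor until a level fits in one block. The helper <code>index_levels</code> below is a hypothetical name used for illustration; it assumes every block is completely full.

```python
import math

def index_levels(n_records, b):
    """Sizes (in entries) of each level, from the main file up to the root."""
    sizes = [n_records]
    while sizes[-1] > b:  # keep adding index levels until one block suffices
        sizes.append(math.ceil(sizes[-1] / b))
    return sizes

levels = index_levels(1_000_000, 100)  # main file, aux index, aux-aux index
reads = len(levels)                    # one block read per level
search_ms = reads * 10                 # at 10 ms per read
```

For the running example this gives levels of 1,000,000, 10,000 and 100 entries, so a lookup touches one block in each of the three levels.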

In practice, if the main database is being frequently searched, the aux-aux index and much of the aux index may reside in a [[page cache|disk cache]], so they would not incur a disk read. The B-tree remains the standard index implementation in almost all [[relational database]]s, and many nonrelational databases use it too.<ref name="kleppmann_2017">{{cite book| last=Kleppmann| first=Martin| date=2017| title=Designing Data-Intensive Applications| place=[[Sebastopol, California]]| publisher=[[O'Reilly Media]]| pages=80| isbn=978-1-449-37332-0| url=https://www.academia.edu/41298363}}</ref>

===Insertions and deletions===
If the [[database]] does not change, then compiling the index is simple, and the index never needs to be changed. If there are changes, managing the database and its index requires additional computation.

Deleting records from a database is relatively easy. The index can stay the same, and the record can just be marked as deleted. The database remains in sorted order. If there are a large number of [[lazy deletion]]s, then searching and storage become less efficient.<ref>
Jan Jannink.
"Implementing Deletion in B+-Trees".
Section [https://www.alexdelis.eu/M149/B+deletion.pdf "4 Lazy Deletion"].
</ref>

Insertions can be very slow in a sorted sequential file because room for the inserted record must be made. Inserting a record before the first record requires shifting all of the records down one. Such an operation is too expensive to be practical. One solution is to leave some free space. Instead of densely packing all the records in a block, the block can have some free space to allow for subsequent insertions. Those spaces would be marked as if they were "deleted" records.

Both insertions and deletions are fast as long as space is available on a block. If an insertion does not fit on the block, then some free space on some nearby block must be found and the auxiliary indices adjusted. The best case is that enough space is available nearby so that the amount of block reorganization can be minimized. Alternatively, some out-of-sequence disk blocks may be used.<ref name="kleppmann_2017" />
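The effect of free space on insertion cost can be sketched as follows. This is a simplified model, not a full implementation: blocks are in-memory sorted lists, <code>CAPACITY</code> is a deliberately tiny hypothetical block size, and an overflowing block is handled by splitting it in half (the approach B-trees take) rather than by borrowing space from a neighbour.

```python
from bisect import bisect_right, insort

CAPACITY = 4  # hypothetical maximum number of records per block

def insert(blocks, key):
    """Insert key into a non-empty list of sorted, non-empty blocks.

    Returns True if a block had to be split, in which case any
    index over the blocks would need to be adjusted.
    """
    firsts = [block[0] for block in blocks]    # plays the role of the sparse index
    i = max(bisect_right(firsts, key) - 1, 0)  # block that should hold the key
    insort(blocks[i], key)                     # cheap while the block has room
    if len(blocks[i]) <= CAPACITY:
        return False
    mid = len(blocks[i]) // 2                  # overflow: split into two halves
    blocks[i:i + 1] = [blocks[i][:mid], blocks[i][mid:]]
    return True

blocks = [[10, 20], [30, 40, 50, 60]]
split_a = insert(blocks, 25)  # fits in the free space of the first block
split_b = insert(blocks, 45)  # second block is full, so it is split
```

Only the block receiving the key is touched; the other blocks are left as they were, which is what keeps the insertion cheap.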

===Usage in databases===
The B-tree uses all of the ideas described above. In particular, a B-tree:
* keeps keys in sorted order for sequential traversal
* uses a hierarchical index to minimize the number of disk reads
* uses partially full blocks to speed up insertions and deletions
* keeps the index balanced with a recursive algorithm

In addition, a B-tree minimizes waste by making sure the interior nodes are at least half full. A B-tree can handle an arbitrary number of insertions and deletions.<ref name="kleppmann_2017" />
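How the sorted keys and the hierarchical index work together during a lookup can be sketched with a minimal in-memory node type. The <code>Node</code> class, the <code>search</code> function and the tiny hand-built tree are assumptions made for illustration; insertion, deletion and rebalancing are omitted.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list                                    # sorted keys in this node
    children: list = field(default_factory=list)  # empty for a leaf

def search(node, key):
    """Return (found, nodes_visited); each visit models one block read."""
    visited = 0
    while True:
        visited += 1
        i = bisect_left(node.keys, key)  # binary search within the node
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        if not node.children:            # reached a leaf without a match
            return False, visited
        node = node.children[i]          # descend into the i-th subtree

# A hand-built two-level tree: the root separates three leaves.
root = Node([20, 40], [Node([5, 10]), Node([25, 30]), Node([45, 50])])
```

Here <code>search(root, 30)</code> visits two nodes, one per level, which corresponds to one block read per level of the index.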