
== Algorithm ==

=== Data layout ===
Data in R-trees is organized in pages that can hold a variable number of entries (up to some pre-defined maximum, and usually above a minimum fill). Each entry within a non-[[leaf node]] stores two pieces of data: a way of identifying a [[child node]], and the [[bounding box]] of all entries within this child node. Leaf nodes store the data required for each child, often a point or bounding box representing the child and an external identifier for the child. For point data, the leaf entries can be just the points themselves. For polygon data (which often requires the storage of large polygons), the common setup is to store only the MBR (minimum bounding rectangle) of the polygon along with a unique identifier in the tree.
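The layout described above could be sketched as follows. This is a minimal two-dimensional illustration, not a reference implementation; the names <code>Rect</code>, <code>Entry</code>, <code>Node</code> and <code>union</code> are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical rectangle type: (min_x, min_y, max_x, max_y), 2-D only for brevity.
Rect = Tuple[float, float, float, float]


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


@dataclass
class Entry:
    mbr: Rect                         # bounding box of the child node or object
    child: Optional["Node"] = None    # set in directory (non-leaf) entries
    object_id: Optional[int] = None   # set in leaf entries: external identifier


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)

    def mbr(self) -> Rect:
        """Bounding box of all entries stored in this page (page must be non-empty)."""
        box = self.entries[0].mbr
        for entry in self.entries[1:]:
            box = union(box, entry.mbr)
        return box


# A leaf page with two objects, referenced from a directory page.
leaf = Node(is_leaf=True, entries=[
    Entry((0, 0, 1, 1), object_id=7),
    Entry((2, 1, 3, 4), object_id=8),
])
root = Node(is_leaf=False, entries=[Entry(leaf.mbr(), child=leaf)])
```

The directory entry carries only the child's bounding box and a reference to the child, while the leaf entries carry the object's box and an identifier that points outside the tree.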

=== Search ===

The search process in an R-tree embodies a two-phase approach that aligns with the [[Filter and refine| Filter and Refine Principle (FRP)]]. In this structure, the internal nodes serve as an initial filter by quickly excluding regions of space that do not intersect the query, while the leaf nodes provide a refined, precise evaluation by storing the actual spatial objects.

Specifically, in [[range searching]], the input is a search rectangle (query box). Searching is quite similar to searching in a [[B+ tree]]. The search starts from the root node of the tree. Every internal node contains a set of rectangles and pointers to the corresponding child nodes, and every leaf node contains the rectangles of spatial objects (possibly together with a pointer to the spatial object itself). For every rectangle in a node, it has to be decided whether it overlaps the search rectangle. If it does, the corresponding child node has to be searched as well. Searching proceeds recursively in this manner until all overlapping nodes have been traversed. When a leaf node is reached, the contained bounding boxes (rectangles) are tested against the search rectangle, and their objects (if there are any) are put into the result set if they lie within the search rectangle.
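The recursive range search could be sketched as below. This is an illustrative simplification: a node is represented as a plain dictionary with a <code>"leaf"</code> flag and a list of <code>(rectangle, payload)</code> pairs, which is an assumption made for the example rather than a prescribed layout.

```python
# Rectangles are (min_x, min_y, max_x, max_y) tuples.

def intersects(a, b):
    """True if rectangles a and b overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(outer, inner):
    """True if inner lies completely within outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def range_search(node, query, out=None):
    """Collect payloads of leaf entries lying within the query rectangle."""
    if out is None:
        out = []
    for rect, payload in node["entries"]:
        if node["leaf"]:
            # Leaf level: report objects within the query box, as in the text.
            # Many implementations test for intersection here instead.
            if contains(query, rect):
                out.append(payload)
        elif intersects(rect, query):
            # Directory level: descend only into overlapping children.
            range_search(payload, query, out)
    return out


leaf_a = {"leaf": True, "entries": [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")]}
leaf_b = {"leaf": True, "entries": [((8, 8, 9, 9), "c"), ((5, 5, 6, 6), "d")]}
root = {"leaf": False, "entries": [((0, 0, 3, 3), leaf_a), ((5, 5, 9, 9), leaf_b)]}

hits = range_search(root, (1.5, 1.5, 6.5, 6.5))
```

Both child pages overlap the query box in this example, so both are visited, but only the objects <code>"b"</code> and <code>"d"</code> lie within it.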

For priority search such as [[nearest neighbor search]], the query consists of a point or rectangle. The root node is inserted into a priority queue. The search then repeatedly processes the nearest entry in the queue until the queue is empty or the desired number of results has been returned. Tree nodes are expanded and their children are inserted into the queue; leaf entries are returned when they are encountered in the queue.<ref>{{Cite conference | doi = 10.1109/ICICS.1997.652114| chapter = Fast k nearest neighbour search for R-tree family| title = Proceedings of ICICS, 1997 International Conference on Information, Communications and Signal Processing. Theme: Trends in Information Systems Engineering and Wireless Multimedia Communications (Cat. No.97TH8237)| pages = 924| year = 1997| last1 = Kuan | first1 = J.| last2 = Lewis | first2 = P.| isbn = 0-7803-3676-3}}</ref> This approach can be used with various distance metrics, including [[great-circle distance]] for geographic data.<ref name=geodetic/>
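A best-first search of this kind could be sketched with a binary heap as the priority queue. The dictionary-based node layout, the Euclidean <code>min_dist</code> helper and the function name <code>nearest</code> are assumptions made for this illustration.

```python
import heapq
import itertools
import math


def min_dist(point, rect):
    """Smallest Euclidean distance from a point to a rectangle (0 if inside)."""
    dx = max(rect[0] - point[0], 0.0, point[0] - rect[2])
    dy = max(rect[1] - point[1], 0.0, point[1] - rect[3])
    return math.hypot(dx, dy)


def nearest(root, point, k):
    """Return up to k (payload, distance) pairs in order of increasing distance."""
    counter = itertools.count()  # tie-breaker so the heap never compares nodes
    queue = [(0.0, next(counter), False, root)]
    results = []
    while queue and len(results) < k:
        dist, _, is_object, item = heapq.heappop(queue)
        if is_object:
            # A leaf entry reached the front of the queue: nothing unexplored
            # can be closer, so it can be returned.
            results.append((item, dist))
            continue
        # Expand the tree node and queue its children by their minimum distance.
        for rect, payload in item["entries"]:
            heapq.heappush(
                queue, (min_dist(point, rect), next(counter), item["leaf"], payload))
    return results


leaf_a = {"leaf": True, "entries": [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")]}
leaf_b = {"leaf": True, "entries": [((8, 8, 9, 9), "c"), ((5, 5, 6, 6), "d")]}
root = {"leaf": False, "entries": [((0, 0, 3, 3), leaf_a), ((5, 5, 9, 9), leaf_b)]}

two_nearest = nearest(root, (4, 4), 2)
```

Swapping <code>min_dist</code> for another lower-bounding distance function is what allows the same loop to be used with other metrics.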

=== Insertion ===
To insert an object, the tree is traversed recursively from the root node. At each step, all rectangles in the current directory node are examined, and a candidate is chosen using a heuristic such as choosing the rectangle which requires least enlargement. The search then descends into this page, until reaching a leaf node. If the leaf node is full, it must be split before the insertion is made. Again, since an exhaustive search is too expensive, a heuristic is employed to split the node into two. Adding the newly created node to the previous level, this level can again overflow, and these overflows can propagate up to the root node; when this node also overflows, a new root node is created and the tree has increased in height.

==== Choosing the insertion subtree ====
The algorithm needs to decide in which subtree to insert. When a data object is fully contained in a single rectangle, the choice is clear. When there are multiple options or rectangles in need of enlargement, the choice can have a significant impact on the performance of the tree.

In the classic R-tree, objects are inserted into the subtree that needs the least enlargement. In the more advanced [[R*-tree]], a mixed heuristic is employed. At the leaf level, it tries to minimize the overlap (in case of ties, preferring least enlargement and then least area); at the higher levels, it behaves similarly to the R-tree, but on ties again prefers the subtree with smaller area. The decreased overlap of rectangles in the R*-tree is one of its key benefits over the traditional R-tree.
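The least-enlargement rule could be sketched as follows for two-dimensional rectangles. The helper names are assumptions for the example, and ties are broken by smaller area; the overlap-minimizing variant is not shown.

```python
# Rectangles are (min_x, min_y, max_x, max_y) tuples.

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a, b):
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(rect, new):
    """Extra area rect needs in order to also cover new."""
    return area(union(rect, new)) - area(rect)


def choose_subtree(rects, new):
    """Index of the directory rectangle needing least enlargement.

    Ties are resolved in favour of the rectangle with the smaller area.
    """
    return min(range(len(rects)),
               key=lambda i: (enlargement(rects[i], new), area(rects[i])))


directory = [(0, 0, 4, 4), (5, 0, 7, 2)]
chosen = choose_subtree(directory, (6, 1, 8, 3))
```

Here the first rectangle would have to grow by 16 area units and the second by only 5, so the second subtree is chosen.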

==== Splitting an overflowing node ====
Since redistributing all objects of a node into two nodes has an exponential number of options, a heuristic needs to be employed to find a good split. In the classic R-tree, Guttman proposed two such heuristics, called QuadraticSplit and LinearSplit. In quadratic split, the algorithm searches for the pair of rectangles that is the worst combination to have in the same node, and puts them as initial objects into the two new groups. It then repeatedly searches for the entry which has the strongest preference for one of the groups (in terms of area increase) and assigns it to this group, until all objects are assigned (while satisfying the minimum fill).
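The quadratic split described above could be sketched as below. This is a simplified illustration operating on bare rectangles; the function names and the tie-breaking (an entry with equal growth goes to the first group) are assumptions of the example.

```python
from itertools import combinations

# Rectangles are (min_x, min_y, max_x, max_y) tuples.

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def quadratic_split(rects, min_fill):
    """Split a list of at least two rectangles into two groups."""
    def waste(pair):
        # Dead space created by putting both rectangles into one node.
        a, b = rects[pair[0]], rects[pair[1]]
        return area(union(a, b)) - area(a) - area(b)

    # Seeds: the pair that would be the worst combination in the same node.
    i, j = max(combinations(range(len(rects)), 2), key=waste)
    groups = [[rects[i]], [rects[j]]]
    boxes = [rects[i], rects[j]]
    rest = [r for k, r in enumerate(rects) if k not in (i, j)]

    def growth(r, g):
        return area(union(boxes[g], r)) - area(boxes[g])

    def assign(r, g):
        groups[g].append(r)
        boxes[g] = union(boxes[g], r)

    while rest:
        # If a group needs every remaining entry to reach the minimum fill,
        # give it all of them and stop.
        short = [g for g in (0, 1) if len(groups[g]) + len(rest) <= min_fill]
        if short:
            for r in rest:
                assign(r, short[0])
            break
        # Otherwise take the entry with the strongest preference for one group.
        nxt = max(rest, key=lambda r: abs(growth(r, 0) - growth(r, 1)))
        rest.remove(nxt)
        assign(nxt, 0 if growth(nxt, 0) <= growth(nxt, 1) else 1)
    return groups


rects = [(0, 0, 1, 1), (1, 1, 2, 2), (8, 8, 9, 9), (9, 9, 10, 10), (0, 1, 1, 2)]
left, right = quadratic_split(rects, min_fill=2)
```

Picking the seeds examines every pair of entries, which is where the quadratic cost in the node size comes from.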

There are other splitting strategies such as Greene's Split,<ref name="greene">{{Cite book | last1 = Greene | first1 = D. | chapter = An implementation and performance analysis of spatial data access methods | doi = 10.1109/ICDE.1989.47268 | title = [1989] Proceedings. Fifth International Conference on Data Engineering | pages = 606–615 | year = 1989 | isbn = 978-0-8186-1915-1 | s2cid = 7957624 }}</ref> the [[R*-tree]] splitting heuristic<ref name="rstar">{{Cite book | last1 = Beckmann | first1 = N. | last2 = Kriegel | first2 = H. P. | author-link2 = Hans-Peter Kriegel| last3 = Schneider | first3 = R. | last4 = Seeger | first4 = B. | chapter = The R*-tree: an efficient and robust access method for points and rectangles | doi = 10.1145/93597.98741 | title = Proceedings of the 1990 ACM SIGMOD international conference on Management of data – SIGMOD '90 | pages = 322 | year = 1990 | isbn = 978-0897913652 | chapter-url = http://dbs.mathematik.uni-marburg.de/publications/myPapers/1990/BKSS90.pdf| citeseerx = 10.1.1.129.3731 | s2cid = 11567855 }}</ref> (which again tries to minimize overlap, but also prefers quadratic pages) or the linear split algorithm proposed by Ang and Tan<ref name="ang-tan">{{cite conference | last1= Ang | first1 = C. H. | last2 = Tan | first2 = T. C. | editor1-first = Michel | editor1-last = Scholl | editor2-first = Agnès | editor2-last = Voisard | year = 1997 | title = New linear node splitting algorithm for R-trees | book-title = Proceedings of the 5th International Symposium on Advances in Spatial Databases (SSD '97), Berlin, Germany, July 15–18, 1997 | volume = 1262 |series=Lecture Notes in Computer Science | publisher=Springer | pages = 337–349 | doi = 10.1007/3-540-63238-7_38}}</ref> (which however can produce very irregular rectangles, which are less performant for many real world range and window queries). 
In addition to having a more advanced splitting heuristic, the [[R*-tree]] also tries to avoid splitting a node by reinserting some of the node members, which is similar to the way a [[B-tree]] balances overflowing nodes. This was shown to also reduce overlap and thus increase tree performance.

Finally, the [[X-tree]]<ref name="xtree2">{{Cite journal| first1 = Stefan | last1 = Berchtold| first2 = Daniel A. | last2 = Keim| first3 = Hans-Peter | last3 = Kriegel| author3-link = Hans-Peter Kriegel| title = The X-Tree: An Index Structure for High-Dimensional Data| journal = Proceedings of the 22nd VLDB Conference| place = Mumbai, India| year = 1996| pages = 28–39| url = http://www.dbs.ifi.lmu.de/Publikationen/Papers/x-tree.ps}}</ref> can be seen as an R*-tree variant that can also decide not to split a node, but instead to construct a so-called super-node containing all the extra entries, when it does not find a good split (in particular for high-dimensional data).

{{Gallery
|title=Effect of different splitting heuristics on a database with US postal districts
|width=300 | height=300 | align=center |File:R-tree_with_Guttman's_quadratic_split.png|Guttman's quadratic split.<ref name="guttman" /><br />Pages in this tree overlap a lot.
|File:R-tree built with Guttman's linear split.png|Guttman's linear split.<ref name="guttman" /><br />Even worse structure, but also faster to construct.
|File:R-tree built with Greenes Split.png|Greene's split.<ref name="greene" /> Pages overlap much less than with Guttman's strategy.
|File:R-tree built with Ang-Tan linear split.png|Ang-Tan linear split.<ref name="ang-tan" /><br />This strategy produces sliced pages, which often yield bad query performance.
|File:R*-tree built using topological split.png |[[R* tree]] topological split.<ref name="rstar" /><br /> The pages overlap much less since the R*-tree tries to minimize page overlap, and the reinsertions further optimized the tree. The split strategy prefers quadratic pages, which yields better performance for common map applications.
|File:R*-tree bulk loaded with sort-tile-recursive.png|Bulk loaded [[R* tree]] using Sort-Tile-Recursive (STR).<br />The leaf pages do not overlap at all, and the directory pages overlap only little. This is a very efficient tree, but it requires the data to be completely known beforehand.
|File:M-tree built with MMRad split.png|[[M-tree]]s are similar to the R-tree, but use nested spherical pages.<br />Splitting these pages is, however, much more complicated and pages usually overlap much more.
}}

=== Deletion ===
Deleting an entry from a page may require updating the bounding rectangles of parent pages. However, when a page becomes underfull, it is not balanced with its neighbors. Instead, the page is dissolved and all its children (which may be subtrees, not only leaf objects) are reinserted. If during this process the root node is left with a single element, the tree height can decrease.

{{Expand section|date=October 2011}}

=== Bulk-loading ===
When the data is completely known beforehand, the tree can be built by bulk-loading instead of inserting objects one at a time. Approaches include:

* Nearest-X: Objects are sorted by their first coordinate ("X") and then split into pages of the desired size.
* Packed [[Hilbert R-tree]]: variation of Nearest-X, but sorting using the Hilbert value of the center of a rectangle instead of using the X coordinate. There is no guarantee the pages will not overlap.
* Sort-Tile-Recursive (STR):<ref>{{cite web | first1 = Scott T. | last1 = Leutenegger | first2 = Jeffrey M. | last2 = Edgington | first3 = Mario A. | last3 = Lopez | url = https://archive.org/details/nasa_techdoc_19970016975 | title = STR: A Simple and Efficient Algorithm for R-Tree Packing |date=February 1997}}</ref> Another variation of Nearest-X that estimates the total number of leaves required as <math>l=\lceil \text{number of objects} / \text{capacity}\rceil</math> and the split factor required in each dimension to achieve this as <math>s=\lceil l^{1/d}\rceil</math>, then splits each dimension in turn into <math>s</math> equal-sized partitions using 1-dimensional sorting. If the resulting pages occupy more than one page, they are again bulk-loaded using the same algorithm. For point data, the leaf nodes will not overlap, and "tile" the data space into approximately equal-sized pages.
* Overlap Minimizing Top-down (OMT):<ref>{{cite web | first1 = Taewon | last1 = Lee | first2 = Sukho | last2 = Lee | url = http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-74/files/FORUM_18.pdf | title =  OMT: Overlap Minimizing Top-down Bulk Loading Algorithm for R-tree | date=June 2003}}</ref> Improvement over STR using a top-down approach which minimizes overlaps between slices and improves query performance.
* [[Priority R-tree]]

{{Expand section|date=June 2008}}