
== R-tree idea ==

The key idea of the data structure is to group nearby objects and represent them with their [[minimum bounding rectangle]] in the next higher level of the tree; the "R" in R-tree is for rectangle. Since all objects lie within this bounding rectangle, a query that does not intersect the bounding rectangle also cannot intersect any of the contained objects. At the leaf level, each rectangle describes a single object; at higher levels the aggregation includes an increasing number of objects. This can also be seen as an increasingly coarse approximation of the data set.
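
The pruning argument above can be illustrated with a minimal two-dimensional sketch. The names <code>Rect</code> and <code>mbr</code> and the sample coordinates are assumptions made for this example and do not come from any particular R-tree library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle (hypothetical helper type for this sketch)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Rect") -> bool:
        # Two rectangles intersect if they overlap on both axes.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


def mbr(rects):
    """Minimum bounding rectangle of a non-empty collection of rectangles."""
    rects = list(rects)
    return Rect(min(r.xmin for r in rects), min(r.ymin for r in rects),
                max(r.xmax for r in rects), max(r.ymax for r in rects))


# Three nearby objects, grouped under one bounding rectangle.
objects = [Rect(0, 0, 1, 1), Rect(2, 1, 3, 4), Rect(1, 2, 2, 3)]
group = mbr(objects)

# A query that misses the bounding rectangle cannot hit any contained object,
# so the individual objects never need to be examined.
query = Rect(5, 5, 6, 6)
pruned = not group.intersects(query)
```

The converse does not hold: a query that intersects the bounding rectangle may still miss every object inside it, which is why the rectangle is only a coarse approximation.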

Similar to the [[B-tree]], the R-tree is also a balanced search tree (so all leaf nodes are at the same depth), organizes the data in pages, and is designed for storage on disk (as used in [[database]]s). Each page can contain a maximum number of entries, often denoted as <math>M</math>. It also guarantees a minimum fill (except for the root node); however, the best performance has been observed with a minimum fill of 30%–40% of the maximum number of entries (B-trees guarantee 50% page fill, and [[B*-tree]]s even 66%). The reason for this is the more complex balancing required for spatial data as opposed to the linear data stored in B-trees.
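
As a rough worked example of these page parameters, the sketch below derives <math>M</math> and a 30%–40% minimum fill from a page size. The 4096-byte page, the 40-byte entry (four 8-byte coordinates plus an 8-byte pointer), and the omission of any page header are assumptions chosen for illustration only.

```python
# Assumed sizes for illustration; real systems differ.
PAGE_SIZE_BYTES = 4096
ENTRY_SIZE_BYTES = 4 * 8 + 8  # four float64 coordinates + one 8-byte pointer

# Maximum number of entries per page (page header overhead ignored).
M = PAGE_SIZE_BYTES // ENTRY_SIZE_BYTES

# Minimum fill at 30% and 40% of M, rounded down with integer arithmetic.
m_low = M * 30 // 100
m_high = M * 40 // 100
```

Under these assumptions a page holds up to 102 entries and a non-root page would be kept at no fewer than about 30 to 40 entries.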

As with most trees, the searching algorithms (e.g., [[intersection (set theory)|intersection]], containment, [[nearest neighbor search]]) are rather simple. The key idea is to use the bounding boxes to decide whether or not to search inside a subtree. In this way, most of the nodes in the tree are never read during a search. Like B-trees, R-trees are suitable for large data sets and [[database]]s, where the whole tree cannot be kept in main memory and nodes are paged into memory when needed. Even if the data fits in memory (or is cached), R-trees in most practical applications will usually provide performance advantages over a naive check of all objects once the number of objects exceeds a few hundred or so. However, for in-memory applications, there are similar alternatives that can provide slightly better performance or be simpler to implement in practice. {{citation needed|reason=Not only would it be nice to provide proof, but also, for readers interested in this area, it would be useful to see the more memory-suited methods|date=October 2023}} To maintain in-memory computing for R-trees in a computer cluster where computing nodes are connected by a network, researchers have used RDMA ([[Remote direct memory access|Remote Direct Memory Access]]) to implement data-intensive applications using R-trees in a distributed environment.<ref>{{Cite conference |author= Mengbai Xiao, Hao Wang, Liang Geng, Rubao Lee, and Xiaodong Zhang| year=2022|pages=1–26|title=" An RDMA-enabled In-memory Computing Platform for R-tree on Clusters" |conference= ACM Transactions on Spatial Algorithms and Systems|doi=10.1145/3503513 |doi-access=free}}</ref> This approach scales to increasingly large applications and achieves high throughput and low latency for R-tree operations.
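
The following is a minimal sketch of an intersection (window) query that descends only into subtrees whose bounding box meets the query. The <code>Node</code> layout, the tuple representation <code>(xmin, ymin, xmax, ymax)</code>, and the hand-built two-leaf tree are assumptions for this example, not a definitive implementation.

```python
from dataclasses import dataclass, field


def intersects(a, b):
    """Boxes are (xmin, ymin, xmax, ymax) tuples; True if they overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    box: tuple                                    # bounding box of the subtree
    children: list = field(default_factory=list)  # child Nodes (inner node)
    entries: list = field(default_factory=list)   # (box, object_id) pairs (leaf)


def search(node, query, result):
    """Collect the ids of all objects whose box intersects the query box."""
    for box, obj in node.entries:
        if intersects(box, query):
            result.append(obj)
    for child in node.children:
        # Prune: a subtree whose bounding box misses the query is never read.
        if intersects(child.box, query):
            search(child, query, result)
    return result


# Hypothetical tree with two leaf pages under one root.
leaf_a = Node(box=(0, 0, 2, 2),
              entries=[((0, 0, 1, 1), "a"), ((1, 1, 2, 2), "b")])
leaf_b = Node(box=(5, 5, 8, 8),
              entries=[((5, 5, 6, 6), "c"), ((7, 7, 8, 8), "d")])
root = Node(box=(0, 0, 8, 8), children=[leaf_a, leaf_b])

hits = search(root, (0.5, 0.5, 1.5, 1.5), [])
```

In this example the query touches only <code>leaf_a</code>; the page <code>leaf_b</code> is skipped without its entries being inspected.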

The key difficulty of R-trees is to build an efficient tree that, on the one hand, is balanced (so the leaf nodes are at the same height) and, on the other hand, has rectangles that do not cover too much empty space and do not overlap too much (so that during search, fewer subtrees need to be processed). For example, the original idea for inserting elements to obtain an efficient tree is to always insert into the subtree that requires the least enlargement of its bounding box. Once that page is full, the data is split into two sets that should each cover a minimal area. Most of the research and improvements for R-trees aim at improving the way the tree is built and can be grouped into two objectives: building an efficient tree from scratch (known as bulk-loading) and performing changes on an existing tree (insertion and deletion).
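
The least-enlargement rule for choosing an insertion subtree can be sketched as follows. The helper names and sample boxes are assumptions for this example; breaking ties in favour of the smaller existing box is one common convention and is included here only as an illustrative choice.

```python
def area(b):
    """Area of a box given as (xmin, ymin, xmax, ymax)."""
    return (b[2] - b[0]) * (b[3] - b[1])


def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def enlargement(box, new):
    """Extra area needed for box to also cover new."""
    return area(union(box, new)) - area(box)


def choose_subtree(child_boxes, new):
    """Index of the child whose box grows least when new is added.

    Ties are broken by the smaller current area (an assumed convention).
    """
    return min(range(len(child_boxes)),
               key=lambda i: (enlargement(child_boxes[i], new),
                              area(child_boxes[i])))


child_boxes = [(0, 0, 4, 4), (5, 5, 6, 6)]
new_box = (4, 4, 5, 5)
chosen = choose_subtree(child_boxes, new_box)
```

Here the first child would grow by 9 area units and the second by 3, so the second child is selected even though the new box touches both.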

R-trees do not guarantee good [[worst-case performance]], but generally perform well with real-world data.<ref>{{Cite book | last1 = Hwang | first1 = S. | last2 = Kwon | first2 = K. | last3 = Cha | first3 = S. K. | last4 = Lee | first4 = B. S. | chapter = Performance Evaluation of Main-Memory R-tree Variants | doi = 10.1007/978-3-540-45072-6_2 | title = Advances in Spatial and Temporal Databases | series = Lecture Notes in Computer Science | volume = 2750 | pages = [https://archive.org/details/advancesinspatia0000sstd/page/10 10] | year = 2003 | isbn = 978-3-540-40535-1 | chapter-url-access = registration | chapter-url = https://archive.org/details/advancesinspatia0000sstd/page/10 }}</ref> While more of theoretical interest, the (bulk-loaded) [[Priority R-tree]] variant of the R-tree is worst-case optimal,<ref name="prtree">{{Cite book | last1 = Arge | first1 = L. | author1-link = Lars Arge| last2 = De Berg | first2 = M. | last3 = Haverkort | first3 = H. J. | last4 = Yi | first4 = K. | chapter = The Priority R-tree | doi = 10.1145/1007568.1007608 | title = Proceedings of the 2004 ACM SIGMOD international conference on Management of data – SIGMOD '04 | pages = 347 | year = 2004 | isbn = 978-1581138597 | s2cid = 6817500 | url = http://doi.acm.org/10.1145/1007568.1007608 | chapter-url = http://www.win.tue.nl/~mdberg/Papers/prtree.pdf}}</ref> but due to the increased complexity, has not received much attention in practical applications so far.

When data is organized in an R-tree, the neighbors within a given distance ''r'' and the [[k nearest neighbors]] (for any [[Lp space|L<sup>p</sup>-norm]]) of all points can efficiently be computed using a spatial join.<ref>{{Cite journal | doi = 10.1145/170036.170075 | title = Efficient processing of spatial joins using R-trees | year = 1993 | last1 = Brinkhoff | first1 = T. | last2 = Kriegel | first2 = H. P. | author-link2=Hans-Peter Kriegel| last3 = Seeger | first3 = B. | journal = ACM SIGMOD Record | volume = 22 | issue = 2 | pages = 237| citeseerx = 10.1.1.72.4514 | s2cid = 7810650 }}</ref><ref>{{Cite book|last1=Böhm|first1=Christian|last2=Krebs|first2=Florian|title=Database and Expert Systems Applications |chapter=Supporting KDD Applications by the k-Nearest Neighbor Join |date=2003-09-01|series=Lecture Notes in Computer Science|volume=2736 |language=en|publisher=Springer, Berlin, Heidelberg|pages=504–516|doi=10.1007/978-3-540-45227-0_50|isbn=9783540408062|citeseerx=10.1.1.71.454}}</ref> This is beneficial for many algorithms based on such queries, for example the [[Local Outlier Factor]]. DeLi-Clu (Density-Link-Clustering)<ref>{{cite conference
 | last1 = Achtert | first1 = Elke
 | last2 = Böhm | first2 = Christian
 | last3 = Kröger | first3 = Peer
 | editor1-last = Ng | editor1-first = Wee Keong
 | editor2-last = Kitsuregawa | editor2-first = Masaru
 | editor3-last = Li | editor3-first = Jianzhong
 | editor4-last = Chang | editor4-first = Kuiyu
 | contribution = DeLi-Clu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking
 | doi = 10.1007/11731139_16
 | pages = 119–128
 | publisher = Springer
 | series = Lecture Notes in Computer Science
 | title = Advances in Knowledge Discovery and Data Mining, 10th Pacific-Asia Conference, PAKDD 2006, Singapore, April 9-12, 2006, Proceedings
 | volume = 3918
 | year = 2006}}</ref> is a [[cluster analysis]] algorithm that uses the R-tree structure for a similar kind of spatial join to efficiently compute an [[OPTICS algorithm|OPTICS]] clustering.
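
To give an intuition for why such a join benefits from bounding boxes, the sketch below computes all point pairs within Euclidean distance ''r'' on a deliberately flattened structure: a list of leaf pages, each summarized by its bounding box. This is a simplification written for illustration, not the join algorithm of the cited papers; the function names, the single-level layout, and the sample points (assumed distinct) are all assumptions.

```python
import math


def mbr(points):
    """Bounding box (xmin, ymin, xmax, ymax) of a non-empty list of 2D points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def min_dist(a, b):
    """Smallest Euclidean distance between any point of box a and box b."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def range_join(pages, r):
    """All unordered pairs of distinct points at distance <= r."""
    boxes = [mbr(p) for p in pages]
    pairs = set()
    for i in range(len(pages)):
        for j in range(i, len(pages)):
            # Prune: if the page boxes are farther apart than r,
            # no pair of their points can be within r.
            if min_dist(boxes[i], boxes[j]) > r:
                continue
            for p in pages[i]:
                for q in pages[j]:
                    if p != q and math.dist(p, q) <= r:
                        pairs.add((min(p, q), max(p, q)))
    return pairs


# Hypothetical leaf pages; the third page is far away from the other two.
pages = [[(0, 0), (1, 0)], [(1.5, 0), (2, 1)], [(10, 10), (11, 10)]]
close_pairs = range_join(pages, 1.0)
```

In this example the third page is compared only with itself, because its bounding box lies farther than ''r'' from the other two pages.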