{{Short description|Data structure for storing non-overlapping sets}}
{{Infobox data structure
| name = Disjoint-set/Union-find forest
| type = multiway tree
| invented_by = [[Bernard A. Galler]] and [[Michael J. Fischer]]
| invented_year = 1964
| space_avg = {{math|''O''(''n'')}}<ref name="tarjan1975">{{cite journal|last1=Tarjan|first1=Robert Endre|author1-link=Robert E. Tarjan|year=1975|title=Efficiency of a Good But Not Linear Set Union Algorithm|journal=Journal of the ACM|volume=22|issue=2|pages=215–225|doi=10.1145/321879.321884|hdl=1813/5942|s2cid=11105749|hdl-access=free }}</ref>
| space_worst = {{math|''O''(''n'')}}<ref name="tarjan1975"/>
| search_avg = {{math|''O''(α(''n''))}}<ref name="tarjan1975"/> (amortized)
| search_worst = {{math|''O''(α(''n''))}}<ref name="tarjan1975"/> (amortized)
| insert_avg = {{math|''O''(1)}}<ref name="tarjan1975"/>
| insert_worst = {{math|''O''(1)}}<ref name="tarjan1975"/>
}}
In [[computer science]], a '''disjoint-set data structure''', also called a '''union–find data structure''' or '''merge–find set''', is a [[data structure]] that stores a collection of [[Disjoint sets|disjoint]] (non-overlapping) [[Set (mathematics)|sets]]. Equivalently, it stores a [[partition of a set]] into disjoint [[subset]]s. It provides operations for adding new sets, merging sets (replacing them with their [[Union (set theory)|union]]), and finding a representative member of a set. The last operation makes it possible to determine efficiently whether any two elements belong to the same set or to different sets.

While there are several ways of implementing disjoint-set data structures, in practice they are often identified with a particular implementation known as a '''disjoint-set forest'''. This specialized type of [[Forest (graph theory)|forest]] performs union and find operations in near-constant [[Amortized analysis|amortized time]]. For a sequence of {{mvar|m}} addition, union, or find operations on a disjoint-set forest with {{mvar|n}} nodes, the total time required is {{math|[[Big O notation|''O'']](''m''α(''n''))}}, where {{math|α(''n'')}} is the extremely slow-growing [[inverse Ackermann function]]. Although disjoint-set forests do not guarantee this time per operation, each operation rebalances the structure (via tree compression) so that subsequent operations become faster. As a result, disjoint-set forests are both [[asymptotically optimal]] and practically efficient.

Disjoint-set data structures play a key role in [[Kruskal's algorithm]] for finding the [[minimum spanning tree]] of a graph. The importance of minimum spanning trees means that disjoint-set data structures support a wide variety of algorithms. In addition, these data structures find applications in symbolic computation and in compilers, especially for [[register allocation]] problems.
== History ==
Disjoint-set forests were first described by [[Bernard A. Galler]] and [[Michael J. Fischer]] in 1964.<ref name="Galler1964">{{cite journal|first1=Bernard A.|last1=Galler|author1-link=Bernard A. Galler|first2=Michael J.|last2=Fischer|author2-link=Michael J. Fischer|title=An improved equivalence algorithm|journal=[[Communications of the ACM]]|volume=7|issue=5|date=May 1964|pages=301–303|doi=10.1145/364099.364331|s2cid=9034016 |doi-access=free}}. The paper originating disjoint-set forests.</ref> In 1973, their time complexity was bounded by <math>O(\log^{*}(n))</math>, the [[iterated logarithm]] of <math>n</math>, by [[John Hopcroft|Hopcroft]] and [[Jeffrey Ullman|Ullman]].<ref name="Hopcroft1973">{{cite journal|last1=Hopcroft|first1=J. E.|author1-link=John Hopcroft|last2=Ullman|first2=J. D.|author2-link=Jeffrey Ullman|year=1973|title=Set Merging Algorithms|journal=SIAM Journal on Computing|volume=2|issue=4|pages=294–303|doi=10.1137/0202024}}</ref> In 1975, [[Robert Tarjan]] was the first to prove the <math>O(m\alpha(n))</math> ([[Ackermann function#Inverse|inverse Ackermann function]]) upper bound on the algorithm's time complexity, and he also proved it to be tight.<ref name="Tarjan1984">{{cite journal|first1=Robert E.|last1=Tarjan|author1-link=Robert E. Tarjan|first2=Jan|last2=van Leeuwen|author2-link=Jan van Leeuwen|title=Worst-case analysis of set union algorithms|journal=Journal of the ACM|volume=31|issue=2|pages=245–281|year=1984|doi= 10.1145/62.2160|s2cid=5363073 |doi-access=free}}</ref> In 1979, he showed that this was the lower bound for a certain class of algorithms, which includes the Galler–Fischer structure.<ref name="Tarjan1979">{{cite journal|first=Robert Endre|last=Tarjan|author-link=Robert E. Tarjan|year=1979|title=A class of algorithms which require non-linear time to maintain disjoint sets|journal=Journal of Computer and System Sciences|volume=18|issue=2|pages=110–127|doi=10.1016/0022-0000(79)90042-4|doi-access=free }}</ref> In 1989, [[Michael Fredman|Fredman]] and [[Michael Saks (mathematician)|Saks]] showed that <math>\Omega(\alpha(n))</math> (amortized) words of <math>O(\log n)</math> bits must be accessed by ''any'' disjoint-set data structure per operation,<ref name="Fredman1989">{{cite book|first1=M.|last1=Fredman|author-link=Michael Fredman|first2=M.|last2=Saks|title=Proceedings of the twenty-first annual ACM symposium on Theory of computing - STOC '89 |chapter=The cell probe complexity of dynamic data structures |pages=345–354|date=May 1989|doi=10.1145/73007.73040|isbn=0897913078|s2cid=13470414|quote=Theorem 5: Any CPROBE(log ''n'') implementation of the set union problem requires Ω(''m'' α(''m'', ''n'')) time to execute ''m'' Find's and ''n''−1 Union's, beginning with ''n'' singleton sets. |doi-access=free}}</ref> thereby proving the optimality of the data structure in this model.

In 1991, Galil and Italiano published a survey of data structures for disjoint sets.<ref name="Galil1991">{{cite journal|first1=Z.|last1=Galil|first2=G.|last2=Italiano|title=Data structures and algorithms for disjoint set union problems|journal=ACM Computing Surveys|volume=23|issue=3|pages=319–344|year=1991|doi=10.1145/116873.116878|s2cid=207160759 }}</ref>

In 1994, Richard J. Anderson and Heather Woll described a parallelized version of Union–Find that never needs to block.<ref name="Anderson1994">{{cite conference|first1=Richard J.|last1=Anderson|first2=Heather|last2=Woll|title=Wait-free Parallel Algorithms for the Union-Find Problem|conference=23rd ACM Symposium on Theory of Computing|year=1994|pages=370–380}}</ref>

In 2007, Sylvain Conchon and Jean-Christophe Filliâtre developed a semi-[[persistent data structure|persistent]] version of the disjoint-set forest data structure and formalized its correctness using the [[proof assistant]] [[Coq (software)|Coq]].<ref name="Conchon2007">{{cite conference|first1=Sylvain|last1=Conchon|first2=Jean-Christophe|last2=Filliâtre|contribution=A Persistent Union-Find Data Structure|title=ACM SIGPLAN Workshop on ML|location=Freiburg, Germany|date=October 2007|url=https://www.lri.fr/~filliatr/puf/}}</ref> "Semi-persistent" means that previous versions of the structure are efficiently retained, but accessing previous versions of the data structure invalidates later ones. Their fastest implementation achieves performance almost as efficient as the non-persistent algorithm; they do not perform a complexity analysis.

Variants of disjoint-set data structures with better performance on a restricted class of problems have also been considered. Gabow and Tarjan showed that if the possible unions are restricted in certain ways, then a truly linear time algorithm is possible.<ref>Harold N. Gabow, Robert Endre Tarjan, "A linear-time algorithm for a special case of disjoint set union," Journal of Computer and System Sciences, Volume 30, Issue 2, 1985, pp. 209–221, ISSN 0022-0000, https://doi.org/10.1016/0022-0000(85)90014-5</ref>
== Representation ==
Each node in a disjoint-set forest consists of a pointer and some auxiliary information, either a size or a rank (but not both). The pointers are used to make [[parent pointer tree]]s, where each node that is not the root of a tree points to its parent. To distinguish root nodes from others, their parent pointers have invalid values, such as a circular reference to the node or a sentinel value. Each tree represents a set stored in the forest, with the members of the set being the nodes in the tree. Root nodes provide set representatives: two nodes are in the same set if and only if the roots of the trees containing the nodes are equal.

Nodes in the forest can be stored in any way convenient to the application, but a common technique is to store them in an array. In this case, parents can be indicated by their array index. Every array entry requires {{math|Θ(log ''n'')}} bits of storage for the parent pointer. A comparable or lesser amount of storage is required for the rest of the entry, so the number of bits required to store the forest is {{math|Θ(''n'' log ''n'')}}. If an implementation uses fixed-size nodes (thereby limiting the maximum size of the forest that can be stored), then the necessary storage is linear in {{mvar|n}}.
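As a concrete illustration (a minimal sketch, not taken from the cited sources; the class name <code>DisjointSetForest</code> is chosen here for exposition), an array-based forest for elements {{math|0}} through {{math|''n'' − 1}} might be laid out in Python as follows, with a root marked by pointing to itself:

<syntaxhighlight lang="python">
# Illustrative sketch: array-based disjoint-set forest storing n elements.
# parent[i] is the parent of node i; a node is a root when parent[i] == i.
# size[i] holds the auxiliary information used later for union by size.
class DisjointSetForest:
    def __init__(self, n):
        self.parent = list(range(n))  # every node starts as its own root
        self.size = [1] * n           # every tree starts as a singleton
</syntaxhighlight>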
== Operations ==
Disjoint-set data structures support three operations: making a new set containing a new element; finding the representative of the set containing a given element; and merging two sets.

=== Making new sets ===
The <code>MakeSet</code> operation adds a new element into a new set containing only the new element, and the new set is added to the data structure. If the data structure is instead viewed as a partition of a set, then the <code>MakeSet</code> operation enlarges the set by adding the new element, and it extends the existing partition by putting the new element into a new subset containing only the new element.

In a disjoint-set forest, <code>MakeSet</code> initializes the node's parent pointer and the node's size or rank. If a root is represented by a node that points to itself, then adding an element can be described using the following pseudocode:

 '''function''' MakeSet(''x'') '''is'''
     '''if''' ''x'' is not already in the forest '''then'''
         ''x''.parent := ''x''
         ''x''.size := 1     ''// if nodes store size''
         ''x''.rank := 0     ''// if nodes store rank''
     '''end if'''
 '''end function'''

This operation has constant time complexity. In particular, initializing a disjoint-set forest with {{mvar|n}} nodes requires {{math|''O''(''n'')}} time. Lack of a parent assigned to the node implies that the node is not present in the forest.

In practice, <code>MakeSet</code> must be preceded by an operation that allocates memory to hold {{mvar|x}}. As long as memory allocation is an amortized constant-time operation, as it is for a good [[dynamic array]] implementation, it does not change the asymptotic performance of the disjoint-set forest.
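A growable variant of the array sketch above might implement <code>MakeSet</code> as an append (again an illustration; the function name <code>make_set</code> is an assumption made here):

<syntaxhighlight lang="python">
# Illustrative MakeSet for the array sketch above: appending a fresh
# singleton whose node is its own parent takes amortized constant time.
def make_set(forest):
    x = len(forest.parent)
    forest.parent.append(x)  # the new node points to itself, so it is a root
    forest.size.append(1)    # the new set contains only the new element
    return x
</syntaxhighlight>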
=== Finding set representatives ===
The <code>Find</code> operation follows the chain of parent pointers from a specified query node {{mvar|x}} until it reaches a root element. This root element represents the set to which {{mvar|x}} belongs and may be {{mvar|x}} itself. <code>Find</code> returns the root element it reaches.

Performing a <code>Find</code> operation presents an important opportunity for improving the forest. The time in a <code>Find</code> operation is spent chasing parent pointers, so a flatter tree leads to faster <code>Find</code> operations. When a <code>Find</code> is executed, there is no faster way to reach the root than by following each parent pointer in succession. However, the parent pointers visited during this search can be updated to point closer to the root. Because every element visited on the way to a root is part of the same set, this does not change the sets stored in the forest. But it makes future <code>Find</code> operations faster, not only for the nodes between the query node and the root, but also for their descendants. This updating is an important part of the disjoint-set forest's amortized performance guarantee.

There are several algorithms for <code>Find</code> that achieve the asymptotically optimal time complexity. One family of algorithms, known as '''path compression''', makes every node between the query node and the root point to the root. Path compression can be implemented using a simple recursion as follows:

 '''function''' Find(''x'') '''is'''
     '''if''' ''x''.parent ≠ ''x'' '''then'''
         ''x''.parent := Find(''x''.parent)
         '''return''' ''x''.parent
     '''else'''
         '''return''' ''x''
     '''end if'''
 '''end function'''

This implementation makes two passes, one up the tree and one back down. It requires enough scratch memory to store the path from the query node to the root (in the above pseudocode, the path is implicitly represented using the call stack). This can be decreased to a constant amount of memory by performing both passes in the same direction. The constant memory implementation walks from the query node to the root twice, once to find the root and once to update pointers:

 '''function''' Find(''x'') '''is'''
     ''root'' := ''x''
     '''while''' ''root''.parent ≠ ''root'' '''do'''
         ''root'' := ''root''.parent
     '''end while'''
     '''while''' ''x''.parent ≠ ''root'' '''do'''
         ''parent'' := ''x''.parent
         ''x''.parent := ''root''
         ''x'' := ''parent''
     '''end while'''
     '''return''' ''root''
 '''end function'''

[[Robert E. Tarjan|Tarjan]] and [[Jan van Leeuwen|Van Leeuwen]] also developed one-pass <code>Find</code> algorithms that retain the same worst-case complexity but are more efficient in practice.<ref name="Tarjan1984"/> These are called path splitting and path halving. Both of these update the parent pointers of nodes on the path between the query node and the root. '''Path splitting''' replaces every parent pointer on that path by a pointer to the node's grandparent:

 '''function''' Find(''x'') '''is'''
     '''while''' ''x''.parent ≠ ''x'' '''do'''
         (''x'', ''x''.parent) := (''x''.parent, ''x''.parent.parent)
     '''end while'''
     '''return''' ''x''
 '''end function'''

'''Path halving''' works similarly but replaces only every other parent pointer:

 '''function''' Find(''x'') '''is'''
     '''while''' ''x''.parent ≠ ''x'' '''do'''
         ''x''.parent := ''x''.parent.parent
         ''x'' := ''x''.parent
     '''end while'''
     '''return''' ''x''
 '''end function'''
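For illustration, the constant-memory two-pass compression and one-pass path halving could be written over a plain parent array as follows (a sketch under the array-layout assumption above; the names <code>find</code> and <code>find_halving</code> are chosen here):

<syntaxhighlight lang="python">
# Illustrative Find with full path compression, written as the
# constant-memory two-pass walk over a plain parent array.
def find(parent, x):
    root = x
    while parent[root] != root:       # first pass: locate the root
        root = parent[root]
    while parent[x] != root:          # second pass: repoint nodes to the root
        parent[x], x = root, parent[x]
    return root

# Illustrative one-pass path halving: every other node on the path
# is repointed to its grandparent.
def find_halving(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
</syntaxhighlight>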
=== Merging two sets ===
[[File:Dsu disjoint sets init.svg|thumb|360px|<code>MakeSet</code> creates 8 singletons.]]
[[File:Dsu disjoint sets final.svg|thumb|360px|After some operations of <code>Union</code>, some sets are grouped together.]]
The operation <code>Union(''x'', ''y'')</code> replaces the set containing {{mvar|x}} and the set containing {{mvar|y}} with their union. <code>Union</code> first uses <code>Find</code> to determine the roots of the trees containing {{mvar|x}} and {{mvar|y}}. If the roots are the same, there is nothing more to do. Otherwise, the two trees must be merged. This is done by either setting the parent pointer of {{mvar|x}}'s root to {{mvar|y}}'s, or setting the parent pointer of {{mvar|y}}'s root to {{mvar|x}}'s.

The choice of which node becomes the parent has consequences for the complexity of future operations on the tree. If it is done carelessly, trees can become excessively tall. For example, suppose that <code>Union</code> always made the tree containing {{mvar|x}} a subtree of the tree containing {{mvar|y}}. Begin with a forest that has just been initialized with elements <math>1, 2, 3, \ldots, n,</math> and execute <code>{{math|Union(1, 2)}}</code>, <code>{{math|Union(2, 3)}}</code>, ..., <code>{{math|Union(''n'' − 1, ''n'')}}</code>. The resulting forest contains a single tree whose root is {{mvar|n}}, and the path from 1 to {{mvar|n}} passes through every node in the tree. For this forest, the time to run <code>Find(1)</code> is {{math|''O''(''n'')}}.

In an efficient implementation, tree height is controlled using '''union by size''' or '''union by rank'''. Both of these require a node to store information besides just its parent pointer. This information is used to decide which root becomes the new parent. Both strategies ensure that trees do not become too deep.

==== Union by size ====
In the case of union by size, a node stores its size, which is simply its number of descendants (including the node itself). When the trees with roots {{mvar|x}} and {{mvar|y}} are merged, the node with more descendants becomes the parent. If the two nodes have the same number of descendants, then either one can become the parent. In both cases, the size of the new parent node is set to its new total number of descendants.

 '''function''' Union(''x'', ''y'') '''is'''
     ''// Replace nodes by roots''
     ''x'' := Find(''x'')
     ''y'' := Find(''y'')
     '''if''' ''x'' = ''y'' '''then'''
         '''return'''  ''// x and y are already in the same set''
     '''end if'''
     ''// If necessary, swap variables to ensure that''
     ''// x has at least as many descendants as y''
     '''if''' ''x''.size < ''y''.size '''then'''
         (''x'', ''y'') := (''y'', ''x'')
     '''end if'''
     ''// Make x the new root''
     ''y''.parent := ''x''
     ''// Update the size of x''
     ''x''.size := ''x''.size + ''y''.size
 '''end function'''

The number of bits necessary to store the size is clearly the number of bits necessary to store {{mvar|n}}. This adds a constant factor to the forest's required storage.

==== Union by rank ====
For union by rank, a node stores its {{em|rank}}, which is an upper bound for its height. When a node is initialized, its rank is set to zero. To merge trees with roots {{mvar|x}} and {{mvar|y}}, first compare their ranks. If the ranks are different, then the tree with larger rank becomes the parent, and the ranks of {{mvar|x}} and {{mvar|y}} do not change. If the ranks are the same, then either one can become the parent, but the new parent's rank is incremented by one. While the rank of a node is clearly related to its height, storing ranks is more efficient than storing heights. The height of a node can change during a <code>Find</code> operation, so storing ranks avoids the extra effort of keeping the height correct. In pseudocode, union by rank is:

 '''function''' Union(''x'', ''y'') '''is'''
     ''// Replace nodes by roots''
     ''x'' := Find(''x'')
     ''y'' := Find(''y'')
     '''if''' ''x'' = ''y'' '''then'''
         '''return'''  ''// x and y are already in the same set''
     '''end if'''
     ''// If necessary, rename variables to ensure that''
     ''// x has rank at least as large as that of y''
     '''if''' ''x''.rank < ''y''.rank '''then'''
         (''x'', ''y'') := (''y'', ''x'')
     '''end if'''
     ''// Make x the new root''
     ''y''.parent := ''x''
     ''// If necessary, increment the rank of x''
     '''if''' ''x''.rank = ''y''.rank '''then'''
         ''x''.rank := ''x''.rank + 1
     '''end if'''
 '''end function'''

It can be shown that every node has rank <math>\lfloor \log n \rfloor</math> or less.<ref name="Cormen2009"/> Consequently, each rank can be stored in {{math|''O''(log log ''n'')}} bits, and all the ranks can be stored in {{math|''O''(''n'' log log ''n'')}} bits. This makes the ranks an asymptotically negligible portion of the forest's size.

It is clear from the above implementations that the size and rank of a node do not matter unless a node is the root of a tree. Once a node becomes a child, its size and rank are never accessed again.
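A plain-array rendering of union by size, reusing the <code>find</code> function from the sketch above (illustrative only; union by rank differs just in the stored invariant and the tie-breaking increment):

<syntaxhighlight lang="python">
# Illustrative union by size over plain arrays, reusing find() from the
# earlier sketch. The root of the larger tree becomes the new parent.
def union(parent, size, x, y):
    x, y = find(parent, x), find(parent, y)
    if x == y:
        return                # x and y are already in the same set
    if size[x] < size[y]:
        x, y = y, x           # ensure x is the root of the larger tree
    parent[y] = x             # attach the smaller tree below x
    size[x] += size[y]        # x's tree absorbed y's descendants

# Example: merge {0, 1} and {2}, then test same-set membership.
parent, size = list(range(3)), [1, 1, 1]
union(parent, size, 0, 1)
union(parent, size, 1, 2)
assert find(parent, 0) == find(parent, 2)
</syntaxhighlight>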
== Time complexity ==
A disjoint-set forest implementation in which <code>Find</code> does not update parent pointers, and in which <code>Union</code> does not attempt to control tree heights, can have trees with height {{math|''O''(''n'')}}. In such a situation, the <code>Find</code> and <code>Union</code> operations require {{math|''O''(''n'')}} time.

If an implementation uses path compression alone, then a sequence of {{mvar|n}} <code>MakeSet</code> operations, followed by up to {{math|''n'' − 1}} <code>Union</code> operations and {{math|''f''}} <code>Find</code> operations, has a worst-case running time of <math>\Theta(n+f \cdot \left(1 + \log_{2+f/n} n\right))</math>.<ref name="Cormen2009">{{cite book|first1=Thomas H.|last1=Cormen| author1-link=Thomas H. Cormen|first2=Charles E.|last2=Leiserson|author2-link=Charles E. Leiserson|first3=Ronald L.|last3=Rivest| author3-link=Ronald L. Rivest|first4=Clifford|last4=Stein|author4-link=Clifford Stein|title=Introduction to Algorithms| edition=Third|publisher=MIT Press|chapter=Chapter 21: Data structures for Disjoint Sets|pages=571–572| isbn=978-0-262-03384-8| year=2009|title-link=Introduction to Algorithms }}</ref> Using union by rank, but without updating parent pointers during <code>Find</code>, gives a running time of <math>\Theta(m \log n)</math> for {{mvar|m}} operations of any type, up to {{mvar|n}} of which are <code>MakeSet</code> operations.<ref name="Cormen2009"/>

The combination of path compression, splitting, or halving, with union by size or by rank, reduces the running time for {{mvar|m}} operations of any type, up to {{mvar|n}} of which are <code>MakeSet</code> operations, to <math>\Theta(m\alpha(n))</math>.<ref name="Tarjan1984"/><ref name="Tarjan1979"/> This makes the [[amortized analysis|amortized running time]] of each operation <math>\Theta(\alpha(n))</math>. This is asymptotically optimal: every disjoint-set data structure must use <math>\Omega(\alpha(n))</math> amortized time per operation.<ref name="Fredman1989"/> Here, the function <math>\alpha(n)</math> is the [[Ackermann function#Inverse|inverse Ackermann function]]. The inverse Ackermann function grows extraordinarily slowly, so this factor is {{math|4}} or less for any {{mvar|n}} that can actually be written in the physical universe. This makes disjoint-set operations practically amortized constant time.

=== Proof of O(m log* n) time complexity of Union-Find ===
The precise analysis of the performance of a disjoint-set forest is somewhat intricate. However, there is a much simpler analysis that proves that the amortized time for any {{mvar|m}} <code>Find</code> or <code>Union</code> operations on a disjoint-set forest containing {{mvar|n}} objects is {{math|''O''(''m'' log<sup>*</sup> ''n'')}}, where {{math|log<sup>*</sup>}} denotes the [[iterated logarithm]].<ref>[[Raimund Seidel]], Micha Sharir. "Top-down analysis of path compression", SIAM J. Comput. 34(3):515–525, 2005</ref><ref name="tarjan1975"/><ref name="Hopcroft1973"/><ref name="Tarjan1984"/>

{{anchor|increasing rank lemma}}Lemma 1: As the [[#Disjoint-set forests|find function]] follows the path along to the root, the ranks of the nodes it encounters are increasing.
{{math proof| We claim that as Find and Union operations are applied to the data set, this fact remains true over time. Initially, when each node is the root of its own tree, it is trivially true. The only case when the rank of a node might be changed is when the [[#Disjoint-set forests|Union by Rank]] operation is applied. In this case, a tree with smaller rank will be attached to a tree with greater rank, rather than vice versa. And during the find operation, all nodes visited along the path will be attached to the root, which has larger rank than its children, so this operation won't change this fact either.}}

{{anchor|min subtree size lemma}}Lemma 2: A node {{mvar|u}} which is the root of a subtree with rank {{mvar|r}} has at least <math>2^r</math> nodes.

{{math proof| Initially, when each node is the root of its own tree, it is trivially true. Assume that a node {{mvar|u}} with rank {{mvar|r}} has at least {{math|2<sup>''r''</sup>}} nodes. Then when two trees with rank {{mvar|r}} are merged using the operation [[#Disjoint-set forests|Union by Rank]], a tree with rank {{math|''r'' + 1}} results, the root of which has at least <math>2^r + 2^r = 2^{r + 1}</math> nodes.}}

[[File:ProofOflogstarnRank.jpg|center]]

Lemma 3: The maximum number of nodes of rank {{mvar|r}} is at most <math>\frac{n}{2^r}.</math>

{{math proof| From [[#min subtree size lemma|lemma 2]], we know that a node {{mvar|u}} which is the root of a subtree with rank {{mvar|r}} has at least <math>2^r</math> nodes. We will get the maximum number of nodes of rank {{mvar|r}} when each node with rank {{mvar|r}} is the root of a tree that has exactly <math>2^r</math> nodes. In this case, the number of nodes of rank {{mvar|r}} is <math>\frac{n}{2^r}.</math>}}

At any particular point in the execution, we can group the vertices of the graph into "buckets", according to their rank. We define the buckets' ranges inductively, as follows:
* Bucket 0 contains vertices of rank 0.
* Bucket 1 contains vertices of rank 1.
* Bucket 2 contains vertices of ranks 2 and 3.
In general, if the {{mvar|B}}-th bucket contains vertices with ranks from the interval <math>\left[r, 2^r - 1\right] = [r, R - 1]</math>, then the ({{mvar|B}} + 1)-st bucket will contain vertices with ranks from the interval <math>\left[R, 2^R - 1\right].</math>

For <math>B \in \mathbb{N}</math>, let <math>\text{tower}(B) = \underbrace{2^{2^{\cdots^2}}}_{B \text{ times}}</math>. Then bucket <math>B</math> will have vertices with ranks in the interval <math>[\text{tower}(B-1), \text{tower}(B)-1]</math>.

[[File:Proof_of_O(log*n)_Union_Find.jpg|center|frame|Proof of <math>O(\log^*n)</math> Union Find]]
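To make these rank intervals concrete, a small illustrative computation (not part of the cited proofs; the helper name <code>tower</code> simply mirrors the definition above, with <code>tower(0)</code> taken to be 1):

<syntaxhighlight lang="python">
# Illustrative computation of the bucket ranges used in the proof:
# tower(B) = 2^2^...^2 with B twos, so bucket B (for B >= 1) holds
# ranks in [tower(B-1), tower(B)-1]; bucket 0 holds rank 0 alone.
def tower(b):
    return 1 if b == 0 else 2 ** tower(b - 1)

for b in range(1, 5):
    print(f"bucket {b}: ranks {tower(b - 1)} .. {tower(b) - 1}")
# bucket 1: ranks 1 .. 1
# bucket 2: ranks 2 .. 3
# bucket 3: ranks 4 .. 15
# bucket 4: ranks 16 .. 65535
</syntaxhighlight>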
We can make two observations about the buckets' sizes.
# {{anchor|max buckets}}The total number of buckets is at most {{math|log<sup>*</sup>''n''}}.
#: Proof: Since no vertex can have rank greater than <math>n</math>, only the first <math>\log^* (n)</math> buckets can have vertices, where <math>\log^*</math> denotes the inverse of the <math>\text{tower}</math> function defined above.
# {{anchor|max bucket size}}The maximum number of elements in bucket <math>\left[B, 2^B - 1\right]</math> is at most <math>\frac{2n}{2^B}</math>.
#: Proof: The maximum number of elements in bucket <math>\left[B, 2^B - 1\right]</math> is at most <math>\frac{n}{2^B} + \frac{n}{2^{B+1}} + \frac{n}{2^{B+2}} + \cdots + \frac{n}{2^{2^B - 1}} \leq \frac{2n}{2^B}.</math>

Let {{mvar|F}} represent the list of "find" operations performed, and let
<math display=block>T_1 = \sum_F\text{(link to the root)}</math>
<math display=block>T_2 = \sum_F\text{(number of links traversed where the buckets are different)}</math>
<math display=block>T_3 = \sum_F\text{(number of links traversed where the buckets are the same).}</math>
Then the total cost of {{mvar|m}} finds is <math>T = T_1 + T_2 + T_3.</math>

Since each find operation makes exactly one traversal that leads to a root, we have {{math|1=''T''<sub>1</sub> = ''O''(''m'')}}. Also, from the bound above on the number of buckets, we have {{math|1=''T''<sub>2</sub> = ''O''(''m''log<sup>*</sup>''n'')}}.

For {{mvar|T<sub>3</sub>}}, suppose we are traversing an edge from {{mvar|u}} to {{mvar|v}}, where {{mvar|u}} and {{mvar|v}} have rank in the bucket {{math|[''B'', 2<sup>''B''</sup> − 1]}} and {{mvar|v}} is not the root (at the time of this traversal; otherwise the traversal would be accounted for in {{mvar|T<sub>1</sub>}}). Fix {{mvar|u}} and consider the sequence <math>v_1, v_2, \ldots, v_k</math> of nodes that take the role of {{mvar|v}} in different find operations. Because of path compression and not accounting for the edge to a root, this sequence contains only different nodes, and because of [[#increasing rank lemma|Lemma 1]] we know that the ranks of the nodes in this sequence are strictly increasing. Because both nodes are in the same bucket, we can conclude that the length {{mvar|k}} of the sequence (the number of times node {{mvar|u}} is attached to a different root in the same bucket) is at most the number of ranks in bucket {{mvar|B}}, that is, at most <math>2^B - 1 - B < 2^B.</math>

Therefore, <math>T_3 \leq \sum_{[B, 2^B - 1]} \sum_u 2^B.</math> From Observations [[#max buckets|1]] and [[#max bucket size|2]], we can conclude that <math display="inline">T_3 \leq \sum_{B} 2^B \frac{2n}{2^B} \leq 2 n \log^* n.</math> Therefore, <math>T = T_1 + T_2 + T_3 = O(m \log^*n).</math>

== Other structures ==
=== Better worst-case time per operation ===
The worst-case time of the <code>Find</code> operation in trees with '''union by rank''' or '''union by weight''' is <math>\Theta(\log n)</math> (i.e., it is <math>O(\log n)</math> and this bound is tight). In 1985, N. Blum gave an implementation of the operations that does not use path compression but compresses trees during <code>Union</code>. His implementation runs in <math>O(\log n / \log\log n)</math> time per operation,<ref>{{cite journal |last1=Blum |first1=Norbert |title=On the Single-Operation Worst-Case Time Complexity of the Disjoint Set Union Problem |journal=2nd Symp. On Theoretical Aspects of Computer Science |date=1985 |pages=32–38}}</ref> and thus in comparison with Galler and Fischer's structure it has a better worst-case time per operation but inferior amortized time. In 1999, Alstrup et al. gave a structure that has optimal worst-case time <math>O(\log n / \log\log n)</math> together with inverse-Ackermann amortized time.<ref>{{cite book |last1=Alstrup |first1=Stephen |last2=Ben-Amram |first2=Amir M. |last3=Rauhe |first3=Theis |title=Proceedings of the thirty-first annual ACM symposium on Theory of Computing |chapter=Worst-case and amortised optimality in union-find (Extended abstract) |date=1999 |pages=499–506 |doi=10.1145/301250.301383|isbn=1581130678 |s2cid=100111 }}</ref>

=== Deletion ===
The regular implementation as disjoint-set forests does not react favorably to the deletion of elements, in the sense that the time for <code>Find</code> will not improve as a result of the decrease in the number of elements. However, there exist modern implementations that allow for constant-time deletion and where the time bound for <code>Find</code> depends on the ''current'' number of elements.<ref>{{cite journal |last1=Alstrup |first1=Stephen |last2=Thorup |first2=Mikkel |last3=Gørtz |first3=Inge Li |last4=Rauhe |first4=Theis |last5=Zwick |first5=Uri |title=Union-Find with Constant Time Deletions |journal= ACM Transactions on Algorithms|date=2014 |volume=11 |issue=1 |pages=6:1–6:28|doi=10.1145/2636922 |s2cid=12767012 }}</ref><ref>{{cite journal |last1=Ben-Amram |first1=Amir M. |last2=Yoffe |first2=Simon |title=A simple and efficient Union-Find-Delete algorithm |journal=Theoretical Computer Science |date=2011 |volume=412 |issue=4–5 |pages=487–492|doi=10.1016/j.tcs.2010.11.005 }}</ref>

== Applications ==
[[File:UnionFindKruskalDemo.gif|250px|thumb|A demo for Union-Find when using Kruskal's algorithm to find minimum spanning tree.]]
Disjoint-set data structures model the [[Partition of a set|partitioning of a set]], for example to keep track of the [[Connected component (graph theory)|connected components]] of an [[undirected graph]]. This model can then be used to determine whether two vertices belong to the same component, or whether adding an edge between them would result in a cycle. The Union–Find algorithm is used in high-performance implementations of [[Unification (computer science)|unification]].<ref name="Knight1989">{{cite journal|last1=Knight|first1=Kevin|year=1989|title=Unification: A multidisciplinary survey|journal=ACM Computing Surveys|pages=93–124|doi=10.1145/62029.62030|volume=21|s2cid=14619034|url=http://www.isi.edu/natural-language/people/unification-knight.pdf }}</ref>

This data structure is used by the [[Boost Graph Library]] to implement its [http://www.boost.org/libs/graph/doc/incremental_components.html Incremental Connected Components] functionality. It is also a key component in implementing [[Kruskal's algorithm]] to find the [[minimum spanning tree]] of a graph. The [[Hoshen–Kopelman algorithm]] also relies on a Union-Find data structure.
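As an illustration of the Kruskal use case, the following sketch reuses the <code>find</code> and <code>union</code> functions from the earlier examples; the edge-list format and the name <code>kruskal</code> are assumptions made here, not part of any cited implementation:

<syntaxhighlight lang="python">
# Illustrative Kruskal's algorithm over the union-find sketches above.
# edges is a list of (weight, u, v) triples on vertices 0..n-1.
def kruskal(n, edges):
    parent, size = list(range(n)), [1] * n
    mst = []
    for w, u, v in sorted(edges):               # consider edges by weight
        if find(parent, u) != find(parent, v):  # distinct components?
            union(parent, size, u, v)           # merge the components ...
            mst.append((u, v, w))               # ... and keep the edge
    return mst
</syntaxhighlight>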
== See also ==
* {{annotated link|Partition refinement}}, a different data structure for maintaining disjoint sets, with updates that split sets apart rather than merging them together
* {{annotated link|Dynamic connectivity}}

== References ==
{{reflist|30em}}

== External links ==
* [https://www.boost.org/doc/libs/1_31_0/libs/disjoint_sets/disjoint_sets.html C++ implementation], part of the [[Boost C++ libraries]]
* [https://github.com/jgrapht/jgrapht/blob/master/jgrapht-core/src/main/java/org/jgrapht/alg/util/UnionFind.java Java implementation], part of the [https://jgrapht.org/ JGraphT] library
* [https://the-algorithms.com/algorithm/union-find JavaScript implementation]
* [http://code.activestate.com/recipes/215912-union-find-data-structure/ Python implementation]

{{data structures}}

[[Category:Search algorithms]]
[[Category:Amortized data structures]]
[[Category:Articles with example pseudocode]]