== Time complexity ==
A disjoint-set forest implementation in which <code>Find</code> does not update parent pointers, and in which <code>Union</code> does not attempt to control tree heights, can have trees with height {{math|''O''(''n'')}}. In such a situation, the <code>Find</code> and <code>Union</code> operations require {{math|''O''(''n'')}} time.

If an implementation uses path compression alone, then a sequence of {{mvar|n}} <code>MakeSet</code> operations, followed by up to {{math|''n'' − 1}} <code>Union</code> operations and {{mvar|f}} <code>Find</code> operations, has a worst-case running time of <math>\Theta(n+f \cdot \left(1 + \log_{2+f/n} n\right))</math>.<ref name="Cormen2009">{{cite book|first1=Thomas H.|last1=Cormen|author1-link=Thomas H. Cormen|first2=Charles E.|last2=Leiserson|author2-link=Charles E. Leiserson|first3=Ronald L.|last3=Rivest|author3-link=Ronald L. Rivest|first4=Clifford|last4=Stein|author4-link=Clifford Stein|title=Introduction to Algorithms|edition=Third|publisher=MIT Press|chapter=Chapter 21: Data structures for Disjoint Sets|pages=571–572|isbn=978-0-262-03384-8|year=2009|title-link=Introduction to Algorithms}}</ref> Using union by rank, but without updating parent pointers during <code>Find</code>, gives a running time of <math>\Theta(m \log n)</math> for {{mvar|m}} operations of any type, up to {{mvar|n}} of which are <code>MakeSet</code> operations.<ref name="Cormen2009"/>

The combination of path compression, splitting, or halving, with union by size or by rank, reduces the running time for {{mvar|m}} operations of any type, up to {{mvar|n}} of which are <code>MakeSet</code> operations, to <math>\Theta(m\alpha(n))</math>.<ref name="Tarjan1984"/><ref name="Tarjan1979"/> This makes the [[amortized analysis|amortized running time]] of each operation <math>\Theta(\alpha(n))</math>. This is asymptotically optimal, meaning that every disjoint-set data structure must use <math>\Omega(\alpha(n))</math> amortized time per operation.<ref name="Fredman1989"/> Here, <math>\alpha(n)</math> is the [[Ackermann function#Inverse|inverse Ackermann function]], which grows so extraordinarily slowly that this factor is {{math|4}} or less for any {{mvar|n}} that can actually be written in the physical universe. This makes disjoint-set operations practically amortized constant time.
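The following sketch in Python (illustrative only; the names and structure are not taken from the sources cited here) shows one common way to implement the variant analyzed in this section, combining path compression in <code>Find</code> with union by rank:

<syntaxhighlight lang="python">
class DisjointSet:
    """Disjoint-set forest with path compression and union by rank (illustrative sketch)."""

    def __init__(self, n):
        # MakeSet for elements 0..n-1: each element starts as the root of its own singleton tree.
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path compression: after the recursion returns, every node on the
        # path from x points directly at the root.
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        # Union by rank: attach the root of smaller rank under the root of
        # larger rank; increase the rank only when the two ranks are equal.
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
</syntaxhighlight>

With this combination, any sequence of {{mvar|m}} operations on {{mvar|n}} elements runs in <math>\Theta(m\alpha(n))</math> time, as stated above.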
=== Proof of O(m log* n) time complexity of Union-Find ===
The precise analysis of the performance of a disjoint-set forest is somewhat intricate. However, there is a much simpler analysis that proves that the amortized time for any {{mvar|m}} <code>Find</code> or <code>Union</code> operations on a disjoint-set forest containing {{mvar|n}} objects is {{math|''O''(''m'' log<sup>*</sup> ''n'')}}, where {{math|log<sup>*</sup>}} denotes the [[iterated logarithm]].<ref>[[Raimund Seidel]], Micha Sharir. "Top-down analysis of path compression". SIAM J. Comput. 34(3):515–525, 2005.</ref><ref>{{cite journal|last1=Tarjan|first1=Robert Endre|year=1975|title=Efficiency of a Good But Not Linear Set Union Algorithm|url=http://portal.acm.org/citation.cfm?id=321884|journal=Journal of the ACM|volume=22|issue=2|pages=215–225|doi=10.1145/321879.321884|hdl=1813/5942|s2cid=11105749|hdl-access=free}}</ref><ref>{{cite journal|last1=Hopcroft|first1=J. E.|last2=Ullman|first2=J. D.|year=1973|title=Set Merging Algorithms|journal=SIAM Journal on Computing|volume=2|issue=4|pages=294–303|doi=10.1137/0202024}}</ref><ref>[[Robert E. Tarjan]] and [[Jan van Leeuwen]]. "Worst-case analysis of set union algorithms". Journal of the ACM, 31(2):245–281, 1984.</ref>

{{anchor|increasing rank lemma}}Lemma 1: As the [[#Disjoint-set forests|find function]] follows the path toward the root, the ranks of the nodes it encounters are increasing.

{{math proof| We claim that as Find and Union operations are applied to the data set, this fact remains true over time. Initially, when each node is the root of its own tree, it is trivially true. The only case when the rank of a node might change is when the [[#Disjoint-set forests|Union by Rank]] operation is applied. In this case, a tree with smaller rank will be attached to a tree with greater rank, rather than vice versa. And during the find operation, all nodes visited along the path will be attached to the root, which has larger rank than its children, so this operation does not change this fact either.}}

{{anchor|min subtree size lemma}}Lemma 2: A node {{mvar|u}} which is the root of a subtree with rank {{mvar|r}} has at least <math>2^r</math> nodes.

{{math proof| Initially, when each node is the root of its own tree, it is trivially true. Assume that a node {{mvar|u}} with rank {{mvar|r}} has at least {{math|2<sup>''r''</sup>}} nodes. Then when two trees with rank {{mvar|r}} are merged using the operation [[#Disjoint-set forests|Union by Rank]], a tree with rank {{math|''r'' + 1}} results, the root of which has at least <math>2^r + 2^r = 2^{r + 1}</math> nodes.}}

[[File:ProofOflogstarnRank.jpg|center]]

Lemma 3: The maximum number of nodes of rank {{mvar|r}} is at most <math>\frac{n}{2^r}.</math>

{{math proof| From [[#min subtree size lemma|lemma 2]], we know that a node {{mvar|u}} which is the root of a subtree with rank {{mvar|r}} has at least <math>2^r</math> nodes. We get the maximum number of nodes of rank {{mvar|r}} when each node with rank {{mvar|r}} is the root of a tree that has exactly <math>2^r</math> nodes. In this case, the number of nodes of rank {{mvar|r}} is <math>\frac{n}{2^r}.</math>}}

At any particular point in the execution, we can group the vertices of the graph into "buckets", according to their rank. We define the buckets' ranges inductively, as follows: Bucket 0 contains vertices of rank 0. Bucket 1 contains vertices of rank 1. Bucket 2 contains vertices of ranks 2 and 3. In general, if the {{mvar|B}}-th bucket contains vertices with ranks from the interval <math>\left[r, 2^r - 1\right] = [r, R - 1]</math>, then bucket {{math|''B'' + 1}} will contain vertices with ranks from the interval <math>\left[R, 2^R - 1\right].</math>

For <math>B \in \mathbb{N}</math>, let <math>\text{tower}(B) = \underbrace{2^{2^{\cdots^2}}}_{B \text{ times}}</math>. Then bucket <math>B</math> will have vertices with ranks in the interval <math>[\text{tower}(B-1), \text{tower}(B)-1]</math>, as illustrated by the sketch after the two observations below.

[[File:Proof_of_O(log*n)_Union_Find.jpg|center|frame|Proof of <math>O(\log^*n)</math> Union Find]]

We can make two observations about the buckets' sizes.

# {{anchor|max buckets}}The total number of buckets is at most {{math|log<sup>*</sup>''n''}}.
#: Proof: Since no vertex can have rank greater than <math>n</math>, only the first <math>\log^* (n)</math> buckets can have vertices, where <math>\log^*</math> denotes the inverse of the <math>\text{tower}</math> function defined above.
# {{anchor|max bucket size}}The maximum number of elements in bucket <math>\left[B, 2^B - 1\right]</math> is at most <math>\frac{2n}{2^B}</math>.
#: Proof: The maximum number of elements in bucket <math>\left[B, 2^B - 1\right]</math> is at most <math>\frac{n}{2^B} + \frac{n}{2^{B+1}} + \frac{n}{2^{B+2}} + \cdots + \frac{n}{2^{2^B - 1}} \leq \frac{2n}{2^B}.</math>
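For concreteness, the following sketch (illustrative; not part of the cited analyses) computes the <math>\text{tower}</math> function and the index of the bucket containing a given rank, matching the bucket ranges [0, 0], [1, 1], [2, 3], [4, 15], [16, 65535], … used above:

<syntaxhighlight lang="python">
def tower(b):
    """tower(b) = 2^(2^(...^2)) with b twos; tower(0) = 1."""
    result = 1
    for _ in range(b):
        result = 2 ** result
    return result

def bucket_index(rank):
    """Index of the bucket containing the given rank: bucket 0 holds rank 0,
    and bucket B >= 1 holds ranks tower(B - 1) .. tower(B) - 1."""
    b = 0
    while rank >= tower(b):
        b += 1
    return b

# Bucket 0 holds rank 0, bucket 1 holds rank 1, bucket 2 holds ranks 2-3,
# bucket 3 holds ranks 4-15, bucket 4 holds ranks 16-65535, and so on;
# a rank of at most n therefore falls into one of the first log*(n) buckets.
print([bucket_index(r) for r in (0, 1, 2, 3, 4, 15, 16)])  # [0, 1, 2, 2, 3, 3, 4]
</syntaxhighlight>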
Let {{mvar|F}} represent the list of "find" operations performed, and let
<math display=block>T_1 = \sum_F\text{(link to the root)}</math>
<math display=block>T_2 = \sum_F\text{(number of links traversed where the buckets are different)}</math>
<math display=block>T_3 = \sum_F\text{(number of links traversed where the buckets are the same).}</math>

Then the total cost of {{mvar|m}} finds is <math>T = T_1 + T_2 + T_3.</math>

Since each find operation makes exactly one traversal that leads to a root, we have {{math|1=''T''<sub>1</sub> = ''O''(''m'')}}. Also, from the bound above on the number of buckets, we have {{math|1=''T''<sub>2</sub> = ''O''(''m'' log<sup>*</sup>''n'')}}.

For {{mvar|T<sub>3</sub>}}, suppose we are traversing an edge from {{mvar|u}} to {{mvar|v}}, where {{mvar|u}} and {{mvar|v}} have rank in the bucket {{math|[''B'', 2<sup>''B''</sup> − 1]}} and {{mvar|v}} is not the root (at the time of this traversal; otherwise the traversal would be accounted for in {{mvar|T<sub>1</sub>}}). Fix {{mvar|u}} and consider the sequence <math>v_1, v_2, \ldots, v_k</math> of nodes that take the role of {{mvar|v}} in different find operations. Because of path compression and not accounting for the edge to a root, this sequence contains only different nodes, and because of [[#increasing rank lemma|Lemma 1]] the ranks of the nodes in this sequence are strictly increasing. Since both nodes lie in the same bucket, we can conclude that the length {{mvar|k}} of the sequence (the number of times node {{mvar|u}} is attached to a different root in the same bucket) is at most the number of ranks in bucket {{mvar|B}}, that is, at most <math>2^B - 1 - B < 2^B.</math>

Therefore, <math>T_3 \leq \sum_{[B, 2^B - 1]} \sum_u 2^B.</math>

From Observations [[#max buckets|1]] and [[#max bucket size|2]], we can conclude that <math display="inline">T_3 \leq \sum_{B} 2^B \frac{2n}{2^B} \leq 2 n \log^* n.</math>

Therefore, <math>T = T_1 + T_2 + T_3 = O(m \log^*n).</math>
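The decomposition <math>T = T_1 + T_2 + T_3</math> can also be observed directly. The sketch below (illustrative; the counters and names are not taken from the cited sources) classifies every parent link traversed by <code>Find</code> into the three categories used in the proof:

<syntaxhighlight lang="python">
def tower(b):
    # tower(b) = 2^(2^(...^2)) with b twos; tower(0) = 1.
    result = 1
    for _ in range(b):
        result = 2 ** result
    return result

def bucket_index(rank):
    # Bucket 0 holds rank 0; bucket B >= 1 holds ranks tower(B - 1) .. tower(B) - 1.
    b = 0
    while rank >= tower(b):
        b += 1
    return b

class CountingDisjointSet:
    """Union by rank with path compression; every link traversed by find is
    classified as T1 (leads to a root), T2 (endpoints in different buckets),
    or T3 (endpoints in the same bucket), mirroring the proof above."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.t1 = self.t2 = self.t3 = 0

    def find(self, x):
        path = []
        while self.parent[x] != x:
            p = self.parent[x]
            if self.parent[p] == p:
                self.t1 += 1          # link leading directly to the root
            elif bucket_index(self.rank[x]) != bucket_index(self.rank[p]):
                self.t2 += 1          # endpoints lie in different buckets
            else:
                self.t3 += 1          # endpoints lie in the same bucket
            path.append(x)
            x = p
        for node in path:             # path compression
            self.parent[node] = x
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
</syntaxhighlight>

Over any sequence of {{mvar|m}} finds, the three counters correspond to {{math|''T''<sub>1</sub>}}, {{math|''T''<sub>2</sub>}}, and {{math|''T''<sub>3</sub>}}, whose sum is {{math|''O''(''m'' log<sup>*</sup> ''n'')}} as shown above.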