== Operations ==

Disjoint-set data structures support three operations: making a new set containing a new element, finding the representative of the set containing a given element, and merging two sets.

=== Making new sets ===

The <code>MakeSet</code> operation adds a new element into a new set containing only the new element, and the new set is added to the data structure. If the data structure is instead viewed as a partition of a set, then the <code>MakeSet</code> operation enlarges the set by adding the new element, and it extends the existing partition by putting the new element into a new subset containing only the new element.

In a disjoint-set forest, <code>MakeSet</code> initializes the node's parent pointer and the node's size or rank. If a root is represented by a node that points to itself, then adding an element can be described using the following pseudocode:

 '''function''' MakeSet(''x'') '''is'''
     '''if''' ''x'' is not already in the forest '''then'''
         ''x''.parent := ''x''
         ''x''.size := 1     ''// if nodes store size''
         ''x''.rank := 0     ''// if nodes store rank''
     '''end if'''
 '''end function'''

This operation has constant time complexity. In particular, initializing a disjoint-set forest with {{mvar|n}} nodes requires {{math|''O''(''n'')}} time. Lack of a parent assigned to a node implies that the node is not present in the forest.

In practice, <code>MakeSet</code> must be preceded by an operation that allocates memory to hold {{mvar|x}}. As long as memory allocation is an amortized constant-time operation, as it is for a good [[dynamic array]] implementation, it does not change the asymptotic performance of the disjoint-set forest.

=== Finding set representatives ===

The <code>Find</code> operation follows the chain of parent pointers from a specified query node {{mvar|x}} until it reaches a root element. This root element represents the set to which {{mvar|x}} belongs and may be {{mvar|x}} itself. <code>Find</code> returns the root element it reaches.

Performing a <code>Find</code> operation presents an important opportunity for improving the forest. The time in a <code>Find</code> operation is spent chasing parent pointers, so a flatter tree leads to faster <code>Find</code> operations. When a <code>Find</code> is executed, there is no faster way to reach the root than by following each parent pointer in succession. However, the parent pointers visited during this search can be updated to point closer to the root. Because every element visited on the way to a root is part of the same set, this does not change the sets stored in the forest. But it makes future <code>Find</code> operations faster, not only for the nodes between the query node and the root, but also for their descendants. This updating is an important part of the disjoint-set forest's amortized performance guarantee.

There are several algorithms for <code>Find</code> that achieve the asymptotically optimal time complexity. One family of algorithms, known as '''path compression''', makes every node between the query node and the root point to the root. Path compression can be implemented using a simple recursion as follows:

 '''function''' Find(''x'') '''is'''
     '''if''' ''x''.parent ≠ ''x'' '''then'''
         ''x''.parent := Find(''x''.parent)
         '''return''' ''x''.parent
     '''else'''
         '''return''' ''x''
     '''end if'''
 '''end function'''

This implementation makes two passes, one up the tree and one back down.
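The following Python rendering is one possible concrete version of <code>MakeSet</code> and the recursive path-compression <code>Find</code> above; the <code>DisjointSet</code> class, its dictionary fields, and its method names are illustrative choices, not part of any standard interface:

<syntaxhighlight lang="python">
class DisjointSet:
    """Illustrative disjoint-set forest with parent, size and rank kept in dicts."""

    def __init__(self):
        self.parent = {}
        self.size = {}
        self.rank = {}

    def make_set(self, x):
        # Add x as a new singleton set if it is not already in the forest.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1   # used by union by size
            self.rank[x] = 0   # used by union by rank

    def find(self, x):
        # Recursive Find with path compression: after the call, every node
        # on the path from x to the root points directly at the root.
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]
</syntaxhighlight>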
It requires enough scratch memory to store the path from the query node to the root (in the above pseudocode, the path is implicitly represented using the call stack). This can be decreased to a constant amount of memory by performing both passes in the same direction. The constant memory implementation walks from the query node to the root twice, once to find the root and once to update pointers:

 '''function''' Find(''x'') '''is'''
     ''root'' := ''x''
     '''while''' ''root''.parent ≠ ''root'' '''do'''
         ''root'' := ''root''.parent
     '''end while'''
     '''while''' ''x''.parent ≠ ''root'' '''do'''
         ''parent'' := ''x''.parent
         ''x''.parent := ''root''
         ''x'' := ''parent''
     '''end while'''
     '''return''' ''root''
 '''end function'''

[[Robert E. Tarjan|Tarjan]] and [[Jan van Leeuwen|Van Leeuwen]] also developed one-pass <code>Find</code> algorithms that retain the same worst-case complexity but are more efficient in practice.<ref name="Tarjan1984"/> These are called path splitting and path halving. Both of these update the parent pointers of nodes on the path between the query node and the root. '''Path splitting''' replaces every parent pointer on that path by a pointer to the node's grandparent:

 '''function''' Find(''x'') '''is'''
     '''while''' ''x''.parent ≠ ''x'' '''do'''
         (''x'', ''x''.parent) := (''x''.parent, ''x''.parent.parent)
     '''end while'''
     '''return''' ''x''
 '''end function'''

'''Path halving''' works similarly but replaces only every other parent pointer:

 '''function''' Find(''x'') '''is'''
     '''while''' ''x''.parent ≠ ''x'' '''do'''
         ''x''.parent := ''x''.parent.parent
         ''x'' := ''x''.parent
     '''end while'''
     '''return''' ''x''
 '''end function'''

=== Merging two sets ===

[[File:Dsu disjoint sets init.svg|thumb|360px|<code>MakeSet</code> creates 8 singletons.]]
[[File:Dsu disjoint sets final.svg|thumb|360px|After some operations of <code>Union</code>, some sets are grouped together.]]

The operation <code>Union(''x'', ''y'')</code> replaces the set containing {{mvar|x}} and the set containing {{mvar|y}} with their union. <code>Union</code> first uses <code>Find</code> to determine the roots of the trees containing {{mvar|x}} and {{mvar|y}}. If the roots are the same, there is nothing more to do. Otherwise, the two trees must be merged. This is done by either setting the parent pointer of {{mvar|x}}'s root to {{mvar|y}}'s, or setting the parent pointer of {{mvar|y}}'s root to {{mvar|x}}'s.

The choice of which node becomes the parent has consequences for the complexity of future operations on the tree. If it is done carelessly, trees can become excessively tall. For example, suppose that <code>Union</code> always made the tree containing {{mvar|x}} a subtree of the tree containing {{mvar|y}}. Begin with a forest that has just been initialized with elements <math>1, 2, 3, \ldots, n,</math> and execute <code>{{math|Union(1, 2)}}</code>, <code>{{math|Union(2, 3)}}</code>, ..., <code>{{math|Union(''n'' - 1, ''n'')}}</code>. The resulting forest contains a single tree whose root is {{mvar|n}}, and the path from 1 to {{mvar|n}} passes through every node in the tree. For this forest, the time to run <code>Find(1)</code> is {{math|''O''(''n'')}}.

In an efficient implementation, tree height is controlled using '''union by size''' or '''union by rank'''. Both of these require a node to store information besides just its parent pointer. This information is used to decide which root becomes the new parent. Both strategies ensure that trees do not become too deep.
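The worst case just described can be reproduced with a short, self-contained Python sketch. The helper names below are hypothetical, the union is deliberately careless (it always attaches {{mvar|x}}'s root below {{mvar|y}}'s root), and no path compression is applied, so <code>Find(1)</code> must follow {{math|''n'' − 1}} parent pointers:

<syntaxhighlight lang="python">
# Build the degenerate chain 1 -> 2 -> ... -> n produced by a careless Union.
n = 1000
parent = {i: i for i in range(1, n + 1)}      # MakeSet for elements 1..n

def find_no_compression(x):
    """Follow parent pointers to the root, counting the hops taken."""
    hops = 0
    while parent[x] != x:
        x = parent[x]
        hops += 1
    return x, hops

def careless_union(x, y):
    """Always make x's root a child of y's root (no size or rank heuristic)."""
    root_x, _ = find_no_compression(x)
    root_y, _ = find_no_compression(y)
    if root_x != root_y:
        parent[root_x] = root_y

for i in range(1, n):
    careless_union(i, i + 1)                  # Union(1, 2), Union(2, 3), ..., Union(n - 1, n)

root, hops = find_no_compression(1)
print(root, hops)                             # root is n; Find(1) followed n - 1 = 999 pointers
</syntaxhighlight>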
====Union by size====

In the case of union by size, a node stores its size, which is simply its number of descendants (including the node itself). When the trees with roots {{mvar|x}} and {{mvar|y}} are merged, the node with more descendants becomes the parent. If the two nodes have the same number of descendants, then either one can become the parent. In both cases, the size of the new parent node is set to its new total number of descendants.

 '''function''' Union(''x'', ''y'') '''is'''
     ''// Replace nodes by roots''
     ''x'' := Find(''x'')
     ''y'' := Find(''y'')
     '''if''' ''x'' = ''y'' '''then'''
         '''return'''  ''// x and y are already in the same set''
     '''end if'''
     ''// If necessary, swap variables to ensure that''
     ''// x has at least as many descendants as y''
     '''if''' ''x''.size < ''y''.size '''then'''
         (''x'', ''y'') := (''y'', ''x'')
     '''end if'''
     ''// Make x the new root''
     ''y''.parent := ''x''
     ''// Update the size of x''
     ''x''.size := ''x''.size + ''y''.size
 '''end function'''

The number of bits necessary to store the size is clearly the number of bits necessary to store {{mvar|n}}. This adds a constant factor to the forest's required storage.

====Union by rank====

For union by rank, a node stores its {{em|rank}}, which is an upper bound for its height. When a node is initialized, its rank is set to zero. To merge trees with roots {{mvar|x}} and {{mvar|y}}, first compare their ranks. If the ranks are different, then the larger rank tree becomes the parent, and the ranks of {{mvar|x}} and {{mvar|y}} do not change. If the ranks are the same, then either one can become the parent, but the new parent's rank is incremented by one. While the rank of a node is clearly related to its height, storing ranks is more efficient than storing heights. The height of a node can change during a <code>Find</code> operation, so storing ranks avoids the extra effort of keeping the height correct. In pseudocode, union by rank is:

 '''function''' Union(''x'', ''y'') '''is'''
     ''// Replace nodes by roots''
     ''x'' := Find(''x'')
     ''y'' := Find(''y'')
     '''if''' ''x'' = ''y'' '''then'''
         '''return'''  ''// x and y are already in the same set''
     '''end if'''
     ''// If necessary, rename variables to ensure that''
     ''// x has rank at least as large as that of y''
     '''if''' ''x''.rank < ''y''.rank '''then'''
         (''x'', ''y'') := (''y'', ''x'')
     '''end if'''
     ''// Make x the new root''
     ''y''.parent := ''x''
     ''// If necessary, increment the rank of x''
     '''if''' ''x''.rank = ''y''.rank '''then'''
         ''x''.rank := ''x''.rank + 1
     '''end if'''
 '''end function'''

It can be shown that every node has rank <math>\lfloor \log n \rfloor</math> or less.<ref name="Cormen2009"/> Consequently, each rank can be stored in {{math|''O''(log log ''n'')}} bits and all the ranks can be stored in {{math|''O''(''n'' log log ''n'')}} bits. This makes the ranks an asymptotically negligible portion of the forest's size.

It is clear from the above implementations that the size and rank of a node do not matter unless a node is the root of a tree. Once a node becomes a child, its size and rank are never accessed again.
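For concreteness, the following self-contained Python sketch combines path-compression <code>Find</code> with union by rank, following the pseudocode above; the dictionary-based layout and the function names are illustrative only, and a dictionary of sizes could be maintained in the same way for union by size:

<syntaxhighlight lang="python">
# Illustrative disjoint-set forest using plain dictionaries.
parent = {}
rank = {}

def make_set(x):
    """Create a singleton set for x if it is not already in the forest."""
    if x not in parent:
        parent[x] = x
        rank[x] = 0

def find(x):
    """Find with path compression."""
    if parent[x] != x:
        parent[x] = find(parent[x])
    return parent[x]

def union(x, y):
    """Union by rank of the sets containing x and y."""
    x, y = find(x), find(y)
    if x == y:
        return                      # already in the same set
    if rank[x] < rank[y]:
        x, y = y, x                 # ensure x has rank at least as large as y
    parent[y] = x                   # attach the lower-rank root below the higher
    if rank[x] == rank[y]:
        rank[x] += 1                # equal ranks: the new root's rank grows by one

# Example: four elements, two merges, then connectivity queries.
for e in ("a", "b", "c", "d"):
    make_set(e)
union("a", "b")
union("b", "c")
print(find("a") == find("c"))       # True:  a, b, c are now in one set
print(find("a") == find("d"))       # False: d is still a singleton
</syntaxhighlight>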