Editing Topological sorting (section)

== Algorithms ==
The usual algorithms for topological sorting have running time linear in the number of nodes plus the number of edges, asymptotically, <math>O(\left|{V}\right| + \left|{E}\right|).</math>

===Kahn's algorithm===
{{Distinguish|Kuhn's algorithm}}
One of these algorithms, first described by {{harvp|Kahn|1962}}, works by choosing vertices in the same order as the eventual topological sort.{{r|Kahn}} First, find a list of "start nodes" that have no incoming edges and insert them into a set S; at least one such node must exist in a non-empty (finite) acyclic graph. Then:

 ''L'' ← Empty list that will contain the sorted elements
 ''S'' ← Set of all nodes with no incoming edge
 
 '''while''' ''S'' '''is not''' empty '''do'''
     remove a node ''n'' from ''S''
     add ''n'' to ''L''
     '''for each''' node ''m'' with an edge ''e'' from ''n'' to ''m'' '''do'''
         remove edge ''e'' from the graph
         '''if''' ''m'' has no other incoming edges '''then'''
             insert ''m'' into ''S''
 
 '''if''' ''graph'' has edges '''then'''
     '''return''' error   ''(graph has at least one cycle)''
 '''else''' 
     '''return''' ''L''   ''(a topologically sorted order)''

If the graph is a [[Directed acyclic graph|DAG]], a solution will be contained in the list L (although the solution is not necessarily unique). Otherwise, the graph must have at least one cycle and therefore a topological sort is impossible.

Reflecting the non-uniqueness of the resulting sort, the structure S can be simply a set or a queue or a stack. Depending on the order that nodes n are removed from set S, a different solution is created. A variation of Kahn's algorithm that breaks ties [[lexicographic order|lexicographically]] forms a key component of the [[Coffman–Graham algorithm]] for parallel scheduling and [[layered graph drawing]].

===Depth-first search===
An alternative algorithm for topological sorting is based on [[depth-first search]]. The algorithm loops through each node of the graph, in an arbitrary order, initiating a depth-first search that terminates when it hits any node that has already been visited since the beginning of the topological sort or the node has no outgoing edges (i.e., a leaf node):

 ''L'' ← Empty list that will contain the sorted nodes
 '''while''' exists nodes without a permanent mark '''do'''
     select an unmarked node ''n''
     visit(''n'')
 
 '''function''' visit(node ''n'')
     '''if''' ''n'' has a permanent mark '''then'''
         '''return'''
     '''if''' ''n'' has a temporary mark '''then'''
         '''stop'''   ''(graph has at least one cycle)''
 
     mark ''n'' with a temporary mark
 
     '''for each''' node ''m'' with an edge from ''n'' to ''m'' '''do'''
         visit(''m'')
 
     mark ''n'' with a permanent mark
     add ''n'' to head of ''L''

Each node ''n'' gets ''prepended'' to the output list L only after considering all other nodes that depend on ''n'' (all descendants of ''n'' in the graph).  Specifically, when the algorithm adds node ''n'', we are guaranteed that all nodes that depend on ''n'' are already in the output list L: they were added to L either by the recursive call to visit() that ended before the call to visit ''n'', or by a call to visit() that started even before the call to visit ''n''.  Since each edge and node is visited once, the algorithm runs in linear time. This depth-first-search-based algorithm is the one described by {{harvp|Cormen|Leiserson|Rivest|Stein|2001}};{{r|CLRS}} it seems to have been first described in print by Tarjan in 1976.{{r|Tarjan}}

===Parallel algorithms===
On a [[parallel random-access machine]], a topological ordering can be constructed in ''O''((log ''n'')<sup>2</sup>) time using a polynomial number of processors, putting the problem into the complexity class '''[[NC (complexity)|NC]]<sup>2</sup>'''.{{r|Cook}}
One method for doing this is to repeatedly square the [[adjacency matrix]] of the given graph, logarithmically many times, using [[min-plus matrix multiplication]] with maximization in place of minimization. The resulting matrix describes the [[Longest path problem|longest path]] distances in the graph. Sorting the vertices by the lengths of their longest incoming paths produces a topological ordering.{{r|DNS}}

An algorithm for parallel topological sorting on [[distributed memory]] machines parallelizes the algorithm of Kahn for a [[Directed acyclic graph|DAG]] <math>G = (V, E)</math>.{{r|SMDD}} On a high level, the algorithm of Kahn repeatedly removes the vertices of indegree 0 and adds them to the topological sorting in the order in which they were removed. Since the outgoing edges of the removed vertices are also removed, there will be a new set of vertices of indegree 0, where the procedure is repeated until no vertices are left. This algorithm performs <math>D+1</math> iterations, where {{mvar|D}} is the longest path in {{mvar|G}}. Each iteration can be parallelized, which is the idea of the following algorithm.

In the following, it is assumed that the graph partition is stored on {{mvar|p}} processing elements (PE), which are labeled <math>0, \dots, p-1</math>. Each PE {{mvar|i}} initializes a set of local vertices <math>Q_i^1</math> with [[indegree]] 0, where the upper index represents the current iteration. Since all vertices in the local sets <math>Q_0^1, \dots, Q_{p-1}^1</math> have indegree 0, i.e., they are not adjacent, they can be given in an arbitrary order for a valid topological sorting. To assign a global index to each vertex, a [[prefix sum]] is calculated over the sizes of <math>Q_0^1, \dots, Q_{p-1}^1</math>. So, each step, there are <math display="inline">\sum_{i=0}^{p-1} |Q_i|</math> vertices added to the topological sorting.
[[File:Parallel_Topological_Sorting.gif|thumb|upright=1.35|Execution of the parallel topological sorting algorithm on a DAG with two processing elements.]]
In the first step, PE {{mvar|j}} assigns the indices <math display="inline">\sum_{i=0}^{j-1} |Q_i^1|, \dots, \left(\sum_{i=0}^{j} |Q_i^1|\right) - 1</math> to the local vertices in <math>Q_j^1</math>. These vertices in <math>Q_j^1</math> are removed, together with their corresponding outgoing edges. For each outgoing edge <math>(u, v)</math> with endpoint {{mvar|v}} in another PE <math>l, j \neq l</math>, the message <math>(u, v)</math> is posted to PE {{mvar|l}}. After all vertices in <math>Q_j^1</math> are removed, the posted messages are sent to their corresponding PE. Each message <math>(u, v)</math> received updates the indegree of the local vertex {{mvar|v}}. If the indegree drops to zero, {{mvar|v}} is added to <math>Q_j^2</math>. Then the next iteration starts.

In step {{mvar|k}}, PE {{mvar|j}} assigns the indices <math display="inline">a_{k-1} + \sum_{i=0}^{j-1} |Q_i^k|, \dots, a_{k-1} + \left(\sum_{i=0}^{j} |Q_i^k|\right) - 1</math>, where  <math>a_{k-1}</math> is the total number of processed vertices after step {{tmath|k-1}}. This procedure repeats until there are no vertices left to process, hence <math display="inline">\sum_{i=0}^{p-1} |Q_i^{D+1}| = 0</math>. Below is a high level, [[SPMD|single program, multiple data]] pseudo-code overview of this algorithm.

Note that the [[Prefix sum#Parallel algorithms|prefix sum]] for the local offsets <math display="inline">a_{k-1} + \sum_{i=0}^{j-1} |Q_i^k|, \dots, a_{k-1} + \left(\sum_{i=0}^{j} |Q_i^k|\right) - 1</math> can be efficiently calculated in parallel.
 '''p''' processing elements with IDs from 0 to ''p''-1
 '''Input:''' G = (V, E) DAG, distributed to PEs, PE index j = 0, ..., p - 1
 '''Output:''' topological sorting of G
 
 '''function''' traverseDAGDistributed
     δ incoming degree of local vertices ''V''
     {{math|1=''Q'' = {{mset|''v'' ∈ ''V'' | δ[''v''] {{=}} 0}}}}                     // All vertices with indegree 0
     nrOfVerticesProcessed = 0
 
     '''do'''
         '''global''' build prefix sum over size of ''Q''     // get offsets and total number of vertices in this step
         offset = nrOfVerticesProcessed + sum(Q<sub>i</sub>, i = 0 to j - 1)          // ''j'' is the processor index
         '''foreach''' u '''in''' Q
             localOrder[u] = index++;
             '''foreach''' (u,v) in E '''do''' post message (''u, v'') to PE owning vertex ''v''
         nrOfVerticesProcessed += sum(|Q<sub>i</sub>|, i = 0 to p - 1)
         deliver all messages to neighbors of vertices in Q
         receive messages for local vertices V
         remove all vertices in Q
         '''foreach''' message (''u, v'') received:
             '''if''' --δ[v] = 0
                 add ''v'' to ''Q''
     '''while''' global size of ''Q'' > 0
 
     '''return''' localOrder
The communication cost depends heavily on the given graph partition. As for runtime, on a [[CRCW PRAM|CRCW-PRAM]] model that allows fetch-and-decrement in constant time, this algorithm runs in <math display="inline">\mathcal{O} \left(\frac{m + n}{p} + D (\Delta + \log n)\right)</math>, where {{mvar|D}} is again the longest path in {{mvar|G}} and {{mvar|Δ}} the maximum degree.{{r|SMDD}}