==Approaches==

===Static distribution with full knowledge of the tasks: [[prefix sum]]===
If the tasks are independent of each other, if their respective execution times are known, and if the tasks can be subdivided, there is a simple and optimal algorithm.

[[File:Load_Balancing_divisible_tasks.png|thumb|Load balancing algorithm depending on divisibility of tasks]]

By dividing the tasks in such a way as to give the same amount of computation to each processor, all that remains to be done is to group the results together. Using a [[prefix sum]] algorithm, this division can be calculated in [[logarithmic time]] with respect to the number of processors.{{citation needed|date=October 2022}}

If, however, the tasks cannot be subdivided (i.e., they are [[Linearizability|atomic]]), optimizing task assignment is a difficult problem, but it is still possible to approximate a relatively fair distribution of tasks, provided that the size of each of them is much smaller than the total computation performed by each of the nodes.<ref name="Sequential and parallel algorithms"/>

Most of the time, however, the execution time of a task is unknown and only rough approximations are available. Although this algorithm is particularly efficient, it is not viable for those scenarios.

===Static load distribution without prior knowledge===
Even if the execution time is not known in advance at all, static load distribution is always possible.

====[[Round-robin scheduling]]====
In a round-robin algorithm, the first request is sent to the first server, the next to the second, and so on down to the last. Then it starts again, assigning the next request to the first server, and so on. This algorithm can be weighted so that the most powerful units receive the largest number of requests and receive them first.

====Randomized static====
Randomized static load balancing is simply a matter of randomly assigning tasks to the different servers. This method works quite well. If, on the other hand, the number of tasks is known in advance, it is even more efficient to calculate a random permutation in advance. This avoids communication costs for each assignment: there is no longer a need for a distribution master, because every processor knows which tasks are assigned to it. Even if the number of tasks is unknown, it is still possible to avoid communication by using a pseudo-random assignment generation known to all processors.

The performance of this strategy (measured in total execution time for a given fixed set of tasks) decreases with the maximum size of the tasks.

====Others====
There are other methods of assignment as well:
* Less work: assign more tasks to the servers that are currently doing less work (the method can also be weighted).
* Hash: allocate queries according to a [[hash table]].
* Power of Two Choices: pick two servers at random and choose the better of the two options (a short sketch follows this list).<ref>{{cite web |title=NGINX and the "Power of Two Choices" Load-Balancing Algorithm |url=https://www.nginx.com/blog/nginx-power-of-two-choices-load-balancing-algorithm/ |website=nginx.com |archive-url=https://web.archive.org/web/20191212194243/https://www.nginx.com/blog/nginx-power-of-two-choices-load-balancing-algorithm/ |archive-date=2019-12-12 |date=2018-11-12}}</ref><ref>{{cite web |title=Test Driving "Power of Two Random Choices" Load Balancing |url=https://www.haproxy.com/blog/power-of-two-load-balancing/ |website=haproxy.com |archive-url=https://web.archive.org/web/20190215173140/https://www.haproxy.com/blog/power-of-two-load-balancing/ |archive-date=2019-02-15 |date=2019-02-15}}</ref>
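The following is a minimal sketch of the "power of two choices" rule in Python: two servers are sampled at random and the request goes to the one with fewer outstanding requests. The <code>Server</code> class, the <code>active_requests</code> counter and the round of ten requests are assumptions made purely for illustration and do not correspond to any particular load balancer's API.

<syntaxhighlight lang="python">
import random


class Server:
    """Hypothetical server record that tracks its outstanding requests."""

    def __init__(self, name):
        self.name = name
        self.active_requests = 0


def choose_server(servers):
    """Power of two choices: sample two distinct servers at random and
    return the one that currently has fewer outstanding requests."""
    a, b = random.sample(servers, 2)
    return a if a.active_requests <= b.active_requests else b


# Usage sketch: dispatch ten requests across four servers.
servers = [Server(f"srv{i}") for i in range(4)]
for _ in range(10):
    target = choose_server(servers)
    target.active_requests += 1  # the request is handed to this server
    print(target.name, target.active_requests)
</syntaxhighlight>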
===Master-Worker Scheme===
[[Master/slave (technology)|Master-Worker]] schemes are among the simplest dynamic load balancing algorithms. A master distributes the workload to all workers (also sometimes referred to as "slaves"). Initially, all workers are idle and report this to the master. The master answers worker requests and distributes the tasks to them. When it has no more tasks to give, it informs the workers so that they stop asking for tasks.

The advantage of this system is that it distributes the burden very fairly. In fact, if one does not take into account the time needed for the assignment, the execution time would be comparable to the prefix sum seen above.

The problem with this algorithm is that it has difficulty adapting to a large number of processors because of the high volume of communication required. This lack of [[scalability]] makes it quickly inoperable in very large servers or very large parallel computers. The master acts as a [[Bottleneck (software)|bottleneck]].

[[File:Master-Worker_and_bottleneck.png|thumb|Master-Worker and bottleneck]]

However, the quality of the algorithm can be greatly improved by replacing the master with a task list that can be used by different processors. Although this algorithm is a little more difficult to implement, it promises much better scalability, although still insufficient for very large computing centers.
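As a rough, shared-memory illustration of the master-worker scheme, the sketch below uses a single task queue owned by the master and a sentinel value to tell workers that no tasks remain. The use of Python threads, the list of integer tasks and the placeholder computation are assumptions for the example; in practice the workers would be separate processors or nodes communicating with the master over a network.

<syntaxhighlight lang="python">
import queue
import threading


def worker(tasks, results):
    """Worker loop: repeatedly ask the master's queue for a task.
    A None task is the master's signal that no work is left."""
    while True:
        task = tasks.get()
        if task is None:  # the master has informed us to stop asking
            break
        results.put(task * task)  # placeholder computation


tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(4)]
for w in workers:
    w.start()

for t in range(20):  # the master distributes the workload
    tasks.put(t)
for _ in workers:  # one termination signal per worker
    tasks.put(None)
for w in workers:
    w.join()

print(sorted(results.get() for _ in range(20)))
</syntaxhighlight>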
===Non-hierarchical architecture, without knowledge of the system: [[work stealing]]===
Another technique to overcome scalability problems when the time needed for task completion is unknown is [[work stealing]].

The approach consists of assigning to each processor a certain number of tasks in a random or predefined manner, then allowing inactive processors to "steal" work from active or overloaded processors. Several implementations of this concept exist, defined by a task division model and by the rules determining the exchange between processors. While this technique can be particularly effective, it is difficult to implement because it is necessary to ensure that communication does not become the primary occupation of the processors instead of solving the problem.

In the case of atomic tasks, two main strategies can be distinguished: those where the lightly loaded processors offer their computing capacity to those with the highest load, and those where the most heavily loaded units wish to lighten the workload assigned to them. It has been shown<ref>{{cite journal |last1=Eager |first1=Derek L |last2=Lazowska |first2=Edward D |last3=Zahorjan |first3=John |title=A comparison of receiver-initiated and sender-initiated adaptive load sharing |journal=Performance Evaluation |date=1 March 1986 |volume=6 |issue=1 |pages=53–68 |doi=10.1016/0166-5316(86)90008-8 |language=en |issn=0166-5316}}</ref> that when the network is heavily loaded, it is more efficient for the least loaded units to offer their availability, and that when the network is lightly loaded, it is the overloaded processors that request support from the most inactive ones. This rule of thumb limits the number of exchanged messages.

In the case where one starts from a single large task that cannot be divided beyond an atomic level, there is a very efficient algorithm, "tree-shaped computation",<ref>{{cite journal |last1=Sanders |first1=Peter |title=Tree Shaped Computations as a Model for Parallel Applications |journal=Workshop on Application Based Load Balancing (Alv '98), München, 25. - 26. März 1998 - Veranst. Vom Sonderforschungsbereich 342 "Werkzeuge und Methoden für die Nutzung Paralleler Rechnerarchitekturen". Ed.: A. Bode |date=1998 |page=123 |doi=10.5445/ir/1000074497}}</ref> where the parent task is distributed in a work tree.

====Principle====
Initially, many processors have an empty task, except one that works sequentially on it. Idle processors issue requests randomly to other processors (not necessarily active). If the latter is able to subdivide the task it is working on, it does so by sending part of its work to the node making the request. Otherwise, it returns an empty task. This induces a [[tree structure]]. It is then necessary to send a termination signal to the parent processor when a subtask is completed, so that it, in turn, sends the message to its parent until the signal reaches the root of the tree. When the first processor, i.e. the root, has finished, a global termination message can be broadcast. In the end, it is necessary to assemble the results by going back up the tree.

====Efficiency====
The efficiency of such an algorithm is close to that of the prefix sum when the job-cutting and communication time is not too high compared to the work to be done. To avoid excessive communication costs, it is possible to imagine a list of jobs on [[shared memory]]. Therefore, a request is simply reading from a certain position on this shared memory at the request of the master processor.
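The sketch below illustrates the work stealing idea in a shared-memory setting: each worker keeps its own [[Double-ended queue|deque]] of tasks and, when it runs out, tries to take a task from the opposite end of a randomly chosen victim's deque. Everything here (the per-worker deques, Python threads standing in for processors, the placeholder computation, and the simplistic give-up rule instead of real termination detection) is an assumption made for illustration only.

<syntaxhighlight lang="python">
import collections
import random
import threading

NUM_WORKERS = 4
queues = [collections.deque() for _ in range(NUM_WORKERS)]  # one deque per worker
results = []
results_lock = threading.Lock()


def worker(my_id):
    """Process tasks from our own deque; when it is empty, try to steal
    one task from the opposite end of a randomly chosen victim's deque."""
    while True:
        try:
            task = queues[my_id].pop()  # local work, taken from the LIFO end
        except IndexError:
            victim = random.randrange(NUM_WORKERS)
            try:
                task = queues[victim].popleft()  # steal from the FIFO end
            except IndexError:
                # Simplification: give up after one failed steal attempt.
                # A real scheme keeps probing until global termination is detected.
                return
        with results_lock:
            results.append(task * task)  # placeholder computation


# All tasks start on worker 0, so the other workers must steal to get any work.
queues[0].extend(range(100))
threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results), "tasks completed")
</syntaxhighlight>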