==Background==
Traditionally, [[computer software]] has been written for [[serial computation]]. To solve a problem, an [[algorithm]] is constructed and implemented as a serial stream of instructions. These instructions are executed on a [[central processing unit]] on one computer. Only one instruction may execute at a time—after that instruction is finished, the next one is executed.<ref name="llnltut">{{cite web |author=Barney, Blaise |title=Introduction to Parallel Computing |url=http://www.llnl.gov/computing/tutorials/parallel_comp/ |access-date=2007-11-09 |publisher=Lawrence Livermore National Laboratory |language=en-US}}</ref>

Parallel computing, on the other hand, uses multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others. The processing elements can be diverse and include resources such as a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above.<ref name="llnltut" />

Historically, parallel computing was used for scientific computing and the simulation of scientific problems, particularly in the natural and [[engineering sciences]], such as [[meteorology]]. This led to the design of parallel hardware and software, as well as [[high performance computing]].<ref>{{Cite book|title=Parallel Programming: for Multicore and Cluster Systems|author=Thomas Rauber |author2=Gudula Rünger|publisher=Springer Science & Business Media|year=2013|isbn=9783642378010|page=1}}</ref>

[[Frequency scaling]] was the dominant reason for improvements in [[computer performance]] from the mid-1980s until 2004. The [[Run time (program lifecycle phase)|runtime]] of a program is equal to the number of instructions multiplied by the average time per instruction. Holding everything else constant, increasing the clock frequency decreases the average time it takes to execute an instruction. An increase in frequency thus decreases runtime for all [[CPU bound|compute-bound]] programs.<ref>{{cite book|last=Hennessy|first=John L.|title=Computer architecture / a quantitative approach.|year=2002|publisher=International Thomson|location=San Francisco, Calif.|isbn=978-1-55860-724-8|edition=3rd|author2=Patterson, David A.|page=43}}</ref> However, power consumption ''P'' by a chip is given by the equation ''P'' = ''C'' × ''V''<sup>2</sup> × ''F'', where ''C'' is the [[capacitance]] being switched per clock cycle (proportional to the number of transistors whose inputs change), ''V'' is [[voltage]], and ''F'' is the processor frequency (cycles per second).<ref>{{cite book|last=Rabaey|first=Jan M.|title=Digital integrated circuits : a design perspective|year=1996|publisher=Prentice-Hall|location=Upper Saddle River, N.J.|isbn=978-0-13-178609-7|page=235}}</ref> Increases in frequency increase the amount of power used in a processor.
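As an illustrative calculation (not drawn from the cited sources), holding ''C'' and ''V'' fixed makes dynamic power scale linearly with frequency, so a 20% higher clock rate costs roughly 20% more power:
: <math>\frac{P_\text{new}}{P_\text{old}} = \frac{C \times V^2 \times (1.2\,F)}{C \times V^2 \times F} = 1.2.</math>
In practice the cost is even higher, because higher frequencies typically also require a higher supply voltage, which enters the equation quadratically.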
Increasing processor power consumption led ultimately to [[Intel]]'s May 8, 2004 cancellation of its [[Tejas and Jayhawk]] processors, which is generally cited as the end of frequency scaling as the dominant computer architecture paradigm.<ref>{{cite news|last=Flynn|first=Laurie J.|title=Intel Halts Development Of 2 New Microprocessors|url=https://www.nytimes.com/2004/05/08/business/08chip.html?ex=1399348800&en=98cc44ca97b1a562&ei=5007|access-date=5 June 2012|newspaper=New York Times|date=8 May 2004}}</ref>

To deal with the problem of power consumption and overheating, the major [[central processing unit]] (CPU or processor) manufacturers started to produce power-efficient processors with multiple cores. The core is the computing unit of the processor, and in multi-core processors each core is independent and can access the same memory concurrently. [[Multi-core processor]]s have brought parallel computing to [[desktop computers]]. Thus, parallelization of serial programs has become a mainstream programming task. In 2012, quad-core processors became standard for [[desktop computers]], while [[Server (computing)|servers]] had processors with 10 or more cores. By 2023, some processors had over a hundred cores. Some designs have a mix of performance and efficiency cores (such as [[ARM big.LITTLE|ARM's big.LITTLE]] design) due to thermal and design constraints.<ref>{{Cite book|title=Parallel Programming: for Multicore and Cluster Systems|author=Thomas Rauber |author2=Gudula Rünger|publisher=Springer Science & Business Media|year=2013|isbn=9783642378010|page=2}}</ref>{{Citation needed|date=March 2023}} From [[Moore's law]] it can be predicted that the number of cores per processor will double every 18–24 months.

An [[operating system]] can ensure that different tasks and user programs are run in parallel on the available cores. However, for a serial software program to take full advantage of the multi-core architecture, the programmer needs to restructure and parallelize the code. A speed-up of application software runtime will no longer be achieved through frequency scaling; instead, programmers will need to parallelize their software code to take advantage of the increasing computing power of multicore architectures.<ref>{{Cite book|title=Parallel Programming: for Multicore and Cluster Systems|author=Thomas Rauber |author2=Gudula Rünger|publisher=Springer Science & Business Media|year=2013|isbn=9783642378010|page=3}}</ref>

=== Relevant laws ===
[[File:AmdahlsLaw.svg|right|thumbnail|300px|A graphical representation of [[Amdahl's law]]. The law demonstrates the theoretical maximum [[speedup]] of an overall system and the concept of diminishing returns. If exactly 50% of the work can be parallelized, the best possible speedup is 2 times. If 95% of the work can be parallelized, the best possible speedup is 20 times. According to the law, even with an infinite number of processors, the speedup is constrained by the unparallelizable portion.]]
[[File:Optimizing-different-parts.svg|thumb|300px|Assume that a task has two independent parts, ''A'' and ''B''. Part ''B'' takes roughly 25% of the time of the whole computation. By working very hard, one may be able to make this part 5 times faster, but this only reduces the time for the whole computation by a little. In contrast, one may need to perform less work to make part ''A'' twice as fast.
This will make the computation much faster than by optimizing part ''B'', even though part ''B'''s speedup is greater by ratio (5 times versus 2 times).]]
{{Main|Amdahl's law}}
Optimally, the [[speedup]] from parallelization would be linear—doubling the number of processing elements should halve the runtime, and doubling it a second time should again halve the runtime. However, very few parallel algorithms achieve optimal speedup. Most of them have a near-linear speedup for small numbers of processing elements, which flattens out into a constant value for large numbers of processing elements. The maximum potential speedup of an overall system can be calculated by [[Amdahl's law]].<ref name=":02">{{Citation |last=Bakos |first=Jason D. |title=Chapter 2 - Multicore and data-level optimization: OpenMP and SIMD |date=2016-01-01 |work=Embedded Systems |pages=49–103 |editor-last=Bakos |editor-first=Jason D. |url=https://linkinghub.elsevier.com/retrieve/pii/B978012800342800002X |access-date=2024-11-18 |place=Boston |publisher=Morgan Kaufmann |doi=10.1016/b978-0-12-800342-8.00002-x |isbn=978-0-12-800342-8|url-access=subscription }}</ref>

Amdahl's Law indicates that optimal performance improvement is achieved by balancing enhancements to both parallelizable and non-parallelizable components of a task. Furthermore, it reveals that increasing the number of processors yields diminishing returns, with negligible speedup gains beyond a certain point.<ref>{{Cite book |title=The Art of Multiprocessor Programming, Revised Reprint |date=22 May 2012 |publisher=Morgan Kaufmann |isbn=9780123973375}}</ref><ref>{{Cite book |title=Programming Many-Core Chips |isbn=9781441997395 |last1=Vajda |first1=András |date=10 June 2011 |publisher=Springer}}</ref> Amdahl's Law has limitations, including its assumption of a fixed workload, its neglect of [[inter-process communication]] and [[Synchronization (computer science)|synchronization]] overheads, and its focus on the computational aspect while ignoring extrinsic factors such as data persistence, I/O operations, and memory access overheads.<ref>{{Cite book |last=Amdahl |first=Gene M. |chapter=Validity of the single processor approach to achieving large scale computing capabilities |date=1967-04-18 |title=Proceedings of the April 18-20, 1967, spring joint computer conference on - AFIPS '67 (Spring) |chapter-url=https://dl.acm.org/doi/10.1145/1465482.1465560 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=483–485 |doi=10.1145/1465482.1465560 |isbn=978-1-4503-7895-6}}</ref><ref name=":1">{{Cite book |title=Computer Architecture: A Quantitative Approach |date=2003 |publisher=Morgan Kaufmann |isbn=978-8178672663}}</ref><ref>{{Cite book |title=Parallel Computer Architecture A Hardware/Software Approach |date=1999 |publisher=Elsevier Science |isbn=9781558603431}}</ref>
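In symbols, a standard statement of the law (consistent with the figures above) is: if a fraction ''p'' of a program's work can be parallelized across ''N'' processing elements while the remaining fraction 1 − ''p'' must run serially, the speedup is
: <math>S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}.</math>
For example, with ''p'' = 0.95 the speedup can never exceed 1/(1 − 0.95) = 20, matching the 20-times limit illustrated above.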
[[Gustafson's law]] and [[Neil J. Gunther#Universal Scalability Law|Universal Scalability Law]] give a more realistic assessment of parallel performance.<ref>{{cite book |last1=McCool |first1=Michael |title=Structured Parallel Programming: Patterns for Efficient Computation |last2=Reinders |first2=James |last3=Robison |first3=Arch |publisher=Elsevier |year=2013 |isbn=978-0-12-415993-8 |pages=61}}</ref><ref>{{Cite book |last=Gunther |first=Neil |title=Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services |year=2007 |isbn=978-3540261384}}</ref>
[[File:Gustafson.png|thumb|right|300px|A graphical representation of [[Gustafson's law]]]]

===Dependencies===
Understanding [[data dependency|data dependencies]] is fundamental in implementing [[parallel algorithm]]s. No program can run more quickly than the longest chain of dependent calculations (known as the [[Critical path method|critical path]]), since calculations that depend upon prior calculations in the chain must be executed in order. However, most algorithms do not consist of just a long chain of dependent calculations; there are usually opportunities to execute independent calculations in parallel.

Let ''P''<sub>''i''</sub> and ''P''<sub>''j''</sub> be two program segments. Bernstein's conditions<ref>{{cite journal|last=Bernstein|first=Arthur J.|title=Analysis of Programs for Parallel Processing|journal=IEEE Transactions on Electronic Computers|date=1 October 1966|volume=EC-15|issue=5|pages=757–763|doi=10.1109/PGEC.1966.264565}}</ref> describe when the two are independent and can be executed in parallel. For ''P''<sub>''i''</sub>, let ''I''<sub>''i''</sub> be all of the input variables and ''O''<sub>''i''</sub> the output variables, and likewise for ''P''<sub>''j''</sub>. ''P''<sub>''i''</sub> and ''P''<sub>''j''</sub> are independent if they satisfy
: <math>I_j \cap O_i = \varnothing,</math>
: <math>I_i \cap O_j = \varnothing,</math>
: <math>O_i \cap O_j = \varnothing.</math>
Violation of the first condition introduces a flow dependency, corresponding to the first segment producing a result used by the second segment. The second condition represents an anti-dependency, when the second segment produces a variable needed by the first segment. The third and final condition represents an output dependency: when two segments write to the same location, the result comes from the logically last executed segment.<ref>{{cite book|last=Roosta|first=Seyed H.|title=Parallel processing and parallel algorithms : theory and computation|year=2000|publisher=Springer|location=New York, NY [u.a.]|isbn=978-0-387-98716-3|page=114}}</ref>

Consider the following functions, which demonstrate several kinds of dependencies:

 1: function Dep(a, b)
 2:    c := a * b
 3:    d := 3 * c
 4: end function

In this example, instruction 3 cannot be executed before (or even in parallel with) instruction 2, because instruction 3 uses a result from instruction 2. It violates condition 1, and thus introduces a flow dependency.

 1: function NoDep(a, b)
 2:    c := a * b
 3:    d := 3 * b
 4:    e := a + b
 5: end function

In this example, there are no dependencies between the instructions, so they can all be run in parallel.

Bernstein's conditions do not allow memory to be shared between different processes. For that, some means of enforcing an ordering between accesses is necessary, such as [[Semaphore (programming)|semaphores]], [[Barrier (computer science)|barriers]] or some other [[Synchronization (computer science)|synchronization method]].
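As an illustrative sketch (not drawn from the cited sources), Bernstein's conditions can be checked mechanically once the input and output variable sets of two segments are known. The following C program, with hypothetical helper names such as <code>disjoint</code> and <code>independent</code>, simply tests the three intersections for emptiness:

 #include <stdbool.h>
 #include <stdio.h>
 #include <string.h>
 /* Returns true if the two lists of variable names share no element. */
 static bool disjoint(const char **a, int na, const char **b, int nb) {
     for (int i = 0; i < na; i++)
         for (int j = 0; j < nb; j++)
             if (strcmp(a[i], b[j]) == 0)
                 return false;
     return true;
 }
 /* Bernstein's conditions: Ij ∩ Oi, Ii ∩ Oj and Oi ∩ Oj must all be empty. */
 static bool independent(const char **Ii, int ni, const char **Oi, int no_i,
                         const char **Ij, int nj, const char **Oj, int no_j) {
     return disjoint(Ij, nj, Oi, no_i) &&
            disjoint(Ii, ni, Oj, no_j) &&
            disjoint(Oi, no_i, Oj, no_j);
 }
 int main(void) {
     /* Instructions 2 and 3 of Dep(a, b):  c := a * b;  d := 3 * c */
     const char *I2[] = {"a", "b"}, *O2[] = {"c"};
     const char *I3[] = {"c"},      *O3[] = {"d"};
     puts(independent(I2, 2, O2, 1, I3, 1, O3, 1) ? "independent" : "dependent");
     return 0;
 }

Applied to instructions 2 and 3 of <code>Dep</code> above, the check prints "dependent", because the output set of instruction 2 and the input set of instruction 3 both contain <code>c</code>.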
===Race conditions, mutual exclusion, synchronization, and parallel slowdown===
Subtasks in a parallel program are often called [[Thread (computing)|threads]]. Some parallel computer architectures use smaller, lightweight versions of threads known as [[Fiber (computer science)|fibers]], while others use bigger versions known as [[Process (computing)|processes]]. However, "threads" is generally accepted as a generic term for subtasks.<ref>{{cite web | url = https://msdn.microsoft.com/en-us/library/windows/desktop/ms684841(v=vs.85).aspx | title = Processes and Threads | date = 2018 | website = Microsoft Developer Network | publisher = Microsoft Corp. | access-date = 2018-05-10 }}</ref> Threads will often need [[Synchronization (computer science)|synchronized]] access to an [[Object (computer science)|object]] or other [[Resource management (computing)|resource]], for example when they must update a [[Variable (programming)|variable]] that is shared between them. Without synchronization, the instructions of the two threads may be interleaved in any order. For example, consider the following program:

{| class="wikitable"
|-
! Thread A !! Thread B
|-
| 1A: Read variable V || 1B: Read variable V
|-
| 2A: Add 1 to variable V || 2B: Add 1 to variable V
|-
| 3A: Write back to variable V || 3B: Write back to variable V
|}

If instruction 1B is executed between 1A and 3A, or if instruction 1A is executed between 1B and 3B, the program will produce incorrect data. This is known as a [[race condition]]. The programmer must use a [[Lock (computer science)|lock]] to provide [[mutual exclusion]]. A lock is a programming language construct that allows one thread to take control of a variable and prevent other threads from reading or writing it, until that variable is unlocked. The thread holding the lock is free to execute its [[critical section]] (the section of a program that requires exclusive access to some variable), and to unlock the data when it is finished. Therefore, to guarantee correct program execution, the above program can be rewritten to use locks:

{| class="wikitable"
|-
! Thread A !! Thread B
|-
| 1A: Lock variable V || 1B: Lock variable V
|-
| 2A: Read variable V || 2B: Read variable V
|-
| 3A: Add 1 to variable V || 3B: Add 1 to variable V
|-
| 4A: Write back to variable V || 4B: Write back to variable V
|-
| 5A: Unlock variable V || 5B: Unlock variable V
|}

One thread will successfully lock variable V, while the other thread will be [[Software lockout|locked out]]—unable to proceed until V is unlocked again. This guarantees correct execution of the program.
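For illustration, a minimal sketch of the same pattern in C using [[POSIX Threads]] (one possible rendering, not the only one) protects the shared variable with a mutex so that the read-add-write sequence executes as a critical section:

 #include <pthread.h>
 #include <stdio.h>
 static long v = 0;                                          /* shared variable V */
 static pthread_mutex_t v_lock = PTHREAD_MUTEX_INITIALIZER;  /* lock protecting V */
 static void *increment_many(void *arg) {
     (void)arg;
     for (int i = 0; i < 100000; i++) {
         pthread_mutex_lock(&v_lock);    /* lock variable V */
         v = v + 1;                      /* read V, add 1, write back */
         pthread_mutex_unlock(&v_lock);  /* unlock variable V */
     }
     return NULL;
 }
 int main(void) {
     pthread_t a, b;                     /* Thread A and Thread B */
     pthread_create(&a, NULL, increment_many, NULL);
     pthread_create(&b, NULL, increment_many, NULL);
     pthread_join(a, NULL);
     pthread_join(b, NULL);
     printf("%ld\n", v);  /* always 200000; without the mutex the result may be smaller */
     return 0;
 }

Removing the two mutex calls reintroduces the race condition described above, since the increments of the two threads may then interleave.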
Locks may be necessary to ensure correct program execution when threads must serialize access to resources, but their use can greatly slow a program and may affect its [[Software quality#Reliability|reliability]].<ref>{{cite web | url = http://www.developforperformance.com/ThreadSafetyForPerformance.html | title = Thread Safety for Performance | last = Krauss | first = Kirk J | date = 2018 | website = Develop for Performance | access-date = 2018-05-10 | archive-date = 2018-05-13 | archive-url = https://web.archive.org/web/20180513081315/http://www.developforperformance.com/ThreadSafetyForPerformance.html | url-status = dead }}</ref>

Locking multiple variables using [[Atomic operation|non-atomic]] locks introduces the possibility of program [[Deadlock (computer science)|deadlock]]. An [[atomic lock]] locks multiple variables all at once. If it cannot lock all of them, it does not lock any of them. If two threads each need to lock the same two variables using non-atomic locks, it is possible that one thread will lock one of them and the second thread will lock the second variable. In such a case, neither thread can complete, and deadlock results.<ref>{{cite book | url = http://www.informit.com/articles/article.aspx?p=25193 | title = Introduction to Operating System Deadlocks | last = Tanenbaum | first = Andrew S. | date = 2002-02-01 | website = Informit | publisher = Pearson Education, Informit | access-date = 2018-05-10 }}</ref>
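The all-or-nothing behaviour described above can be sketched as a C fragment using POSIX threads (an illustrative pattern with a hypothetical helper name <code>lock_both</code>, not a standard library routine): if the second mutex cannot be acquired immediately, the first is released and the attempt is retried, so a thread never ends up holding only one of the two locks:

 #include <pthread.h>
 #include <sched.h>
 /* Acquire both mutexes or neither, avoiding the deadlock scenario in which
    each thread holds one lock and waits forever for the other. */
 void lock_both(pthread_mutex_t *x, pthread_mutex_t *y) {
     for (;;) {
         pthread_mutex_lock(x);
         if (pthread_mutex_trylock(y) == 0)
             return;                     /* both locks held */
         pthread_mutex_unlock(x);        /* could not get both: back off and retry */
         sched_yield();                  /* give the other thread a chance to finish */
     }
 }
 void unlock_both(pthread_mutex_t *x, pthread_mutex_t *y) {
     pthread_mutex_unlock(y);
     pthread_mutex_unlock(x);
 }

Another common remedy is to require every thread to acquire the locks in the same global order, so that a circular wait cannot arise.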
Many parallel programs require that their subtasks [[Synchronization (computer science)|act in synchrony]]. This requires the use of a [[barrier (computer science)|barrier]]. Barriers are typically implemented using a lock or a [[Semaphore (programming)|semaphore]].<ref>{{cite web | url = https://www.embedded.com/design/operating-systems/4440752/Synchronization-internals----the-semaphore | title = Synchronization internals – the semaphore | last = Cecil | first = David | date = 2015-11-03 | website = Embedded | publisher = AspenCore | access-date = 2018-05-10 }}</ref> One class of algorithms, known as [[lock-free and wait-free algorithms]], altogether avoids the use of locks and barriers. However, this approach is generally difficult to implement and requires correctly designed data structures.<ref>{{cite web | url = http://preshing.com/20120612/an-introduction-to-lock-free-programming/ | title = An Introduction to Lock-Free Programming | last = Preshing | first = Jeff | date = 2012-06-08 | website = Preshing on Programming | access-date = 2018-05-10 }}</ref>

Not all parallelization results in speed-up. Generally, as a task is split up into more and more threads, those threads spend an ever-increasing portion of their time communicating with each other or waiting on each other for access to resources.<ref>{{cite web | url = https://stackoverflow.com/questions/806569/whats-the-opposite-of-embarrassingly-parallel | title = What's the opposite of "embarrassingly parallel"? | website = StackOverflow | access-date = 2018-05-10 }}</ref><ref>{{cite web | url = https://stackoverflow.com/questions/1970345/what-is-thread-contention | title = What is thread contention? | last = Schwartz | first = David | date = 2011-08-15 | website = StackOverflow | access-date = 2018-05-10 }}</ref> Once the overhead from resource contention or communication dominates the time spent on other computation, further parallelization (that is, splitting the workload over even more threads) increases rather than decreases the amount of time required to finish. This problem, known as [[parallel slowdown]],<ref>{{cite web | url = https://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown | title = Why a simple test can get parallel slowdown | last = Kukanov | first = Alexey | date = 2008-03-04 | access-date = 2015-02-15 }}</ref> can be improved in some cases by software analysis and redesign.<ref>{{cite web | url = http://www.developforperformance.com/ThreadingForPerformance.html | title = Threading for Performance | last = Krauss | first = Kirk J | date = 2018 | website = Develop for Performance | access-date = 2018-05-10 | archive-date = 2018-05-13 | archive-url = https://web.archive.org/web/20180513081501/http://www.developforperformance.com/ThreadingForPerformance.html | url-status = dead }}</ref>

===Fine-grained, coarse-grained, and embarrassing parallelism===
Applications are often classified according to how often their subtasks need to synchronize or communicate with each other. An application exhibits fine-grained parallelism if its subtasks must communicate many times per second; it exhibits coarse-grained parallelism if they do not communicate many times per second; and it exhibits [[Embarrassingly parallel|embarrassing parallelism]] if they rarely or never have to communicate. Embarrassingly parallel applications are considered the easiest to parallelize.

===Flynn's taxonomy===
{{Main|Flynn's taxonomy}}
[[Michael J. Flynn]] created one of the earliest classification systems for parallel (and sequential) computers and programs, now known as [[Flynn's taxonomy]]. Flynn classified programs and computers by whether they were operating using a single set or multiple sets of instructions, and whether or not those instructions were using a single set or multiple sets of data.

{{Flynn's taxonomy}}

The single-instruction-single-data (SISD) classification is equivalent to an entirely sequential program. The single-instruction-multiple-data (SIMD) classification is analogous to doing the same operation repeatedly over a large data set. This is commonly done in [[signal processing]] applications. Multiple-instruction-single-data (MISD) is a rarely used classification. While computer architectures to deal with this were devised (such as [[systolic array]]s), few applications that fit this class materialized. Multiple-instruction-multiple-data (MIMD) programs are by far the most common type of parallel programs.

According to [[David A. Patterson (scientist)|David A. Patterson]] and [[John L. Hennessy]], "Some machines are hybrids of these categories, of course, but this classic model has survived because it is simple, easy to understand, and gives a good first approximation. It is also—perhaps because of its understandability—the most widely used scheme."<ref>Patterson and Hennessy, p. 748.</ref>
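As a small illustrative sketch (not drawn from the cited sources), the SIMD idea of applying one operation across a whole data set corresponds to loops like the following C fragment, which modern compilers commonly map onto SIMD (vector) instructions:

 #include <stddef.h>
 /* One instruction stream, many data elements: the same multiply-add is
    applied to every sample, as in a typical signal-processing gain stage. */
 void scale_and_offset(float *out, const float *in, float gain, float bias, size_t n) {
     for (size_t i = 0; i < n; i++)
         out[i] = gain * in[i] + bias;   /* identical operation on each element */
 }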