==Granularity==

===Bit-level parallelism===
{{main|Bit-level parallelism}}
[[File:Taiwania 3 Supercomputer.jpg|thumb|Taiwania 3 of [[Taiwan]], a parallel supercomputer that was used in [[COVID-19]] research]]
From the advent of [[very-large-scale integration]] (VLSI) computer-chip fabrication technology in the 1970s until about 1986, speed-up in computer architecture was driven by doubling [[Word (data type)|computer word size]] – the amount of information the processor can manipulate per cycle.<ref>{{cite book|last1=Culler|first1=David|last2=Singh|first2=Jaswinder Pal|title=Parallel computer architecture|year=1997|publisher=Morgan Kaufmann|location=San Francisco|isbn=978-1-55860-343-1|page=15}}</ref> Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the length of the word. For example, where an [[8-bit computing|8-bit]] processor must add two [[16-bit computing|16-bit]] [[integer]]s, it must first add the 8 lower-order bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the [[carry bit]] from the lower-order addition; thus, an 8-bit processor requires two instructions to complete a single operation, where a 16-bit processor would complete the operation with a single instruction.

Historically, [[4-bit computing|4-bit]] microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit processors, which were a standard in general-purpose computing for two decades. Not until the early 2000s, with the advent of [[x86-64]] architectures, did [[64-bit computing|64-bit]] processors become commonplace.
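To make the add-with-carry sequence concrete, the following C sketch models a 16-bit addition carried out with 8-bit operations; the function name and the carry-detection idiom are illustrative assumptions, not a description of any particular instruction set.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of a 16-bit addition performed with 8-bit
 * operations, as on an 8-bit processor: one plain ADD for the
 * low-order bytes, then one ADC (add-with-carry) for the high bytes. */
static uint16_t add16_on_8bit(uint16_t a, uint16_t b)
{
    uint8_t a_lo = a & 0xFF, a_hi = a >> 8;
    uint8_t b_lo = b & 0xFF, b_hi = b >> 8;

    uint8_t lo    = (uint8_t)(a_lo + b_lo);         /* ADD: low-order bytes     */
    uint8_t carry = (lo < a_lo) ? 1 : 0;            /* carry bit set by the ADD */
    uint8_t hi    = (uint8_t)(a_hi + b_hi + carry); /* ADC: high bytes + carry  */

    return (uint16_t)(((uint16_t)hi << 8) | lo);    /* two adds for one 16-bit op */
}

int main(void)
{
    printf("%u\n", (unsigned)add16_on_8bit(0x01FF, 0x0001)); /* prints 512 (0x0200) */
    return 0;
}
</syntaxhighlight>

A 16-bit processor would perform the same addition with a single instruction.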
===Instruction-level parallelism===
{{main|Instruction-level parallelism}}
[[File:Nopipeline.png|thumb|300px|A canonical processor without an [[Instruction pipelining|instruction pipeline]]. It takes five clock cycles to complete one instruction, so the processor achieves subscalar performance ({{nobreak|1=IPC = 0.2 < 1}}).]]
A computer program is, in essence, a stream of instructions executed by a processor. Without instruction-level parallelism, a processor issues fewer than one [[Instructions per cycle|instruction per clock cycle]] on average ({{nobreak|IPC < 1}}). These processors are known as ''subscalar'' processors. The instructions can be [[Out-of-order execution|re-ordered]] and combined into groups that are then executed in parallel without changing the result of the program. This is known as instruction-level parallelism. Advances in instruction-level parallelism dominated computer architecture from the mid-1980s until the mid-1990s.<ref>Culler et al. p. 15.</ref>

[[File:Fivestagespipeline.png|thumb|300px|A canonical five-stage [[Instruction pipelining|pipelined]] processor. In the best case, it completes one instruction per clock cycle, so the processor achieves scalar performance ({{nobreak|1=IPC = 1}}).]]
All modern processors have multi-stage [[Instruction pipelining|instruction pipelines]]. Each stage in the pipeline corresponds to a different action the processor performs on the instruction in that stage; a processor with an ''N''-stage pipeline can have up to ''N'' different instructions at different stages of completion and thus can issue one instruction per clock cycle ({{nobreak|1=IPC = 1}}). These processors are known as ''scalar'' processors. The canonical example of a pipelined processor is a [[RISC]] processor, with five stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and register write back (WB). The [[Pentium 4]] processor had a 35-stage pipeline.<ref>[[Yale Patt|Patt, Yale]] (April 2004). "[http://users.ece.utexas.edu/~patt/Videos/talk_videos/cmu_04-29-04.wmv The Microprocessor Ten Years From Now: What Are The Challenges, How Do We Meet Them?]" {{webarchive|url=https://web.archive.org/web/20080414141000/http://users.ece.utexas.edu/~patt/Videos/talk_videos/cmu_04-29-04.wmv |date=2008-04-14 }} (wmv). Distinguished Lecturer talk at [[Carnegie Mellon University]]. Retrieved November 7, 2007.</ref>

[[File:Superscalarpipeline.svg|thumb|300px|A canonical five-stage [[Instruction pipelining|pipelined]] processor with two execution units. In the best case, it completes two instructions per clock cycle, so the processor achieves superscalar performance ({{nobreak|1=IPC = 2 > 1}}).]]
Most modern processors also have multiple [[execution unit]]s. They usually combine this feature with pipelining and thus can issue more than one instruction per clock cycle ({{nobreak|IPC > 1}}). These processors are known as ''[[superscalar]]'' processors. Superscalar processors differ from [[multi-core processor]]s in that the several execution units are not entire processors (i.e. processing units). Instructions can be grouped together only if there is no [[data dependency]] between them. [[Scoreboarding]] and the [[Tomasulo algorithm]] (which is similar to scoreboarding but makes use of [[register renaming]]) are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism.

===Task parallelism===
{{main|Task parallelism}}
Task parallelism is the characteristic of a parallel program that "entirely different calculations can be performed on either the same or different sets of data".<ref name=Culler124>Culler et al. p. 124.</ref> This contrasts with data parallelism, where the same calculation is performed on the same or different sets of data. Task parallelism involves the decomposition of a task into sub-tasks and then allocating each sub-task to a processor for execution. The processors then execute these sub-tasks concurrently and often cooperatively (a minimal threaded sketch appears at the end of this section). Task parallelism does not usually scale with the size of a problem.<ref name=Culler125>Culler et al. p. 125.</ref>

===Superword level parallelism===
Superword level parallelism is a [[Automatic vectorization|vectorization]] technique based on [[loop unwinding|loop unrolling]] and basic block vectorization. It is distinct from loop vectorization algorithms in that it can exploit [[Parallelism (computing)|parallelism]] of [[inline code]], such as code that manipulates coordinates or color channels, or loops unrolled by hand.<ref>{{cite web|last1=Larsen|first1=Samuel|last2=Amarasinghe|first2=Saman|title=Exploiting Superword Level Parallelism with Multimedia Instruction Sets|url=http://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf}}</ref>
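As a rough illustration of the straight-line code a superword-level vectorizer targets (the type and function below are hypothetical, not taken from the cited paper), the four byte additions in this basic block are isomorphic and mutually independent, so an SLP pass can pack them into a single SIMD instruction:

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical example: a basic block with four isomorphic,
 * independent scalar additions on color channels. A superword-level
 * parallelism (SLP) pass can merge them into one 4-lane packed add. */
typedef struct { uint8_t r, g, b, a; } Pixel;

Pixel blend_add(Pixel p, Pixel q)
{
    Pixel out;
    out.r = (uint8_t)(p.r + q.r);  /* no data dependencies among   */
    out.g = (uint8_t)(p.g + q.g);  /* these four adds, so they can */
    out.b = (uint8_t)(p.b + q.b);  /* execute as a single vector   */
    out.a = (uint8_t)(p.a + q.a);  /* instruction                  */
    return out;
}
</syntaxhighlight>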
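Finally, the threaded sketch promised in the task-parallelism discussion above: a minimal example using [[POSIX Threads]] in which two ''different'' calculations (a sum and a minimum) run concurrently over the same data set. Under data parallelism, by contrast, each thread would run the same calculation on a different part of the data. The task decomposition here is illustrative only.

<syntaxhighlight lang="c">
#include <pthread.h>
#include <stdio.h>

/* Task parallelism: two different sub-tasks, each assigned to its own
 * thread, operate concurrently on the same data set. */
enum { N = 8 };
static const int data[N] = {7, 2, 9, 4, 1, 8, 3, 6};

static long sum;     /* written only by sum_task */
static int  min_val; /* written only by min_task */

static void *sum_task(void *arg)   /* sub-task 1: sum the data     */
{
    (void)arg;
    for (int i = 0; i < N; i++) sum += data[i];
    return NULL;
}

static void *min_task(void *arg)   /* sub-task 2: find the minimum */
{
    (void)arg;
    min_val = data[0];
    for (int i = 1; i < N; i++)
        if (data[i] < min_val) min_val = data[i];
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, NULL); /* run both sub-tasks   */
    pthread_create(&t2, NULL, min_task, NULL); /* concurrently         */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("sum=%ld min=%d\n", sum, min_val);  /* prints: sum=40 min=1 */
    return 0;
}
</syntaxhighlight>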