==Hardware==

===Memory and communication===

Main memory in a parallel computer is either [[Shared memory (interprocess communication)|shared memory]] (shared between all processing elements in a single [[address space]]) or [[distributed memory]] (in which each processing element has its own local address space).<ref name=PH713>Patterson and Hennessy, p. 713.</ref> Distributed memory refers to the fact that the memory is logically distributed, but often implies that it is physically distributed as well. [[Distributed shared memory]] and [[memory virtualization]] combine the two approaches, where the processing element has its own local memory and access to the memory on non-local processors. Accesses to local memory are typically faster than accesses to non-local memory. On [[supercomputers]], a distributed shared memory space can be implemented using a programming model such as [[Partitioned global address space|PGAS]]. This model allows processes on one compute node to transparently access the remote memory of another compute node. All compute nodes are also connected to an external shared memory system via a high-speed interconnect such as [[InfiniBand]]; this external shared memory system is known as a [[burst buffer]], which is typically built from arrays of [[non-volatile memory]] physically distributed across multiple I/O nodes.
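In practice, the difference between the two memory models shows up in how data moves between processing elements. The following sketch is added here purely for illustration (it is not drawn from the article's sources) and assumes an available [[Message Passing Interface|MPI]] implementation: two processes with separate address spaces must exchange a value as an explicit message, whereas on a shared-memory machine the second process could simply load the value from the common address space.

<syntaxhighlight lang="c">
/* Illustrative sketch only: distributed-memory message passing with MPI.
 * Assumes an MPI implementation (compile with mpicc, run with mpirun -np 2). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                     /* exists only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 cannot dereference rank 0's memory directly; the value must
         * arrive as a message over the interconnect. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
</syntaxhighlight>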
[[File:Numa.svg|right|thumbnail|400px|A logical view of a [[non-uniform memory access]] (NUMA) architecture. Processors in one directory can access that directory's memory with less latency than they can access memory in the other directory.]]

Computer architectures in which each element of main memory can be accessed with equal [[Memory latency|latency]] and [[Bandwidth (computing)|bandwidth]] are known as [[uniform memory access]] (UMA) systems. Typically, that can be achieved only by a [[Shared memory (interprocess communication)|shared memory]] system, in which the memory is not physically distributed. A system that does not have this property is known as a [[non-uniform memory access]] (NUMA) architecture. Distributed memory systems have non-uniform memory access.

Computer systems make use of [[CPU cache|cache]]s—small and fast memories located close to the processor which store temporary copies of memory values (nearby in both the physical and logical sense). Parallel computer systems have difficulties with caches that may store the same value in more than one location, with the possibility of incorrect program execution. These computers require a [[cache coherency]] system, which keeps track of cached values and strategically purges them, thus ensuring correct program execution. [[Bus sniffing|Bus snooping]] is one of the most common methods for keeping track of which values are being accessed (and thus should be purged). Designing large, high-performance cache coherence systems is a very difficult problem in computer architecture. As a result, shared memory computer architectures do not scale as well as distributed memory systems do.<ref name=PH713/>

Processor–processor and processor–memory communication can be implemented in hardware in several ways, including via shared (either multiported or [[Multiplexing|multiplexed]]) memory, a [[crossbar switch]], a shared [[Bus (computing)|bus]], or an interconnect network in any of a myriad of [[Network topology|topologies]], including [[Star network|star]], [[Ring network|ring]], [[Tree (graph theory)|tree]], [[Hypercube graph|hypercube]], fat hypercube (a hypercube with more than one processor at a node), or [[Mesh networking|n-dimensional mesh]]. Parallel computers based on interconnect networks need some kind of [[routing]] to enable the passing of messages between nodes that are not directly connected. The medium used for communication between the processors is likely to be hierarchical in large multiprocessor machines.

===Classes of parallel computers===

Parallel computers can be roughly classified according to the level at which the hardware supports parallelism. This classification is broadly analogous to the distance between basic computing nodes. These are not mutually exclusive; for example, clusters of symmetric multiprocessors are relatively common.

====Multi-core computing====
{{main|Multi-core processor}}

A multi-core processor is a processor that includes multiple [[Central processing unit|processing units]] (called "cores") on the same chip. This processor differs from a [[superscalar]] processor, which includes multiple [[execution unit]]s and can issue multiple instructions per clock cycle from one instruction stream (thread); in contrast, a multi-core processor can issue multiple instructions per clock cycle from multiple instruction streams. [[IBM]]'s [[Cell (microprocessor)|Cell microprocessor]], designed for use in the [[Sony]] [[PlayStation 3]], is a prominent multi-core processor. Each core in a multi-core processor can potentially be superscalar as well—that is, on every clock cycle, each core can issue multiple instructions from one thread.

[[Simultaneous multithreading]] (of which Intel's [[Hyper-Threading]] is the best known) was an early form of pseudo-multi-coreism. A processor capable of simultaneous multithreading includes multiple execution units in the same processing unit—that is, it has a superscalar architecture—and can issue multiple instructions per clock cycle from ''multiple'' threads. [[Temporal multithreading]], on the other hand, includes a single execution unit in the same processing unit and can issue one instruction at a time from ''multiple'' threads.
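The sketch below is added here for illustration only and is not part of the article's sources; it shows what multiple instruction streams look like in software. Each [[POSIX Threads|POSIX thread]] is an independent instruction stream, and on a multi-core processor the operating system can schedule each thread on its own core. The thread count of four is an assumption made for the example, not a property of any particular processor.

<syntaxhighlight lang="c">
/* Illustrative sketch only: one thread per core, each an independent
 * instruction stream, summing its own strided slice of a shared array.
 * Compile with: cc -pthread sum.c */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define THREADS 4                 /* assumed: e.g. a quad-core chip */

static double data[N];
static double partial[THREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    double s = 0.0;
    for (long i = id; i < N; i += THREADS)   /* this thread's slice */
        s += data[i];
    partial[id] = s;              /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    pthread_t tid[THREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    for (long t = 0; t < THREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);

    double total = 0.0;
    for (long t = 0; t < THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("sum = %.1f\n", total);   /* prints 1000000.0 */
    return 0;
}
</syntaxhighlight>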
====Symmetric multiprocessing====
{{main|Symmetric multiprocessing}}

A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a [[bus (computing)|bus]].<ref name=HP549>Hennessy and Patterson, p. 549.</ref> [[Bus contention]] prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors.<ref>Patterson and Hennessy, p. 714.</ref> Because of the small size of the processors and the significant reduction in the requirements for bus bandwidth achieved by large caches, such symmetric multiprocessors are extremely cost-effective, provided that a sufficient amount of memory bandwidth exists.<ref name=HP549/>

====Distributed computing====
{{main|Distributed computing}}

A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable. The terms "[[concurrent computing]]", "parallel computing", and "distributed computing" have a lot of overlap, and no clear distinction exists between them.<ref>[[Distributed computing#CITEREFGhosh2007|Ghosh (2007)]], p. 10. [[Distributed computing#CITEREFKeidar2008|Keidar (2008)]].</ref> The same system may be characterized both as "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel.<ref>[[Distributed computing#CITEREFLynch1996|Lynch (1996)]], p. xix, 1–2. [[Distributed computing#CITEREFPeleg2000|Peleg (2000)]], p. 1.</ref>

=====Cluster computing=====
{{main|Computer cluster}}
[[File:Beowulf.jpg|right|thumbnail|upright|A [[Beowulf (computing)|Beowulf cluster]]]]

A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer.<ref>[http://www.webopedia.com/TERM/c/clustering.html What is clustering?] Webopedia computer dictionary. Retrieved on November 7, 2007.</ref> Clusters are composed of multiple standalone machines connected by a network. While machines in a cluster do not have to be symmetric, [[Load balancing (computing)|load balancing]] is more difficult if they are not. The most common type of cluster is the [[Beowulf (computing)|Beowulf cluster]], which is a cluster implemented on multiple identical [[commercial off-the-shelf]] computers connected with a [[TCP/IP]] [[Ethernet]] [[local area network]].<ref>[https://www.pcmag.com/encyclopedia_term/0,2542,t=Beowulf&i=38548,00.asp Beowulf definition.] {{Webarchive|url=https://web.archive.org/web/20121010215231/https://www.pcmag.com/encyclopedia_term/0%2C2542%2Ct%3DBeowulf%26i%3D38548%2C00.asp |date=2012-10-10 }} ''PC Magazine''. Retrieved on November 7, 2007.</ref> Beowulf technology was originally developed by [[Thomas Sterling (computing)|Thomas Sterling]] and [[Donald Becker]]. 87% of all [[TOP500|Top500]] supercomputers are clusters.<ref>{{Cite web|url=https://www.top500.org/statistics/list/|title=List Statistics {{!}} TOP500 Supercomputer Sites|website=www.top500.org|language=en|access-date=2018-08-05}}</ref> The remainder are massively parallel processors, explained below.

Because grid computing systems (described below) can easily handle embarrassingly parallel problems, modern clusters are typically designed to handle more difficult problems—problems that require nodes to share intermediate results with each other more often. This requires a high bandwidth and, more importantly, a low-[[latency (engineering)|latency]] interconnection network.
Many historic and current supercomputers use customized high-performance network hardware specifically designed for cluster computing, such as the Cray Gemini network.<ref>[https://www.nersc.gov/users/computational-systems/hopper/configuration/interconnect/ "Interconnect"] {{webarchive|url=https://web.archive.org/web/20150128133120/https://www.nersc.gov/users/computational-systems/hopper/configuration/interconnect/ |date=2015-01-28 }}.</ref> As of 2014, most current supercomputers use some off-the-shelf standard network hardware, often [[Myrinet]], [[InfiniBand]], or [[Gigabit Ethernet]].

=====Massively parallel computing=====
{{main|Massively parallel (computing)}}
[[File:BlueGeneL cabinet.jpg|right|thumbnail|upright|A cabinet from [[IBM]]'s [[Blue Gene|Blue Gene/L]] massively parallel [[supercomputer]]]]

A massively parallel processor (MPP) is a single computer with many networked processors. MPPs have many of the same characteristics as clusters, but MPPs have specialized interconnect networks (whereas clusters use commodity hardware for networking). MPPs also tend to be larger than clusters, typically having "far more" than 100 processors.<ref>Hennessy and Patterson, p. 537.</ref> In an MPP, "each CPU contains its own memory and copy of the operating system and application. Each subsystem communicates with the others via a high-speed interconnect."<ref>[https://www.pcmag.com/encyclopedia_term/0,,t=mpp&i=47310,00.asp MPP Definition.] {{Webarchive|url=https://web.archive.org/web/20130511084523/https://www.pcmag.com/encyclopedia_term/0%2C%2Ct%3Dmpp%26i%3D47310%2C00.asp |date=2013-05-11 }} ''PC Magazine''. Retrieved on November 7, 2007.</ref>

[[IBM]]'s [[Blue Gene|Blue Gene/L]], the fifth fastest [[supercomputer]] in the world according to the June 2009 [[TOP500]] ranking, is an MPP.

=====Grid computing=====
{{main|Grid computing}}

Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the [[Internet]] to work on a given problem. Because of the low bandwidth and extremely high latency available on the Internet, grid computing typically deals only with [[embarrassingly parallel]] problems.

Most grid computing applications use [[middleware]] (software that sits between the operating system and the application to manage network resources and standardize the software interface). The most common grid computing middleware is the [[Berkeley Open Infrastructure for Network Computing]] (BOINC). Often [[volunteer computing]] software makes use of "spare cycles", performing computations at times when a computer is idling.<ref>{{cite journal|last=Kirkpatrick|first=Scott|title=COMPUTER SCIENCE: Rough Times Ahead|journal=Science|volume=299|issue=5607|pages=668–669|doi=10.1126/science.1081623|year=2003|pmid=12560537|s2cid=60622095}}</ref>

=====Cloud computing=====
{{main|Cloud computing}}

The ubiquity of the Internet brought the possibility of large-scale cloud computing.

====Specialized parallel computers====

Within parallel computing, there are specialized parallel devices that remain niche areas of interest. While not [[Domain-specific programming language|domain-specific]], they tend to be applicable to only a few classes of parallel problems.

=====Reconfigurable computing with field-programmable gate arrays=====

[[Reconfigurable computing]] is the use of a [[field-programmable gate array]] (FPGA) as a co-processor to a general-purpose computer. An FPGA is, in essence, a computer chip that can rewire itself for a given task.
FPGAs can be programmed with [[hardware description language]]s such as [[VHDL]]<ref>{{Cite journal|last1=Valueva|first1=Maria|last2=Valuev|first2=Georgii|last3=Semyonova|first3=Nataliya|last4=Lyakhov|first4=Pavel|last5=Chervyakov|first5=Nikolay|last6=Kaplun|first6=Dmitry|last7=Bogaevskiy|first7=Danil|date=2019-06-20|title=Construction of Residue Number System Using Hardware Efficient Diagonal Function|journal=Electronics|language=en|volume=8|issue=6|pages=694|doi=10.3390/electronics8060694|issn=2079-9292|quote=All simulated circuits were described in very high speed integrated circuit (VHSIC) hardware description language (VHDL). Hardware modeling was performed on Xilinx FPGA Artix 7 xc7a200tfbg484-2.|doi-access=free}}</ref> or [[Verilog]].<ref>{{Cite book|last1=Gupta|first1=Ankit|last2=Suneja|first2=Kriti|title=2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS) |chapter=Hardware Design of Approximate Matrix Multiplier based on FPGA in Verilog |date=May 2020|chapter-url=https://ieeexplore.ieee.org/document/9121004|location=Madurai, India|publisher=IEEE|pages=496–498|doi=10.1109/ICICCS48265.2020.9121004|isbn=978-1-7281-4876-2|s2cid=219990653}}</ref> Several vendors have created [[C to HDL]] languages that attempt to emulate the syntax and semantics of the [[C programming language]], with which most programmers are familiar. The best known C to HDL languages are [[Mitrionics|Mitrion-C]], [[Impulse C]], and [[Handel-C]]. Specific subsets of [[SystemC]] based on C++ can also be used for this purpose.

AMD's decision to open its [[HyperTransport]] technology to third-party vendors has become the enabling technology for high-performance reconfigurable computing.<ref name="DAmour">D'Amour, Michael R., Chief Operating Officer, DRC Computer Corporation. "Standard Reconfigurable Computing". Invited speaker at the University of Delaware, February 28, 2007.</ref> According to Michael R. D'Amour, Chief Operating Officer of DRC Computer Corporation, "when we first walked into AMD, they called us 'the [[CPU socket|socket]] stealers.' Now they call us their partners."<ref name="DAmour"/>

=====General-purpose computing on graphics processing units (GPGPU)=====
{{main|GPGPU}}
[[File:NvidiaTesla.jpg|right|thumbnail|Nvidia's [[Nvidia Tesla|Tesla GPGPU card]]]]

General-purpose computing on [[graphics processing unit]]s (GPGPU) is a fairly recent trend in computer engineering research. GPUs are co-processors that have been heavily optimized for [[computer graphics]] processing.<ref>Boggan, Sha'Kia and Daniel M. Pressel (August 2007). [https://discover.dtic.mil/results/?q=ARL-SR-154 GPUs: An Emerging Platform for General-Purpose Computation] (PDF). ARL-SR-154, U.S. Army Research Lab. Retrieved on November 7, 2007.</ref> Computer graphics processing is a field dominated by data parallel operations—particularly [[linear algebra]] [[Matrix (mathematics)|matrix]] operations.

In the early days, GPGPU programs used the normal graphics APIs for executing programs. However, several new programming languages and platforms have been built to do general purpose computation on GPUs with both [[Nvidia]] and [[AMD]] releasing programming environments with [[CUDA]] and [[AMD FireStream#Software Development Kit|Stream SDK]] respectively. Other GPU programming languages include [[BrookGPU]], [[PeakStream]], and [[RapidMind]]. Nvidia has also released specific products for computation in their [[Nvidia Tesla|Tesla series]].

The technology consortium Khronos Group has released the [[OpenCL]] specification, which is a framework for writing programs that execute across platforms consisting of CPUs and GPUs. [[AMD]], [[Apple Inc.|Apple]], [[Intel]], [[Nvidia]] and others are supporting [[OpenCL]].
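As a rough illustration of how such a framework is used (a sketch added here, not drawn from the article or its sources, and assuming an OpenCL 1.2 runtime with at least one available device), the following C host program compiles a small kernel at run time and offloads an element-wise vector addition; the kernel string is the code executed by each work-item on the device. The kernel name <code>vadd</code> is chosen only for the example, and error checking is omitted for brevity.

<syntaxhighlight lang="c">
/* Illustrative sketch only: element-wise vector addition with OpenCL 1.2.
 * Error checking is omitted. Link with -lOpenCL. */
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Pick the first platform and its default device. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Build the kernel from source at run time. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    /* Copy the input vectors to device memory and allocate the output. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    /* Launch one work-item per element, then read the result back. */
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

    printf("c[10] = %f\n", c[10]);   /* expected 30.0 */
    return 0;
}
</syntaxhighlight>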
=====Application-specific integrated circuits=====
{{main|Application-specific integrated circuit}}

Several [[application-specific integrated circuit]] (ASIC) approaches have been devised for dealing with parallel applications.<ref>Maslennikov, Oleg (2002). [https://doi.org/10.1007%2F3-540-48086-2_30 "Systematic Generation of Executing Programs for Processor Elements in Parallel ASIC or FPGA-Based Systems and Their Transformation into VHDL-Descriptions of Processor Element Control Units".] ''Lecture Notes in Computer Science'', '''2328/2002:''' p. 272.</ref><ref>{{cite book|last=Shimokawa|first=Y.|author2=Fuwa, Y. |author3=Aramaki, N. |title=[Proceedings] 1991 IEEE International Joint Conference on Neural Networks |chapter=A parallel ASIC VLSI neurocomputer for a large number of neurons and billion connections per second speed |date=18–21 November 1991|volume=3|pages=2162–2167|doi=10.1109/IJCNN.1991.170708|isbn=978-0-7803-0227-3|s2cid=61094111}}</ref><ref>{{cite journal|last=Acken|first=Kevin P.|author2=Irwin, Mary Jane |author3=Owens, Robert M.|title=A Parallel ASIC Architecture for Efficient Fractal Image Coding |journal=The Journal of VLSI Signal Processing|date=July 1998|volume=19|issue=2|pages=97–113|doi=10.1023/A:1008005616596|bibcode=1998JSPSy..19...97A |s2cid=2976028}}</ref>

Because an ASIC is (by definition) specific to a given application, it can be fully optimized for that application. As a result, for a given application, an ASIC tends to outperform a general-purpose computer. However, ASICs are created by [[photolithography|UV photolithography]]. This process requires a mask set, which can be extremely expensive. A mask set can cost over a million US dollars.<ref>Kahng, Andrew B. (June 21, 2004) "[http://www.future-fab.com/documents.asp?grID=353&d_ID=2596 Scoping the Problem of DFM in the Semiconductor Industry] {{webarchive|url=https://web.archive.org/web/20080131221732/http://www.future-fab.com/documents.asp?grID=353&d_ID=2596 |date=2008-01-31 }}." University of California, San Diego. "Future design for manufacturing (DFM) technology must reduce design [non-recoverable expenditure] cost and directly address manufacturing [non-recoverable expenditures]—the cost of a mask set and probe card—which is well over $1 million at the 90 nm technology node and creates a significant damper on semiconductor-based innovation."</ref> (The smaller the transistors required for the chip, the more expensive the mask will be.) Meanwhile, performance increases in general-purpose computing over time (as described by [[Moore's law]]) tend to wipe out these gains in only one or two chip generations.<ref name="DAmour"/> High initial cost, and the tendency to be overtaken by Moore's-law-driven general-purpose computing, has rendered ASICs unfeasible for most parallel computing applications. However, some have been built. One example is the PFLOPS [[RIKEN MDGRAPE-3]] machine, which uses custom ASICs for [[molecular dynamics]] simulation.

=====Vector processors=====
{{main|Vector processor}}
[[File:Cray 1 IMG 9126.jpg|right|thumbnail|The [[Cray-1]] is a vector processor.]]

A vector processor is a CPU or computer system that can execute the same instruction on large sets of data.
Vector processors have high-level operations that work on linear arrays of numbers or vectors. An example vector operation is ''A'' = ''B'' × ''C'', where ''A'', ''B'', and ''C'' are each 64-element vectors of 64-bit [[floating-point]] numbers.<ref name=PH751>Patterson and Hennessy, p. 751.</ref> They are closely related to Flynn's SIMD classification.<ref name=PH751/> [[Cray]] computers became famous for their vector-processing computers in the 1970s and 1980s. However, vector processors—both as CPUs and as full computer systems—have generally disappeared. Modern [[Instruction set|processor instruction sets]] do include some vector processing instructions, such as with [[Freescale Semiconductor]]'s [[AltiVec]] and [[Intel]]'s [[Streaming SIMD Extensions]] (SSE).
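Modern short-vector extensions expose the same idea at a smaller scale. The following sketch is added here for illustration only and assumes an x86 processor with SSE2 support (standard on x86-64); it computes the example operation ''A'' = ''B'' × ''C'' on 64-element vectors of 64-bit floats, two elements per instruction, whereas a classic vector processor such as the Cray-1 would operate on a much longer vector with a single instruction.

<syntaxhighlight lang="c">
/* Illustrative sketch only: A = B × C on 64-element vectors of 64-bit floats,
 * written with Intel SSE2 intrinsics. Assumes an x86 CPU with SSE2. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

#define N 64

int main(void) {
    double A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { B[i] = (double)i; C[i] = 2.0; }

    for (int i = 0; i < N; i += 2) {
        __m128d b = _mm_loadu_pd(&B[i]);          /* load two doubles from B   */
        __m128d c = _mm_loadu_pd(&C[i]);          /* load two doubles from C   */
        _mm_storeu_pd(&A[i], _mm_mul_pd(b, c));   /* A[i], A[i+1] = B × C      */
    }

    printf("A[10] = %.1f\n", A[10]);   /* prints 20.0 */
    return 0;
}
</syntaxhighlight>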