Vector processor
== History ==

=== Early research and development ===

Vector processing development began in the early 1960s at the [[Westinghouse Electric Corporation]] in their ''Solomon'' project. Solomon's goal was to dramatically increase math performance by using a large number of simple [[coprocessor]]s under the control of a single master [[central processing unit]] (CPU). The CPU fed a single common instruction to all of the [[arithmetic logic unit]]s (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single [[algorithm]] to a large [[data set]], fed in the form of an array.{{cn|date=July 2023}}

In 1962, Westinghouse cancelled the project, but the effort was restarted by the [[University of Illinois at Urbana–Champaign]] as the [[ILLIAC IV]]. Their version of the design originally called for a 1 [[GFLOPS]] machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as [[computational fluid dynamics]], the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category, [[massively parallel]] computing.
Around this time Flynn categorized this type of processing as an early form of [[single instruction, multiple threads]] (SIMT).{{cn|date=July 2023}}

[[International Computers Limited]] sought to avoid many of the difficulties with the ILLIAC concept with its own [[Distributed Array Processor]] (DAP) design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray 1.<ref name="newscientist19760617_dap">{{ cite magazine | url=https://archive.org/details/bub_gb_m8S4bXj3dcMC/page/n11/mode/2up | title=Computers by the thousand | magazine=New Scientist | last1=Parkinson | first1=Dennis | date=17 June 1976 | access-date=7 July 2024 | pages=626–627 }}</ref>

=== Computer for operations with functions ===

A [[computer for operations with functions]] was presented and developed by Kartsev in 1967.<ref name="Malinovsky">{{cite book| title=The history of computer technology in their faces (in Russian)| author= B.N. Malinovsky |publisher=KIT |year=1995 |isbn=5770761318}}</ref>

=== Supercomputers ===

{{unreferenced section|date=July 2023}}

The first vector supercomputers were the [[Control Data Corporation]] [[STAR-100]] and [[Texas Instruments]] [[Advanced Scientific Computer]] (ASC), introduced in 1974 and 1972, respectively.<!--The STAR was announced before the ASC-->

The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes.
The STAR-100 was otherwise slower than CDC's own supercomputers like the [[CDC 7600]], but at data-related tasks it could keep up while being much smaller and less expensive. However, the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up.

The vector technique was first fully exploited in 1976 by the famous [[Cray-1]]. Instead of leaving the data in memory like the STAR-100 and ASC, the Cray design had eight [[vector registers]], which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than accessing main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to the next operation, the Cray design would load a smaller section of the vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations.

The Cray design used [[pipeline parallelism]] to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions; for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions to be pipelined into each of the ALU subunits, a technique they called [[Chaining (vector processing)|''vector chaining'']]. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS and averaged around 150 MFLOPS – far faster than any machine of the era.

[[File:Cray J90 CPU module.jpg|thumb|[[Cray J90]] processor module with four scalar/vector processors]]

Other examples followed.
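The register-based approach described above can be sketched in C as a "strip-mining" loop. This is a conceptual illustration only, not Cray assembly or a real vector ISA: the inner loops stand in for single vector instructions, and the local arrays stand in for the 64-word vector registers.

```c
#include <stddef.h>

#define VLEN 64  /* Cray-1 vector registers held 64 words each */

/* Memory-to-memory style (STAR-100-like): every operation streams
 * full-length vectors to and from main memory. */
void axpy_memory(double *y, const double *x, double a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Register style (Cray-like): process the vector in 64-element strips,
 * keeping intermediates in "registers" so successive operations on a
 * strip avoid round trips to memory. */
void axpy_strips(double *y, const double *x, double a, size_t n) {
    for (size_t i = 0; i < n; i += VLEN) {
        size_t vl = (n - i < VLEN) ? n - i : VLEN;  /* last partial strip */
        double vx[VLEN], vy[VLEN];                  /* "vector registers" */
        for (size_t j = 0; j < vl; j++) vx[j] = x[i + j];  /* vector load */
        for (size_t j = 0; j < vl; j++) vy[j] = y[i + j];  /* vector load */
        for (size_t j = 0; j < vl; j++) vx[j] *= a;        /* multiply pipe */
        for (size_t j = 0; j < vl; j++) vy[j] += vx[j];    /* add pipe, "chained" */
        for (size_t j = 0; j < vl; j++) y[i + j] = vy[j];  /* vector store */
    }
}
```

On the real hardware the multiply and add loops would run in separate pipelines, with the adder consuming the multiplier's results as they emerge – the vector chaining described above.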
[[Control Data Corporation]] tried to re-enter the high-end market with its [[ETA-10]] machine, but it sold poorly, and the company took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies ([[Fujitsu]], [[Hitachi]] and [[Nippon Electric Corporation]] (NEC)) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. [[Oregon]]-based [[Floating Point Systems]] (FPS) built add-on array processors for [[minicomputer]]s, later building their own [[minisupercomputer]]s.

Throughout, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the [[Cray-2]], [[Cray X-MP]] and [[Cray Y-MP]]. Since then, the supercomputer market has focused much more on [[massively parallel]] processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed the [[Virtual Vector Architecture]] for use in supercomputers coupling several scalar processors to act as a vector processor.

Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with its [[NEC SX architecture|SX series]] of computers. Most recently, the [[SX-Aurora TSUBASA]] places the processor and either 24 or 48 gigabytes of memory on an [[High Bandwidth Memory|HBM]] 2 module within a card that physically resembles a graphics coprocessor, but instead of serving as a co-processor, it is the main computer, with the PC-compatible computer into which it is plugged serving support functions.

=== GPU ===

{{Main|Single instruction, multiple threads}}

Modern graphics processing units ([[GPUs]]) include an array of [[shaders|shader pipelines]] which may be driven by [[compute kernel]]s and can be considered vector processors (using a similar strategy for hiding memory latencies).
As shown in [[Flynn's taxonomy|Flynn's 1972 paper]], the key distinguishing factor of SIMT-based GPUs is that they have a single instruction decoder-broadcaster, but the cores receiving and executing that same instruction are otherwise reasonably conventional: each has its own ALUs, its own register file, its own load/store units and its own independent L1 data cache. Thus, although all cores simultaneously execute the exact same instruction in lock-step with each other, they do so with completely different data from completely different memory locations. This is ''significantly'' more complex and involved than [[Flynn's Taxonomy#Pipelined processor|"packed SIMD"]], which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW<ref>[http://miaowgpu.org/ MIAOW Vertical Research Group]</ref> team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture.<ref>[https://github.com/VerticalResearchGroup/miaow/wiki/Architecture-Overview MIAOW GPU]</ref>

=== Recent development ===

Several modern CPU architectures are being designed as vector processors. The [[RISC-V#Vector_set|RISC-V vector extension]] follows similar principles to the early vector processors and is being implemented in commercial products such as the [[Andes Technology]] AX45MPV.<ref>{{cite press release | title = Andes Announces RISC-V Multicore 1024-bit Vector Processor: AX45MPV | url = https://www.globenewswire.com/en/news-release/2022/12/07/2569216/0/en/Andes-Announces-RISC-V-Multicore-1024-bit-Vector-Processor-AX45MPV.html | publisher = GlobeNewswire | date = 7 December 2022 | access-date = 23 December 2022}}</ref> There are also several [[open source]] vector processor architectures being developed, including [[Agner Fog#ForwardCom_instruction_set|ForwardCom]] and [[Libre-SOC]].
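The SIMT-versus-packed-SIMD distinction discussed in the GPU section can be sketched in plain C. This is a conceptual model only, with loops standing in for hardware lanes and cores: packed SIMD applies one arithmetic operation to adjacent lanes of a register, while SIMT cores all receive the same broadcast instruction but compute their own addresses through independent load/store units.

```c
#include <stddef.h>

#define LANES 4

/* Packed SIMD: one operation over adjacent elements of a fixed-width
 * register; the data layout is implied by the register itself. */
void simd_add(float *dst, const float *a, const float *b) {
    for (int lane = 0; lane < LANES; lane++)       /* lanes of one register */
        dst[lane] = a[lane] + b[lane];
}

/* SIMT: every "core" executes the same broadcast instruction in
 * lock-step, but each computes its own addresses, so the operands may
 * come from completely different (gathered) memory locations. */
void simt_gather_add(float *dst, const float *mem,
                     const size_t *idx_a, const size_t *idx_b) {
    for (int core = 0; core < LANES; core++)           /* lock-step cores */
        dst[core] = mem[idx_a[core]] + mem[idx_b[core]];  /* independent loads */
}
```

The arithmetic is identical in both sketches; the difference the section describes is entirely in addressing, which is why SIMT hardware needs per-core load/store units and caches while packed SIMD does not.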