== Comparison with modern architectures ==

{{As of|2016}} most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing, because they operate on multiple (vectorized, explicit-length) data sets and borrow features from vector processors. However, by definition, the addition of SIMD cannot by itself qualify a processor as an actual ''vector processor'', because SIMD is {{em|fixed-length}} and vectors are {{em|variable-length}}. The difference is illustrated below with examples comparing the three categories: pure SIMD, predicated SIMD, and pure vector processing.{{citation needed|date=June 2021}}

* '''Pure (fixed) SIMD''' - also known as "Packed SIMD",<ref>{{cite conference|first1=Y.|last1=Miyaoka|first2=J.|last2=Choi|first3=N.|last3=Togawa|first4=M.|last4=Yanagisawa|first5=T.|last5=Ohtsuki|title=An algorithm of hardware unit generation for processor core synthesis with packed SIMD type instructions|conference=Asia-Pacific Conference on Circuits and Systems|date=2002|pages=171–176|volume=1|doi=10.1109/APCCAS.2002.1114930|hdl=2065/10689|hdl-access=free}}</ref> [[SIMD within a register]] (SWAR), and [[Flynn's taxonomy#Pipelined processor|Pipelined Processor]] in Flynn's Taxonomy. Common examples using SIMD with features inspired by vector processors include: Intel x86's [[MMX (instruction set)|MMX]], [[Streaming SIMD Extensions|SSE]] and [[Advanced Vector Extensions|AVX]] instructions, AMD's [[3DNow!]] extensions, [[ARM NEON]], Sparc's [[Visual Instruction Set|VIS]] extension, [[PowerPC]]'s [[AltiVec]] and MIPS' [[MIPS_architecture#Application-specific_extensions|MSA]]. In 2000, [[IBM]], [[Toshiba]] and [[Sony]] collaborated to create the [[Cell processor]], which is also SIMD.
* '''Predicated SIMD''' - also known as [[Flynn's taxonomy#Associative processor|associative processing]]. Two notable examples with per-element (lane-based) predication are [[Scalable Vector Extension|ARM SVE2]] and [[AVX-512]].
* '''Pure Vectors''' - as categorised in [[Duncan's taxonomy#Pipelined vector processors|Duncan's taxonomy]], these include the original [[Cray-1]], [[Convex Computer|Convex C-Series]], [[NEC SX]], and [[RISC-V#Vector set|RISC-V RVV]]. Although memory-based, the [[CDC STAR-100]] was also a vector processor.

Other CPU designs include multiple instructions for vector processing on multiple (vectorized) data sets, typically known as [[MIMD]] (Multiple Instruction, Multiple Data) and realized with [[VLIW]] (Very Long Instruction Word) and [[Explicitly parallel instruction computing|EPIC]] (Explicitly Parallel Instruction Computing). The [[Fujitsu FR-V]] VLIW/vector processor combines both technologies.

=== Difference between SIMD and vector processors ===

SIMD instruction sets lack crucial features when compared to vector instruction sets. The most important of these is that vector processors, inherently by definition and design, have always been variable-length since their inception.
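This variable-length property can be sketched in plain C. The sketch below is illustrative only: {{code|WIDTH}}, {{code|add_simd_fixed}}, {{code|add_vector}} and {{code|setvl}} are hypothetical stand-ins (with {{code|setvl}} loosely modelling the role of a "set vector length" instruction such as RVV's {{code|vsetvl}} or NEC SX's {{code|lvl}}), not real intrinsics.

```c
#include <stddef.h>

#define WIDTH 4  /* hypothetical fixed SIMD lane count */

/* Fixed-width SIMD: full WIDTH-sized batches, plus a separate scalar
 * "tail" loop for the leftover elements when n is not a multiple of WIDTH. */
static void add_simd_fixed(double *dst, const double *a, const double *b,
                           size_t n) {
    size_t i = 0;
    for (; i + WIDTH <= n; i += WIDTH)      /* each pass models one SIMD op */
        for (size_t j = 0; j < WIDTH; j++)
            dst[i + j] = a[i + j] + b[i + j];
    for (; i < n; i++)                      /* scalar tail code */
        dst[i] = a[i] + b[i];
}

/* Hypothetical stand-in for a vector ISA's "set vector length" request:
 * the hardware reports how many elements it will process this iteration.
 * The value need not be a power of two or a multiple of a fixed width. */
static size_t setvl(size_t n) {
    return n < WIDTH ? n : WIDTH;
}

/* Variable-length vector style: one loop, no tail code, works for any n. */
static void add_vector(double *dst, const double *a, const double *b,
                       size_t n) {
    while (n > 0) {
        size_t vl = setvl(n);               /* each pass models one vector op */
        for (size_t j = 0; j < vl; j++)
            dst[j] = a[j] + b[j];
        dst += vl; a += vl; b += vl; n -= vl;
    }
}
```

Both functions compute the same sums; the difference is structural: the fixed-width version needs the extra tail loop, while the vector version folds any length into the same loop.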
Whereas pure (fixed-width, no predication) SIMD is often mistakenly claimed to be "vector" (because SIMD processes data which happens to be vectors), close analysis and comparison of historic and modern ISAs shows that actual vector ISAs have the following features that no SIMD ISA has:{{citation needed|reason=See [[Talk:Vector processor#Discernable features]]|date=June 2021}}

* a way to set the vector length, such as the {{code|vsetvl}} instruction in RISC-V RVV,<ref>{{Cite web|url=https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vector-config|title = Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec|website = [[GitHub]]| date=16 June 2023 }}</ref> or the {{code|lvl}} instruction in NEC SX,<ref>{{Cite web|url=https://sxauroratsubasa.sakura.ne.jp/documents/sdk/pdfs/VectorEngine-as-manual-v1.3.pdf|title=Vector Engine Assembly Language Reference Manual|date=16 June 2023}}</ref> without restricting the length to a [[power of two]] or to a multiple of a fixed data width.
* Iteration and reduction over elements {{em|within}} vectors.

Predicated SIMD (part of [[Flynn's taxonomy]]), which provides comprehensive individual element-level predicate masks on every vector instruction, as now available in ARM SVE2<ref>{{Cite web|url=https://developer.arm.com/tools-and-software/server-and-hpc/compile/arm-instruction-emulator/resources/tutorials/sve/sve-vs-sve2/single-page|title = Documentation – Arm Developer}}</ref> and [[AVX-512]], almost qualifies as a vector processor.{{how?|date=December 2023}} Predicated SIMD uses fixed-width SIMD ALUs but allows locally controlled (predicated) activation of units to provide the appearance of variable-length vectors. Examples below help explain these categorical distinctions.

SIMD, because it uses fixed-width batch processing, is {{em|unable by design}} to cope with iteration and reduction. This is illustrated further with examples, below.
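The effect of per-element predication can also be sketched in C. This is a conceptual sketch only: {{code|add_predicated}} and {{code|first_n_mask}} are hypothetical illustrations of the masking idea, not SVE2 or AVX-512 intrinsics.

```c
#include <stdbool.h>
#include <stddef.h>

#define WIDTH 4  /* fixed SIMD lane count of the hypothetical hardware */

/* One predicated WIDTH-wide add: lanes whose mask bit is clear are left
 * untouched. This is the mechanism by which SVE2/AVX-512-style masking
 * lets fixed-width hardware mimic a shorter vector. */
static void add_predicated(double *dst, const double *a, const double *b,
                           const bool mask[WIDTH]) {
    for (size_t j = 0; j < WIDTH; j++)
        if (mask[j])
            dst[j] = a[j] + b[j];
}

/* Enable only the first n lanes (n <= WIDTH), emulating vector length n. */
static void first_n_mask(bool mask[WIDTH], size_t n) {
    for (size_t j = 0; j < WIDTH; j++)
        mask[j] = (j < n);
}
```

With a mask enabling only the first three lanes, the fourth lane's destination is left untouched, so the fixed 4-wide unit behaves like a length-3 vector operation.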
[[File:Simd vs vector.png|thumb|500px]]

Additionally, vector processors can be more resource-efficient, using slower hardware and saving power yet still achieving throughput and lower latency than SIMD, through [[Chaining (vector processing)|vector chaining]].<ref>{{Cite web|url=http://thebeardsage.com/vector-architecture/|title = Vector Architecture|date = 27 April 2020}}</ref><ref>[http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture11-vector.pdf Vector and SIMD processors, slides 12-13]</ref>

Consider both a SIMD processor and a vector processor working on 4 64-bit elements, performing a LOAD, ADD, MULTIPLY and STORE sequence. If the SIMD width is 4, then the SIMD processor must LOAD all four elements before it can move on to the ADDs, must complete all the ADDs before it can move on to the MULTIPLYs, and likewise must complete all of the MULTIPLYs before it can start the STOREs. This is by definition and by design.<ref>[https://course.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?media=onur-740-fall13-module5.1.1-simd-and-gpus-part1.pdf Array vs Vector Processing, slides 5-7]</ref>

Having to perform 4-wide simultaneous 64-bit LOADs and 64-bit STOREs is very costly in hardware (256-bit data paths to memory), as is having four 64-bit ALUs, especially for MULTIPLY. To avoid these high costs, a SIMD processor would have to have a 1-wide 64-bit LOAD, a 1-wide 64-bit STORE, and only 2-wide 64-bit ALUs. As shown in the diagram, which assumes a [[Superscalar processor|multi-issue execution model]], the consequence is that the operations now take longer to complete. If multi-issue is not possible, then the operations take even longer, because the LOAD may not be issued (started) at the same time as the first ADDs, and so on. If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when all four LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin.
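The strict phase ordering described above can be put into a toy cycle-count model. The numbers are assumptions for illustration (each phase drains completely at its issue width, with no overlap between phases); they are not measurements of any real processor.

```c
/* Toy model: in pure SIMD, each phase (all LOADs, then all ADDs, then all
 * MULTIPLYs, then all STOREs) must finish before the next may begin. */
static int ceil_div(int a, int b) {
    return (a + b - 1) / b;
}

static int simd_batch_cycles(int n, int load_width, int alu_width,
                             int store_width) {
    int loads  = ceil_div(n, load_width);   /* all LOADs first     */
    int adds   = ceil_div(n, alu_width);    /* then all ADDs       */
    int muls   = ceil_div(n, alu_width);    /* then all MULTIPLYs  */
    int stores = ceil_div(n, store_width);  /* finally all STOREs  */
    return loads + adds + muls + stores;
}
```

Under this model, the cost-reduced configuration from the text (1-wide LOAD, 2-wide ALUs, 1-wide STORE, 4 elements) takes 4 + 2 + 2 + 4 = 12 cycles.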
A vector processor, by contrast, even if it is ''single-issue'' and uses no SIMD ALUs, having only a 1-wide 64-bit LOAD and a 1-wide 64-bit STORE (and, as in the [[Cray-1]], the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilization, due to [[Chaining (vector processing)|vector chaining]], is a key advantage and difference compared to SIMD. SIMD, by design and definition, cannot perform chaining except to the entire group of results.<ref>[https://course.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?media=seth-740-fall13-module5.1-simd-vector-gpu.pdf SIMD vs Vector GPU, slides 22-24]</ref>
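Chaining can be given the same kind of toy model. Again the numbers are illustrative assumptions only: each element is assumed to advance one pipeline stage (LOAD, ADD, MULTIPLY, STORE) per cycle, with each unit's result forwarded directly to the next.

```c
/* Toy model of vector chaining: the first element's result emerges after
 * `stages` cycles, and each subsequent element streams out one cycle later,
 * because ADD, MULTIPLY and STORE consume results as they are produced
 * rather than waiting for the previous phase to finish completely. */
static int vector_chained_cycles(int n, int stages) {
    return stages + (n - 1);
}
```

For 4 elements through the 4-stage LOAD-ADD-MULTIPLY-STORE chain this gives 4 + 3 = 7 cycles, compared with the 12 cycles (4 LOADs + 2 ADDs + 2 MULTIPLYs + 4 STOREs) that the same narrow hardware needs when every phase must drain before the next may start.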