Vector processor
=== Difference between SIMD and vector processors ===
SIMD instruction sets lack crucial features when compared to vector instruction sets. The most important of these is that vector processors, inherently by definition and design, have always been variable-length since their inception.

Whereas pure (fixed-width, no predication) SIMD is often mistakenly claimed to be "vector" (because SIMD processes data which happens to be vectors), through close analysis and comparison of historic and modern ISAs, actual vector ISAs may be observed to have the following features that no SIMD ISA has:{{citation needed|reason=See [[Talk:Vector processor#Discernable features]]|date=June 2021}}

* a way to set the vector length, such as the {{code|vsetvl}} instruction in RISC-V RVV,<ref>{{Cite web|url=https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vector-config|title = Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec|website = [[GitHub]]| date=16 June 2023 }}</ref> or the {{code|lvl}} instruction in NEC SX,<ref>{{Cite web|url=https://sxauroratsubasa.sakura.ne.jp/documents/sdk/pdfs/VectorEngine-as-manual-v1.3.pdf|title=Vector Engine Assembly Language Reference Manual|date=16 June 2023}}</ref> without restricting the length to a [[power of two]] or to a multiple of a fixed data width
* iteration and reduction over elements {{em|within}} vectors

Predicated SIMD (part of [[Flynn's taxonomy]]), which provides comprehensive individual element-level predicate masks on every vector instruction, as is now available in ARM SVE2<ref>{{Cite web|url=https://developer.arm.com/tools-and-software/server-and-hpc/compile/arm-instruction-emulator/resources/tutorials/sve/sve-vs-sve2/single-page|title = Documentation – Arm Developer}}</ref> and [[AVX-512]], almost qualifies as a vector processor.{{how?|date=December 2023}} Predicated SIMD uses fixed-width SIMD ALUs but allows locally controlled (predicated) activation of units to provide the appearance of variable-length vectors.
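The first two features can be sketched in C. In the following illustrative model, {{code|MAXVL}} and {{code|set_vl}} are hypothetical stand-ins for hardware behaviour such as RISC-V RVV's {{code|vsetvl}}, not a real intrinsic API; a real vector processor would execute each pass as a single vector instruction rather than a scalar loop.

```c
#include <stddef.h>

#define MAXVL 4  /* hardware maximum vector length (illustrative value) */

/* Stand-in for an instruction such as RISC-V RVV's vsetvl or NEC SX's lvl:
   the hardware grants up to MAXVL elements per pass, whatever n is. */
static size_t set_vl(size_t remaining) {
    return remaining < MAXVL ? remaining : MAXVL;
}

/* Sum-reduction over n elements. One loop handles any n: there is no
   power-of-two restriction and no separate scalar "tail" loop, because
   the final pass simply runs with a shorter vector length. */
double vector_sum(const double *a, size_t n) {
    double sum = 0.0;
    while (n > 0) {
        size_t vl = set_vl(n);           /* "vsetvl": elements this pass */
        for (size_t i = 0; i < vl; i++)
            sum += a[i];                 /* reduction within the vector */
        a += vl;
        n -= vl;
    }
    return sum;
}
```

With {{code|MAXVL}} of 4 and {{code|n}} of 5, the loop runs one full pass of 4 elements and a final pass of 1, with no cleanup code: this is the "variable length" property that fixed-width SIMD lacks.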
Examples below help explain these categorical distinctions. SIMD, because it uses fixed-width batch processing, is {{em|unable by design}} to cope with iteration and reduction. [[File:Simd vs vector.png|thumb|500px]] Additionally, vector processors can be more resource-efficient, using slower hardware and saving power while still achieving throughput with less latency than SIMD, through [[Chaining (vector processing)|vector chaining]].<ref>{{Cite web|url=http://thebeardsage.com/vector-architecture/|title = Vector Architecture|date = 27 April 2020}}</ref><ref>[http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture11-vector.pdf Vector and SIMD processors, slides 12-13]</ref>

Consider both a SIMD processor and a vector processor working on 4 64-bit elements, performing a LOAD, ADD, MULTIPLY and STORE sequence. If the SIMD width is 4, then the SIMD processor must LOAD all four elements before it can move on to the ADDs, must complete all the ADDs before it can move on to the MULTIPLYs, and likewise must complete all of the MULTIPLYs before it can start the STOREs. This is by definition and by design.<ref>[https://course.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?media=onur-740-fall13-module5.1.1-simd-and-gpus-part1.pdf Array vs Vector Processing, slides 5-7]</ref>

Performing 4-wide simultaneous 64-bit LOADs and 64-bit STOREs is very costly in hardware (256-bit data paths to memory), as is having four 64-bit ALUs, especially for MULTIPLY. To avoid these high costs, a SIMD processor would have to have a 1-wide 64-bit LOAD, a 1-wide 64-bit STORE, and only 2-wide 64-bit ALUs. As shown in the diagram, which assumes a [[Superscalar processor|multi-issue execution model]], the consequence is that the operations now take longer to complete. If multi-issue is not possible, the operations take even longer, because the LOAD may not be issued (started) at the same time as the first ADDs, and so on.
If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when all four LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin. A vector processor, by contrast, even if it is ''single-issue'' and uses no SIMD ALUs, having only a 1-wide 64-bit LOAD and a 1-wide 64-bit STORE (and, as in the [[Cray-1]], the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilization, due to [[Chaining (vector processing)|vector chaining]], is a key advantage and difference compared to SIMD. SIMD, by design and definition, cannot perform chaining except to the entire group of results.<ref>[https://course.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?media=seth-740-fall13-module5.1-simd-vector-gpu.pdf SIMD vs Vector GPU, slides 22-24]</ref>
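The timing difference can be sketched with a deliberately idealized cycle model. The assumptions here (one operation per cycle, unit latency per functional unit, a 1-wide memory path) are illustrative only and do not match any specific machine; the point is the structural gap between batch execution and chained execution.

```c
#define N 4       /* elements to process */
#define STAGES 4  /* LOAD, ADD, MULTIPLY, STORE */

/* Batch (SIMD-style) model: every stage must finish for all N elements
   before the next stage begins. With a 1-wide memory path and unit
   latency, each stage takes N cycles. */
static int batch_cycles(void) {
    return STAGES * N;               /* 4 stages x 4 elements = 16 */
}

/* Chained (vector-style) model: element i's result is forwarded to the
   next functional unit as soon as it is ready, so the stages overlap
   like a pipeline. Completion = N cycles to stream the elements in,
   plus (STAGES - 1) cycles for the last element to drain through. */
static int chained_cycles(void) {
    return N + (STAGES - 1);         /* 4 + 3 = 7 */
}
```

Under these toy assumptions the batch model needs 16 cycles while the chained model needs 7, even though both use the same narrow hardware: overlap through chaining, not width, provides the speedup.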