=== Vector instructions ===
{{See also|SIMD|Single Instruction Multiple Threads}}
The vector pseudocode example above relies on the assumption that the vector computer can process more than ten numbers in one batch. For greater quantities of numbers in the vector register, it becomes infeasible for the computer to have a register that large. As a result, the vector processor either gains the ability to perform loops itself, or exposes some sort of vector control (status) register to the programmer, usually known as the vector length.

Self-repeating instructions are found in early vector computers like the STAR-100, where the above action would be described in a single instruction (somewhat like {{code|1=vadd c, a, b, $10}}). They are also found in the [[x86]] architecture as the {{code|REP}} prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by memory access also became huge. Broadcom included space in all vector operations of the [[Videocore]] IV ISA for a {{code|REP}} field, but unlike the STAR-100, which uses memory for its repeats, the Videocore IV repeats apply to all operations, including arithmetic vector operations. The repeat length can be a small [[power of two]], or sourced from one of the scalar registers.<ref>[https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-Programmers-Manual Videocore IV Programmer's Manual]</ref>

The [[Cray-1]] introduced the idea of using [[processor register]]s to hold vector data in batches. The batch length (vector length, VL) could be set dynamically with a special instruction; the significance, compared to the Videocore IV (and, crucially as will be shown below, SIMD as well), is that the repeat length does not have to be part of the instruction encoding.
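The effect of a dynamic vector length can be sketched in software as a "strip-mined" loop: the array length is arbitrary, and each pass sets the vector length to at most the hardware register width. This is only an illustrative C simulation; the names {{code|MAXVL}} and {{code|vadd}} are invented for the sketch, and on real Cray-style hardware the inner loop would be a single vector instruction.

```c
#include <stddef.h>

/* Illustrative sketch only: MAXVL stands in for the hardware vector
   register width, and the inner loop stands in for one vector
   instruction operating on a batch of VL elements. */
#define MAXVL 64

void vadd(const double *a, const double *b, double *c, size_t n) {
    while (n > 0) {
        /* Set the vector length for this pass: min(n, MAXVL). */
        size_t vl = n < MAXVL ? n : MAXVL;
        for (size_t i = 0; i < vl; i++)  /* one hardware batch */
            c[i] = a[i] + b[i];
        a += vl; b += vl; c += vl;       /* advance to next strip */
        n -= vl;
    }
}
```

Because VL is set per pass, the same two instructions (set VL, vector add) handle any array length, which is why the repeat count need not be encoded in the instruction itself.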
This way, significantly more work can be done in each batch, and the instruction encoding is far more elegant and compact as well. The only drawback is that, in order to take full advantage of this extra batch-processing capacity, memory load and store speed had to increase correspondingly. This is sometimes claimed{{By whom|date=November 2021}} to be a disadvantage of Cray-style vector processors: in reality it is part of achieving high-performance throughput, as seen in [[GPU]]s, which face exactly the same issue.

Modern SIMD computers claim to improve on the early Cray designs by directly using multiple ALUs, for a higher degree of parallelism compared to only using the normal scalar pipeline. Modern vector processors (such as the [[SX-Aurora TSUBASA]]) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the vector program at runtime. Masks can be used to selectively load and store data in memory locations, and those same masks can be used to selectively disable processing elements of the SIMD ALUs. Some processors with SIMD ([[AVX-512]], ARM [[Scalable Vector Extension|SVE2]]) are capable of this kind of selective, per-element ([[Predication (computer architecture)|"predicated"]]) processing, and it is these which somewhat deserve the nomenclature "vector processor", or at least deserve the claim of being capable of "vector processing". SIMD processors without per-element predication ([[MMX (instruction set)|MMX]], [[Streaming SIMD Extensions|SSE]], [[AltiVec]]) categorically do not.

Modern GPUs, which have many small compute units each with their own independent SIMD ALUs, use [[Single Instruction Multiple Threads]] (SIMT). SIMT units run from a shared, single, broadcast, synchronised instruction unit. The "vector registers" are very wide and the pipelines tend to be long. The "threading" part of SIMT involves the way data is handled independently on each of the compute units.
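The per-element predication described above can be sketched as a vector operation that consults a bitmask before touching each lane, much as AVX-512 and SVE2 mask registers do. The function name and the choice of a 64-bit integer as the mask are assumptions of this sketch, not any real ISA's interface; in hardware the disabled lanes are suppressed in parallel, not skipped in a loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of per-element ("predicated") processing:
   bit i of the mask decides whether lane i of the vector add takes
   effect.  Disabled lanes leave the destination untouched. */
void masked_vadd(const double *a, const double *b, double *c,
                 uint64_t mask, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        if (mask & ((uint64_t)1 << i))  /* lane enabled? */
            c[i] = a[i] + b[i];
}
```

The same mask could gate loads and stores, which is what lets a predicated vector ISA express conditional code (e.g. {{code|if (x[i] > 0)}}) without branching.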
In addition, GPUs such as the Broadcom [[Videocore]] IV and external vector processors like the [[NEC SX-Aurora TSUBASA]] may use fewer vector units than the register width implies: instead of having 64 units for a 64-number-wide register, the hardware might instead run a pipelined loop over 16 units, in a hybrid approach. The Videocore IV exemplifies this: although it nominally states that its SIMD QPU engine supports 16-long FP array operations in its instructions, it actually performs them 4 at a time, as (another) form of "threads".<ref>[https://jbush001.github.io/2016/03/02/videocore-qpu-pipeline.html Videocore IV QPU analysis by Jeff Bush]</ref>
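This hybrid approach can be sketched as a nominal vector width executed over a smaller number of ALU lanes in successive internal passes. The constants and function name here are purely illustrative (chosen to mirror the Videocore IV's 16-wide instructions over 4 physical lanes, not taken from its manual); the inner loop stands in for lanes that run in parallel in hardware.

```c
/* Illustrative sketch of the hybrid approach: an instruction is
   nominally NOMINAL_WIDTH elements wide, but the hardware has only
   ALU_LANES physical lanes, so it pipelines the work over
   NOMINAL_WIDTH / ALU_LANES internal passes. */
#define NOMINAL_WIDTH 16
#define ALU_LANES 4

void hybrid_vadd(const float *a, const float *b, float *c) {
    for (int pass = 0; pass < NOMINAL_WIDTH / ALU_LANES; pass++)
        for (int lane = 0; lane < ALU_LANES; lane++) {
            /* In hardware, these ALU_LANES operations happen at once. */
            int i = pass * ALU_LANES + lane;
            c[i] = a[i] + b[i];
        }
}
```

From the programmer's point of view the instruction is still 16 wide; only the latency reveals that the hardware iterated internally.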