Vector processor
=== Vector instruction example ===
This example starts with an algorithm ("IAXPY"), first showing it in scalar instructions, then SIMD, then predicated SIMD, and finally vector instructions. This incremental presentation helps illustrate the difference between a traditional vector processor and a modern SIMD one. The example starts with a 32-bit integer variant of the "DAXPY" function, in [[C (programming language)|C]]:

<syntaxhighlight lang=c>
void iaxpy(size_t n, int a, const int x[], int y[]) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
</syntaxhighlight>

In each iteration, every element of y has an element of x multiplied by a and added to it. The program is expressed in scalar linear form for readability.

==== Scalar assembler ====
The scalar version of this would load one each of x and y, process one calculation, store one result, and loop:

<syntaxhighlight lang=gas>
loop:
  load32  r1, x      ; load one 32bit data
  load32  r2, y
  mul32   r1, a, r1  ; r1 := r1 * a
  add32   r3, r1, r2 ; r3 := r1 + r2
  store32 r3, y
  addl    x, x, $4   ; x := x + 4
  addl    y, y, $4
  subl    n, n, $1   ; n := n - 1
  jgz     n, loop    ; loop back if n > 0
out:
  ret
</syntaxhighlight>

The equivalent STAR-like (memory-to-memory) code is concise, but because the STAR-100's vectorisation was by design based around memory accesses, an extra slot of memory is required to process the information. Twice the latency is also needed due to the extra memory accesses.

<syntaxhighlight lang=gas>
  ; Assume tmp is pre-allocated
  vmul tmp, a, x, n ; tmp[i] = a * x[i]
  vadd y, y, tmp, n ; y[i] = y[i] + tmp[i]
  ret
</syntaxhighlight>

==== Pure (non-predicated, packed) SIMD ====
A modern packed SIMD architecture, known by many names (listed in [[Flynn's taxonomy#Pipelined processor|Flynn's taxonomy]]), can do most of the operation in batches. The code is mostly similar to the scalar version.
It is assumed that both x and y are [[Data structure alignment|properly aligned]] here (only start on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. It can also be assumed, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, as ARM NEON can.<ref>{{Cite web|url=https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-3-matrix-multiplication|title = Coding for Neon - Part 3 Matrix Multiplication| date=11 September 2013 }}</ref> If they do not, a "splat" (broadcast) must be used to copy the scalar argument across a SIMD register:

<syntaxhighlight lang=gas>
  splatx4 v4, a ; v4 = a,a,a,a
</syntaxhighlight>

The time taken would be basically the same as a vector implementation of {{code|1=y = mx + c}} described above.

<syntaxhighlight lang=gas>
vloop:
  load32x4  v1, x
  load32x4  v2, y
  mul32x4   v1, a, v1  ; v1 := v1 * a
  add32x4   v3, v1, v2 ; v3 := v1 + v2
  store32x4 v3, y
  addl      x, x, $16  ; x := x + 16
  addl      y, y, $16
  subl      n, n, $4   ; n := n - 4
  jgz       n, vloop   ; go back if n > 0
out:
  ret
</syntaxhighlight>

Note that both the x and y pointers are incremented by 16, because that is how long (in bytes) four 32-bit integers are. The decision was made that the algorithm ''shall'' only cope with 4-wide SIMD, therefore the constant is hard-coded into the program.

Unfortunately for SIMD, the clue was in the assumptions above, "that n is a multiple of 4" and "aligned access", which together describe a limited specialist use-case. Realistically, for general-purpose loops such as in portable libraries, where n cannot be limited in this way, the overhead of setup and cleanup for SIMD to cope with non-multiples of the SIMD width can far exceed the instruction count inside the loop itself.
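The hard-coded width of the 4-wide loop can be made explicit in C. The following is a sketch only: the inner 4-element loops stand in for single packed SIMD instructions, and the function name and the `assert` guard are illustrative, not part of the original example.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed 4-wide SIMD sketch: n must be a multiple of 4, exactly as the
   assembler version assumes. Each inner 4-iteration loop models one
   packed SIMD instruction (load32x4, mul32x4, add32x4, store32x4). */
void iaxpy_simd4(size_t n, int a, const int x[], int y[])
{
    assert(n % 4 == 0);                 /* the hard-coded SIMD limitation */
    for (size_t i = 0; i < n; i += 4) {
        int v1[4], v2[4], v3[4];
        for (int j = 0; j < 4; j++) v1[j] = x[i + j];      /* load32x4 v1, x */
        for (int j = 0; j < 4; j++) v2[j] = y[i + j];      /* load32x4 v2, y */
        for (int j = 0; j < 4; j++) v1[j] = a * v1[j];     /* mul32x4        */
        for (int j = 0; j < 4; j++) v3[j] = v1[j] + v2[j]; /* add32x4        */
        for (int j = 0; j < 4; j++) y[i + j] = v3[j];      /* store32x4      */
    }
}
```

Passing an n that is not a multiple of 4 trips the assertion, which is precisely the gap that the setup and cleanup code discussed below has to fill.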
Assuming worst-case that the hardware cannot do misaligned SIMD memory accesses, a real-world algorithm will:
* first have to have a preparatory section which works on the beginning unaligned data, up to the first point where SIMD memory-aligned operations can take over. This will involve either (slower) scalar-only operations or smaller-sized packed SIMD operations. Each copy implements the full algorithm inner loop.
* perform the aligned SIMD loop at the maximum SIMD width up until the last few elements (those remaining that do not fit the fixed SIMD width)
* have a cleanup phase which, like the preparatory section, is just as large and just as complex.

Eight-wide SIMD requires repeating the inner loop algorithm first with four-wide SIMD elements, then two-wide SIMD, then one (scalar), with a test and branch in between each one, in order to cover the first and last remaining SIMD elements (0 <= n <= 7).

This more than ''triples'' the size of the code; in extreme cases it results in an ''order of magnitude'' increase in instruction count! This can easily be demonstrated by compiling the iaxpy example for [[AVX-512]], using the options {{code|1="-O3 -march=knl"}} to [[GNU Compiler Collection|gcc]].

Over time as the ISA evolves to keep increasing performance, ISA architects end up adding 2-wide SIMD, then 4-wide SIMD, then 8-wide and upwards. It can therefore be seen why [[AVX-512]] exists in x86. Without predication, the wider the SIMD width the worse the problems get, leading to massive opcode proliferation, degraded performance, extra power consumption and unnecessary software complexity.<ref>[https://www.sigarch.org/simd-instructions-considered-harmful/ SIMD considered harmful]</ref>

Vector processors, on the other hand, are designed to issue computations of variable length for an arbitrary count, n, and thus require very little setup and no cleanup.
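The three-phase structure described in the bullet points above can be sketched in C. This is an illustrative sketch, not the compiler's actual output: the function name is hypothetical, the prologue tests 16-byte alignment of y only (a simplifying assumption; real code must reconcile the alignment of both pointers), and plain scalar loops stand in for the packed instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the real-world setup/cleanup overhead for fixed-width SIMD:
   a scalar prologue up to the first 16-byte-aligned element, an aligned
   4-wide main loop, and a scalar epilogue for the last n % 4 elements.
   Each phase repeats the full inner-loop algorithm. */
void iaxpy_aligned(size_t n, int a, const int x[], int y[])
{
    size_t i = 0;

    /* prologue: scalar until y[i] is 16-byte aligned */
    while (i < n && ((uintptr_t)&y[i] % 16) != 0) {
        y[i] = a * x[i] + y[i];
        i++;
    }

    /* main loop: full SIMD width on aligned data */
    for (; i + 4 <= n; i += 4)
        for (int j = 0; j < 4; j++)          /* models one 4-wide instruction */
            y[i + j] = a * x[i + j] + y[i + j];

    /* epilogue: scalar cleanup for the remaining 0..3 elements */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Even in this simplified form the algorithm appears three times; with 8-wide SIMD and the intermediate 4-wide and 2-wide fallbacks described above, the duplication grows further still.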
Even compared to those SIMD ISAs which have masks (but no {{code|setvl}} instruction), vector processors produce much more compact code because they do not need to perform explicit mask calculation to cover the last few elements (illustrated below).

==== Predicated SIMD ====
Assuming a hypothetical predicated (mask-capable) SIMD ISA, and again assuming that the SIMD instructions can cope with misaligned data, the instruction loop would look like this:

<syntaxhighlight lang=gas>
vloop:
  # prepare mask. few ISAs have min though
  min       t0, n, $4     ; t0 = min(n, 4)
  shift     m, $1, t0     ; m = 1<<t0
  sub       m, m, $1      ; m = (1<<t0)-1
  # now do the operation, masked by m bits
  load32x4  v1, x, m
  load32x4  v2, y, m
  mul32x4   v1, a, v1, m  ; v1 := v1 * a
  add32x4   v3, v1, v2, m ; v3 := v1 + v2
  store32x4 v3, y, m
  # update x, y and n for next loop
  addl      x, t0*4       ; x := x + t0*4
  addl      y, t0*4
  subl      n, n, t0      ; n := n - t0
  # loop?
  jgz       n, vloop      ; go back if n > 0
out:
  ret
</syntaxhighlight>

Here it can be seen that the code is much cleaner, if slightly more complex per iteration; at least there is no setup or cleanup: on the last iteration of the loop, the predicate mask will be set to either 0b0000, 0b0001, 0b0011, 0b0111 or 0b1111, resulting in between 0 and 4 SIMD element operations being performed, respectively. One additional potential complication: some RISC ISAs do not have a "min" instruction, needing instead to use a branch or a scalar predicated compare.

It is clear how predicated SIMD at least merits the term "vector-capable", because it can cope with variable-length vectors by using predicate masks. The final evolutionary step to a "true" vector ISA, however, is to not have any evidence in the ISA ''at all'' of a SIMD width, leaving that entirely up to the hardware.

==== Pure (true) vector ISA ====
For Cray-style vector ISAs such as RVV, an instruction called "{{Not a typo|setvl}}" (set vector length) is used.
The hardware first defines how many data values it can process in one "vector": this could be either actual registers or an internal loop (the hybrid approach, mentioned above). This maximum amount (the number of hardware "lanes") is termed "MVL" (Maximum Vector Length). Note that, as seen in SX-Aurora and Videocore IV, MVL may be an actual hardware lane quantity ''or a virtual one''. ''(Note: As mentioned in the ARM SVE2 Tutorial, programmers '''must''' not make the mistake of assuming a fixed vector width: consequently MVL is not a quantity that the programmer needs to know. This can be a little disconcerting after years of SIMD mindset).''{{Tone inline|date=November 2021}}

On calling {{Not a typo|setvl}} with the number of outstanding data elements to be processed, "{{Not a typo|setvl}}" is permitted (essentially required) to limit that to the Maximum Vector Length (MVL) and thus returns the ''actual'' number that can be processed by the hardware in subsequent vector instructions, and sets the internal special register, "VL", to that same amount. ARM refers to this technique as "vector length agnostic" programming in its tutorials on SVE2.<ref>[https://developer.arm.com/documentation/102131/latest/ ARM SVE2 tutorial]</ref>

Below is the Cray-style vector assembler for the same SIMD-style loop, above. Note that t0 (which, containing a convenient copy of VL, can vary) is used instead of hard-coded constants:

<syntaxhighlight lang=gas>
vloop:
  setvl   t0, n      # VL=t0=min(MVL, n)
  vld32   v0, x      # load vector x
  vld32   v1, y      # load vector y
  vmadd32 v1, v0, a  # v1 += v0 * a
  vst32   v1, y      # store Y
  add     y, t0*4    # advance y by VL*4
  add     x, t0*4    # advance x by VL*4
  sub     n, t0      # n -= VL (t0)
  bnez    n, vloop   # repeat if n != 0
</syntaxhighlight>

This is essentially not very different from the SIMD version (which processes 4 data elements per loop), or from the initial scalar version (which processes just the one).
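The strip-mining pattern that {{Not a typo|setvl}} expresses can be sketched in C. This is an illustrative sketch only: real hardware fixes MVL, so here it is passed as a parameter purely to show that the identical code handles any width; the function name is hypothetical.

```c
#include <stddef.h>

/* Vector-length-agnostic ("setvl") sketch. The program never
   hard-codes a width: each pass asks for min(MVL, n) elements,
   processes that many, and advances. The same code works whether
   mvl is 4, 64 or 10,000 -- and needs no cleanup loop. */
void iaxpy_vla(size_t n, int a, const int x[], int y[], size_t mvl)
{
    while (n > 0) {
        size_t vl = (n < mvl) ? n : mvl;   /* setvl t0, n            */
        for (size_t i = 0; i < vl; i++)    /* one vector instruction */
            y[i] = a * x[i] + y[i];        /* vmadd32 v1, v0, a      */
        x += vl;                           /* advance x by VL*4 bytes */
        y += vl;                           /* advance y by VL*4 bytes */
        n -= vl;                           /* n -= VL                 */
    }
}
```

Note also that calling it with n = 0 performs no work at all, with no special-case guard needed, which corresponds to the zero-length behaviour discussed at the end of this section.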
n still contains the number of data elements remaining to be processed, but t0 contains the copy of VL, the number that is ''going'' to be processed in each iteration. t0 is subtracted from n after each iteration, and if n is zero then all elements have been processed.

A number of things to note, when comparing against the predicated SIMD assembly variant:
# The {{code|setvl}} instruction has embedded within it a {{code|min}} instruction
# Where the SIMD variant hard-coded the width (4) both into the creation of the mask ''and'' into the SIMD width (load32x4 etc.), the vector ISA equivalents have no such limit. This makes vector programs portable, vendor-independent, and future-proof.
# Setting VL effectively ''creates a hidden predicate mask'' that is automatically applied to the vectors
# Where with predicated SIMD the mask bitlength is limited to that which may be held in a scalar (or special mask) register, vector ISAs' mask registers have no such limitation. Cray-1 vectors could be just over 1,000 elements (in 1977).

Thus it can be seen, very clearly, how vector ISAs reduce the number of instructions.

Also note that, just like the predicated SIMD variant, the pointers to x and y are advanced by t0 times four because they both point to 32-bit data, but n is decremented by a straight t0. Compared to the fixed-size SIMD assembler there is very little apparent difference: x and y are advanced by the hard-coded constant 16, n is decremented by a hard-coded 4, so initially it is hard to appreciate the significance. The difference comes with the realisation that the vector hardware could be capable of doing 4 simultaneous operations, or 64, or 10,000; it would be the exact same vector assembler for all of them, ''and there would still be no SIMD cleanup code''. Even compared to the predicate-capable SIMD, it is still more compact, clearer, more elegant and uses fewer resources.
Not only is it a much more compact program (saving on L1 cache size), but as previously mentioned, the vector version can issue far more data processing to the ALUs, again saving power because Instruction Decode and Issue can sit idle.

Additionally, the number of elements going into the function can start at zero. This sets the vector length to zero, which effectively disables all vector instructions, turning them into [[no-op]]s at runtime. Thus, unlike non-predicated SIMD, even when there are no elements to process there is still no wasted cleanup code.