Editing Vector processor (section)

=== Vector reduction example ===

This example starts with an algorithm which involves reduction. Just as with the previous example, it will be first shown in scalar instructions, then SIMD, and finally vector instructions, starting in [[C (programming language)|c]]:
<syntaxhighlight lang=c>
void (size_t n, int a, const int x[]) {
    int y = 0;
    for (size_t i = 0; i < n; i++)
        y += x[i];
    return y;
}
</syntaxhighlight>

Here, an accumulator (y) is used to sum up all the values in the array, x.

==== Scalar assembler ====

The scalar version of this would load each of x, add it to y, and loop:

<syntaxhighlight lang=gas>
  set     y, 0     ; y initialised to zero
loop:
  load32  r1, x    ; load one 32bit data
  add32   y, y, r1 ; y := y + r1
  addl    x, x, $4 ; x := x + 4
  subl    n, n, $1 ; n := n - 1
  jgz     n, loop  ; loop back if n > 0
out:
  ret y            ; returns result, y
</syntaxhighlight>

This is very straightforward. "y" starts at zero, 32 bit integers are loaded one at a time into r1, added to y, and the address of the array "x" moved on to the next element in the array.

==== SIMD reduction ====

This is where the problems start. SIMD by design is incapable of doing arithmetic operations "inter-element". Element 0 of one SIMD register may be added to Element 0 of another register, but Element 0 may '''not''' be added to anything '''other''' than another Element 0. This places some severe limitations on potential implementations. For simplicity it can be assumed that n is exactly 8:

<syntaxhighlight lang=gas>
  addl      r3, x, $16 ; for 2nd 4 of x
  load32x4  v1, x      ; first 4 of x
  load32x4  v2, r3     ; 2nd 4 of x
  add32x4   v1, v2, v1 ; add 2 groups
</syntaxhighlight>

At this point four adds have been performed:
* {{code|x[0]+x[4]}} - First SIMD ADD: element 0 of first group added to element 0 of second group
* {{code|x[1]+x[5]}} - Second SIMD ADD: element 1 of first group added to element 1 of second group
* {{code|x[2]+x[6]}} - Third SIMD ADD: element 2 of first group added to element 2 of second group
* {{code|x[3]+x[7]}} - Fourth SIMD ADD: element 3 of first group added to element 3 of second group
but with 4-wide SIMD being incapable '''by design''' of adding {{code|x[0]+x[1]}} for example, things go rapidly downhill just as they did with the general case of using SIMD for general-purpose IAXPY loops. To sum the four partial results, two-wide SIMD can be used, followed by a single scalar add, to finally produce the answer, but, frequently, the data must be transferred out of dedicated SIMD registers before the last scalar computation can be performed.

Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements. Finally, the four partial results have to be summed. Other techniques involve shuffle:
examples online can be found for [[AVX-512]] of how to do "Horizontal Sum"<ref>{{Cite web|url=https://stackoverflow.com/questions/52782940/1-to-4-broadcast-and-4-to-1-reduce-in-avx-512|title = Sse - 1-to-4 broadcast and 4-to-1 reduce in AVX-512}}</ref><ref>{{Cite web|url=https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026|title = Assembly - Fastest way to do horizontal SSE vector sum (Or other reduction)}}</ref>

Aside from the size of the program and the complexity, an additional potential problem arises if floating-point computation is involved: the fact that the values are not being summed in strict order (four partial results) could result in rounding errors.

==== Vector ISA reduction ====

Vector instruction sets have arithmetic reduction operations ''built-in'' to the ISA. If it is assumed that n is less or equal to the maximum vector length, only three instructions are required:

<syntaxhighlight lang=gas>
  setvl      t0, n  # VL=t0=min(MVL, n)
  vld32      v0, x  # load vector x
  vredadd32  y, v0  # reduce-add into y
</syntaxhighlight>

The code when n is larger than the maximum vector length is not that much more complex, and is a similar pattern to the first example ("IAXPY").

<syntaxhighlight lang=gas>
  set     y, 0
vloop:
  setvl   t0, n      # VL=t0=min(MVL, n)
  vld32   v0, x      # load vector x
  vredadd32 y, y, v0 # add all x into y
  add     x, t0*4    # advance x by VL*4
  sub     n, t0      # n -= VL (t0)
  bnez    n, vloop   # repeat if n != 0
  ret y
</syntaxhighlight>

The simplicity of the algorithm is stark in comparison to SIMD. Again, just as with the IAXPY example, the algorithm is length-agnostic (even on Embedded implementations where maximum vector length could be only one).

Implementations in hardware may, if they are certain that the right answer will be produced, perform the reduction in parallel. Some vector ISAs offer a parallel reduction mode as an explicit option, for when the programmer knows that any potential rounding errors do not matter, and low latency is critical.<ref>{{Cite web|url=https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-reduction-operations|title = Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec|website = [[GitHub]]| date=19 November 2022 }}</ref>

This example again highlights a key critical fundamental difference between true vector processors and those SIMD processors, including most commercial GPUs, which are inspired by features of vector processors.