==== Pure (non-predicated, packed) SIMD ====

A modern packed SIMD architecture, known by many names (listed in [[Flynn's taxonomy#Pipelined processor|Flynn's taxonomy]]), can do most of the operation in batches. The code is mostly similar to the scalar version. It is assumed here that both x and y are [[Data structure alignment|properly aligned]] (they start only on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. It can also be assumed, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, as ARM NEON can.<ref>{{Cite web|url=https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-3-matrix-multiplication|title=Coding for Neon - Part 3 Matrix Multiplication|date=11 September 2013}}</ref> If they do not, a "splat" (broadcast) must be used to copy the scalar argument across a SIMD register:

<syntaxhighlight lang=gas>
  splatx4   v4, a        ; v4 = a,a,a,a
</syntaxhighlight>

The time taken would be essentially the same as for the vector implementation of {{code|1=y = mx + c}} described above.

<syntaxhighlight lang=gas>
vloop:
  load32x4  v1, x
  load32x4  v2, y
  mul32x4   v1, a, v1    ; v1 := v1 * a
  add32x4   v3, v1, v2   ; v3 := v1 + v2
  store32x4 v3, y
  addl      x, x, $16    ; x := x + 16
  addl      y, y, $16
  subl      n, n, $4     ; n := n - 4
  jgz       n, vloop     ; go back if n > 0
out:
  ret
</syntaxhighlight>

Note that both the x and y pointers are incremented by 16, because that is how long (in bytes) four 32-bit integers are. The decision was made that the algorithm ''shall'' only cope with 4-wide SIMD, therefore the constant is hard-coded into the program.

Unfortunately for SIMD, the clue was in the assumptions above: "that n is a multiple of 4" and "aligned access", which clearly describe a limited specialist use-case. Realistically, for general-purpose loops such as in portable libraries, where n cannot be limited in this way, the overhead of the setup and cleanup that SIMD needs in order to cope with non-multiples of the SIMD width can far exceed the instruction count inside the loop itself. Assuming, worst-case, that the hardware cannot do misaligned SIMD memory accesses, a real-world algorithm (sketched in C below) will:

* first have to have a preparatory section which works on the beginning unaligned data, up to the first point where SIMD memory-aligned operations can take over. This will involve either (slower) scalar-only operations or smaller-sized packed SIMD operations; each copy implements the full inner loop of the algorithm.
* perform the aligned SIMD loop at the maximum SIMD width up until the last few elements (those remaining that do not fit the fixed SIMD width)
* have a cleanup phase which, like the preparatory section, is just as large and just as complex.

Eight-wide SIMD requires repeating the inner loop algorithm first with four-wide SIMD elements, then two-wide SIMD, then one (scalar), with a test and branch in between each one, in order to cover the first and last remaining SIMD elements (0 <= n <= 7). This more than ''triples'' the size of the code; in extreme cases it results in an ''order of magnitude'' increase in instruction count. This can easily be demonstrated by compiling the iaxpy example for [[AVX-512]], using the options {{code|1="-O3 -march=knl"}} to [[GNU Compiler Collection|gcc]]. Over time, as the ISA evolves to keep increasing performance, ISA architects end up adding 2-wide SIMD, then 4-wide SIMD, then 8-wide and upwards.
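The shape of such a loop can be sketched in C. This is an illustration only, not code from any particular library: the function name is invented for the example, and the hard-coded width of 4 and the 16-byte alignment requirement mirror the packed pseudo-assembly above (y is assumed to share x's alignment, as in that example).

<syntaxhighlight lang=c>
#include <stddef.h>
#include <stdint.h>

/* iaxpy: y[i] = a*x[i] + y[i], written the way a fixed-width,
 * alignment-sensitive packed SIMD target forces it to be structured. */
void iaxpy_simd_style(size_t n, int32_t a, const int32_t *x, int32_t *y)
{
    size_t i = 0;

    /* Preparatory section: scalar operations until x reaches a 16-byte
     * boundary, so that the main loop may use aligned packed accesses. */
    while (i < n && ((uintptr_t)(x + i) & 15) != 0) {
        y[i] = a * x[i] + y[i];
        i++;
    }

    /* Main loop: four elements per pass; a SIMD compiler maps this body to
     * one 4-wide multiply and add, as in the pseudo-assembly above. */
    for (; i + 4 <= n; i += 4) {
        y[i + 0] = a * x[i + 0] + y[i + 0];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }

    /* Cleanup phase: up to three leftover elements, again handled scalar. */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}
</syntaxhighlight>

With 8-wide SIMD the preparatory and cleanup phases grow further still, because the leftover count at each end can range from 0 to 7 elements, which is the code-size blow-up described above.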
It can therefore be seen why [[AVX-512]] exists in x86. Without predication, the wider the SIMD width, the worse the problems get, leading to massive opcode proliferation, degraded performance, extra power consumption and unnecessary software complexity.<ref>[https://www.sigarch.org/simd-instructions-considered-harmful/ SIMD Instructions Considered Harmful]</ref> Vector processors, on the other hand, are designed to issue computations of variable length for an arbitrary count n, and thus require very little setup and no cleanup. Even compared to those SIMD ISAs which have masks (but no {{code|setvl}} instruction), vector processors produce much more compact code, because they do not need to perform explicit mask calculation to cover the last few elements (illustrated below).
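In C-like terms, the behaviour of a {{code|setvl}}-style loop can be modelled as follows. This is an illustrative sketch only: the function name and {{code|MAXVL}} are invented for the example, and the short inner loop stands in for what, on real vector hardware, is a single variable-length vector operation.

<syntaxhighlight lang=c>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical maximum hardware vector length, in elements. A real vector
 * processor reports this itself; the program never needs to hard-code it. */
#define MAXVL 8

/* Strip-mined iaxpy in the style of a setvl-based vector ISA: each pass asks
 * for "up to n" elements, is granted vl, and processes exactly that many.
 * There is no alignment prologue and no separate scalar cleanup loop. */
void iaxpy_vector_style(size_t n, int32_t a, const int32_t *x, int32_t *y)
{
    while (n > 0) {
        /* Models "setvl": the hardware grants vl = min(n, MAXVL). */
        size_t vl = (n < MAXVL) ? n : MAXVL;

        /* Models one variable-length vector multiply-add over vl elements. */
        for (size_t i = 0; i < vl; i++)
            y[i] = a * x[i] + y[i];

        x += vl;
        y += vl;
        n -= vl;
    }
}
</syntaxhighlight>

The final, shorter pass is handled by the same code as every other pass, which is why no separate cleanup phase is needed.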