
==== Pure (true) vector ISA ====

For Cray-style vector ISAs such as RVV, an instruction called "{{Not a typo|setvl}}" (set vector length) is used. The hardware first defines how many data values it can process in one "vector": this may correspond to actual registers or to an internal loop (the hybrid approach, mentioned above). This maximum (nominally the number of hardware "lanes") is termed "MVL" (Maximum Vector Length). As seen in SX-Aurora and Videocore IV, MVL may be an actual hardware lane quantity ''or a virtual one''. As the ARM SVE2 tutorial points out, programmers must not assume a fixed vector width; consequently MVL is not a quantity that the programmer needs to know, which can take some adjustment for those used to fixed-width SIMD.

When {{Not a typo|setvl}} is called with the number of outstanding data elements to be processed, it is permitted (in effect required) to limit that number to the Maximum Vector Length (MVL). It returns the ''actual'' number of elements that the hardware can process in subsequent vector instructions and sets the internal special register, "VL", to that same amount. ARM refers to this technique as "vector length agnostic" programming in its tutorials on SVE2.<ref>[https://developer.arm.com/documentation/102131/latest/ ARM SVE2 tutorial]</ref>
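The behaviour of {{Not a typo|setvl}} described above can be sketched in Python. This is only an illustrative model: the class name <code>VectorUnit</code> and its attributes are hypothetical and do not correspond to any real ISA or library.

```python
class VectorUnit:
    """Toy model of the state touched by setvl (names are hypothetical)."""

    def __init__(self, mvl):
        self.mvl = mvl   # Maximum Vector Length, fixed by the hardware
        self.vl = 0      # the internal special register "VL"

    def setvl(self, n):
        """Clamp the requested element count n to MVL.

        Sets VL and returns the same value, so that the program can keep
        a copy in an ordinary scalar register.
        """
        self.vl = min(self.mvl, n)
        return self.vl
```

With <code>mvl=4</code>, a request for 10 elements yields 4 and a request for 3 elements yields 3. The caller never needs to inspect <code>mvl</code> directly.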

Below is the Cray-style vector assembler for the same loop as the SIMD example above. Note that t0 (which holds a convenient copy of VL and can therefore vary) is used instead of hard-coded constants:

<syntaxhighlight lang=gas>
vloop:
  setvl   t0, n      # VL=t0=min(MVL, n)
  vld32   v0, x      # load vector x
  vld32   v1, y      # load vector y
  vmadd32 v1, v0, a  # v1 += v0 * a
  vst32   v1, y      # store vector y
  add     y, t0*4    # advance y by VL*4
  add     x, t0*4    # advance x by VL*4
  sub     n, t0      # n -= VL (t0)
  bnez    n, vloop   # repeat if n != 0
</syntaxhighlight>

This is not very different from the SIMD version (which processes 4 data elements per loop) or from the initial scalar version (which processes just one). n still contains the number of data elements remaining to be processed, but t0 contains the copy of VL: the number that is ''going'' to be processed in each iteration. t0 is subtracted from n after each iteration, and when n reaches zero all elements have been processed.
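The loop above can be mirrored line by line in Python as a rough sketch. The function name <code>vector_axpy</code> is hypothetical, and list indices stand in for the x and y pointers, so they advance by t0 elements rather than by t0 times four bytes. The input y is assumed to be at least as long as x.

```python
def vector_axpy(a, x, y, mvl):
    """Compute y[i] += a * x[i] in strips of at most mvl elements.

    Returns the updated copy of y and the number of loop iterations.
    """
    y = list(y)          # work on a copy; the caller's list is untouched
    n = len(x)           # elements still to be processed
    px = py = 0          # element indices standing in for pointers
    iterations = 0
    while True:
        t0 = min(mvl, n)                                # setvl   t0, n
        v0 = x[px:px + t0]                              # vld32   v0, x
        v1 = y[py:py + t0]                              # vld32   v1, y
        v1 = [yi + xi * a for xi, yi in zip(v0, v1)]    # vmadd32 v1, v0, a
        y[py:py + t0] = v1                              # vst32   v1, y
        py += t0                                        # add     y, t0*4
        px += t0                                        # add     x, t0*4
        n -= t0                                         # sub     n, t0
        iterations += 1
        if n == 0:                                      # bnez    n, vloop
            break
    return y, iterations
```

Calling it with ten elements and <code>mvl=4</code> takes three iterations (4, 4 and 2 elements), while <code>mvl=64</code> takes one. The result is the same in both cases.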

Several things are worth noting when comparing against the predicated SIMD assembly variant:
# The {{code|setvl}} instruction has embedded within it a {{code|min}} instruction.
# Where the SIMD variant hard-coded the width (4) both into the creation of the mask ''and'' into the SIMD width (load32x4 etc.), the vector ISA equivalents have no such limit. This makes vector programs portable, vendor-independent, and future-proof.
# Setting VL effectively ''creates a hidden predicate mask'' that is automatically applied to the vectors.
# Where with predicated SIMD the mask bit length is limited to what may be held in a scalar (or special mask) register, vector ISAs' mask registers have no such limitation. Cray-1 vectors could be just over 1,000 elements (in 1977).
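The "hidden predicate mask" created by VL can be illustrated with a small sketch. All function names here are hypothetical, and the sketch assumes that elements at or beyond VL are left unchanged; real ISAs may define other policies for those tail elements.

```python
def vl_to_mask(vl, mvl):
    """Expand VL into the equivalent explicit mask: the first vl lanes are active."""
    return [i < vl for i in range(mvl)]


def vadd_with_mask(dst, src, mask):
    """Predicated SIMD style: add only in lanes whose mask bit is set."""
    return [d + s if m else d for d, s, m in zip(dst, src, mask)]


def vadd_with_vl(dst, src, vl):
    """Vector ISA style: the operation covers the first vl elements only."""
    out = list(dst)
    for i in range(vl):
        out[i] = dst[i] + src[i]
    return out
```

For any VL between 0 and MVL, <code>vadd_with_vl(dst, src, vl)</code> gives the same result as <code>vadd_with_mask(dst, src, vl_to_mask(vl, mvl))</code>, which is the sense in which setting VL acts as an implicit mask.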

Taken together, these points show how vector ISAs reduce the number of instructions.

Note also that, just as in the predicated SIMD variant, the pointers to x and y are advanced by t0 times four because they both point to 32-bit data, whereas n is decremented by t0 itself. Compared to the fixed-size SIMD assembler there is very little apparent difference: there, x and y are advanced by a hard-coded constant 16 and n is decremented by a hard-coded 4, so at first it is hard to appreciate the significance. The difference lies in the fact that the vector hardware could be capable of doing 4 simultaneous operations, or 64, or 10,000: it would be the exact same vector assembler for all of them ''and there would still be no SIMD cleanup code''. Even compared to the predicate-capable SIMD, it is still more compact, clearer, and uses fewer resources.

Not only is the program much more compact (saving space in the L1 cache), but, as previously mentioned, the vector version can issue far more data processing to the ALUs per instruction, again saving power because instruction decode and issue can sit idle.
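The effect on instruction issue can be estimated with simple arithmetic. The vector loop listed above contains nine instructions, and each pass handles up to MVL elements. The sketch below (with the hypothetical name <code>dynamic_instruction_count</code>) counts the instructions executed by that loop. It is a rough model that ignores everything outside the loop body.

```python
LOOP_BODY_INSTRUCTIONS = 9  # setvl, 2 x vld32, vmadd32, vst32, 2 x add, sub, bnez


def dynamic_instruction_count(n, mvl):
    """Instructions executed by the vector loop for n elements.

    The branch is at the bottom of the loop, so the body runs at least
    once even when n is zero.
    """
    iterations = max(1, (n + mvl - 1) // mvl)  # ceil(n / mvl), minimum 1
    return LOOP_BODY_INSTRUCTIONS * iterations
```

For 10,000 elements this gives 22,500 instructions with an MVL of 4, 1,413 with an MVL of 64, and just 9 with an MVL of 10,000, all from the same unmodified program.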

Additionally, the number of elements passed in to the function can be zero. This sets the vector length to zero, which effectively disables all vector instructions, turning them into [[no-op]]s at runtime. Thus, unlike non-predicated SIMD, even when there are no elements to process there is still no wasted cleanup code.