== Vector processor features ==
Where many SIMD ISAs borrow from or are inspired by the list below, typical features that a vector processor will have are:<ref>[http://www.lanl.gov/conferencess/salishan/salishan2004/scott.pdf Cray Overview]</ref><ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc RISC-V RVV ISA]</ref><ref>[https://sx-aurora.github.io/posts/VE-HW-overview/ SX-Aurora Overview]</ref>
* '''Vector Load and Store''' – Vector architectures with a register-to-register design (analogous to load–store architectures for scalar processors) have instructions for transferring multiple elements between memory and the vector registers. Typically, multiple addressing modes are supported. The unit-stride addressing mode is essential; modern vector architectures typically also support arbitrary constant strides, as well as the scatter/gather (also called ''indexed'') addressing mode. Advanced architectures may also include support for ''segment'' loads and stores, and ''fail-first'' variants of the standard vector loads and stores. Segment loads read a vector from memory where each element is a [[data structure]] containing multiple members. The members are extracted from each data structure (element), and each extracted member is placed into a different vector register.
* '''Masked Operations''' – [[Predication (computer architecture)|predicate masks]] allow parallel if/then/else constructs without resorting to branches. This allows code with conditional statements to be vectorized.
* '''Compress and Expand''' – usually using a bit-mask, data is linearly compressed or expanded (redistributed) based on whether bits in the mask are set or clear, whilst always preserving the sequential order and never duplicating values (unlike Gather-Scatter aka permute). These instructions feature in [[AVX-512#Compress and expand|AVX-512]].
* '''Register Gather, Scatter (aka permute)'''<ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions RVV register gather-scatter instructions]</ref> – a less restrictive, more generic variation of the compress/expand theme, which instead takes one vector to specify the indices to use to "reorder" another vector. Gather/scatter is more complex to implement than compress/expand and, being inherently non-sequential, can interfere with [[Chaining (vector processing)|vector chaining]]. Not to be confused with [[Gather-scatter]] memory load/store modes, gather/scatter vector operations act on the vector registers, and are often termed a [[permute instruction]] instead.
* '''Splat and Extract''' – useful for interaction between scalar and vector, these broadcast a single value across a vector, or extract one item from a vector, respectively.
* '''Iota''' – a very simple and strategically useful instruction which drops sequentially-incrementing immediates into successive elements. It usually starts from zero.
* '''Reduction and [[Iteration#Computing|Iteration]]''' – operations that perform [[mapreduce]] on a vector (for example, find the one maximum value of an entire vector, or sum all elements). Iteration is of the form <code>x[i] = y[i] + x[i-1]</code>, whereas Reduction is of the form <code>x = y[0] + y[1] + … + y[n-1]</code>.
* '''Matrix Multiply support''' – either by way of algorithmically loading data from memory, or reordering (remapping) the normally linear access to vector elements, or providing "Accumulators", arbitrary-sized matrices may be efficiently processed. IBM POWER10 provides MMA instructions,<ref>{{Cite web|url=https://m.youtube.com/watch?v=27VRdI2BGWg&t=1260 |archive-url=https://ghostarchive.org/varchive/youtube/20211211/27VRdI2BGWg| archive-date=2021-12-11 |url-status=live|title = IBM's POWER10 Processor - William Starke & Brian W. Thompto, IBM|website = [[YouTube]]|date=25 September 2020 }}{{cbignore}}</ref> although for arbitrary matrix widths that do not fit the exact SIMD size, data-repetition techniques are needed, which is wasteful of register file resources.<ref>{{Cite arXiv <!-- unsupported parameter |url=https://arxiv.org/pdf/2104.03142 --> |eprint=2104.03142 |last1 = Moreira|first1 = José E.|last2 = Barton|first2 = Kit|last3 = Battle|first3 = Steven|last4 = Bergner|first4 = Peter|last5 = Bertran|first5 = Ramon|last6 = Bhat|first6 = Puneeth|last7 = Caldeira|first7 = Pedro|last8 = Edelsohn|first8 = David|last9 = Fossum|first9 = Gordon|last10 = Frey|first10 = Brad|last11 = Ivanovic|first11 = Nemanja|last12 = Kerchner|first12 = Chip|last13 = Lim|first13 = Vincent|last14 = Kapoor|first14 = Shakti|author15 = Tulio Machado Filho|author16 = Silvia Melitta Mueller|last17 = Olsson|first17 = Brett|last18 = Sadasivam|first18 = Satish|last19 = Saleil|first19 = Baptiste|last20 = Schmidt|first20 = Bill|last21 = Srinivasaraghavan|first21 = Rajalakshmi|last22 = Srivatsan|first22 = Shricharan|last23 = Thompto|first23 = Brian|last24 = Wagner|first24 = Andreas|last25 = Wu|first25 = Nelson|title = A matrix math facility for Power ISA(TM) processors|year = 2021|class = cs.AR}}</ref><ref>{{Cite book|chapter-url=https://link.springer.com/chapter/10.1007/978-1-4471-1011-8_8|doi=10.1007/978-1-4471-1011-8_8|chapter=A Modular Massively Parallel Processor for Volumetric Visualisation Processing|title=High Performance Computing for Computer Graphics and Visualisation|year=1996|last1=Krikelis|first1=Anargyros|pages=101–124|isbn=978-3-540-76016-0}}</ref> Nvidia provides a high-level matrix [[CUDA]] API, although the internal details are not available.<ref>{{Cite web|url=https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma|title = CUDA C++ Programming Guide}}</ref> The most resource-efficient technique is in-place reordering of access to otherwise linear vector data.
* '''Advanced Math formats''' – often includes [[Galois field]] arithmetic, but can include [[binary-coded decimal]] or decimal fixed-point, and support for much larger (arbitrary-precision) arithmetic operations by supporting parallel carry-in and carry-out.
* '''[[Bit manipulation]]''' – including vectorised versions of bit-level permutation operations, bitfield insert and extract, centrifuge operations, population count, and [[Bit Manipulation Instruction Sets|many others]].

=== GPU vector processing features ===
With many 3D [[shader]] applications needing [[trigonometric]] operations as well as short vectors for common operations (RGB, ARGB, XYZ, XYZW), support for the following is typically present in modern GPUs, in addition to the features found in vector processors:
* '''Sub-vectors''' – elements may typically contain two, three or four sub-elements (vec2, vec3, vec4), where any given bit of a predicate mask applies to the whole vec2/3/4, not to the individual elements in the sub-vector. Sub-vectors were also introduced in RISC-V RVV (termed "LMUL").<ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#mapping-for-lmul-1-2 LMUL > 1 in RVV]</ref> Sub-vectors are an integral part of the [[Vulkan]] [[SPIR-V]] spec.
* '''Sub-vector Swizzle''' – aka "Lane Shuffling", which allows sub-vector inter-element computations without needing extra (costly, wasteful) instructions to move the sub-elements into the correct SIMD "lanes", and which also saves predicate mask bits. Effectively an in-flight [[permute instruction|mini-permute]] of the sub-vector, this heavily features in 3D shader binaries and is sufficiently important as to be part of the Vulkan SPIR-V spec.
The Broadcom [[Videocore]] IV uses the terminology "Lane rotate"<ref>[https://patents.google.com/patent/US20110227920 Abandoned US patent US20110227920-0096]</ref> where the rest of the industry uses the term [[Swizzling (computer graphics)|"swizzle"]].<ref>[https://github.com/hermanhermitage/videocoreiv-qpu Videocore IV QPU]</ref>
* '''Transcendentals''' – [[trigonometric]] operations such as [[sine]], [[cosine]] and [[logarithm]] obviously feature much more prominently in 3D than in many demanding [[High-performance computing|HPC]] workloads. Of interest, however, is that in 3D speed is far more important for GPUs than accuracy, since computation of pixel coordinates simply does not require high precision. The Vulkan specification recognises this and sets surprisingly low accuracy requirements, so that GPU hardware can reduce power usage. The concept of reducing accuracy where it is simply not needed is also explored in the [[MIPS-3D]] extension.

=== Fault (or Fail) First ===
Introduced in ARM SVE2 and RISC-V RVV is the concept of speculative sequential vector loads. ARM SVE2 has a special register named the "First Fault Register",<ref>[https://developer.arm.com/tools-and-software/server-and-hpc/compile/arm-instruction-emulator/resources/tutorials/sve/sve-vs-sve2/single-page Introduction to ARM SVE2]</ref> whereas RVV modifies (truncates) the Vector Length (VL).<ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#unit-stride-fault-only-first-loads RVV fault-first loads]</ref>

The basic principle of {{Not a typo|ffirst}} is to attempt a large sequential vector load, but to allow the hardware to arbitrarily truncate the ''actual'' amount loaded, either to the amount that would succeed without raising a memory fault, or simply to an amount (greater than zero) that is most convenient.
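The truncation behaviour just described can be modelled in scalar Python. This is an illustrative sketch only: the function name, the tiny page-size constant, and the "stop at a page boundary" truncation policy are assumptions for demonstration, not taken from the SVE2 or RVV specifications.

```python
# Hypothetical sketch of fault-first ("ffirst") vector load semantics.
# Names and the page-size constant are illustrative, not from any real ISA.

PAGE_SIZE = 16  # deliberately tiny page size, purely for demonstration


def fault_first_load(memory, base, vl):
    """Attempt to load `vl` elements starting at `base`.

    Element 0 must succeed (otherwise a real fault is raised); for later
    elements the hardware may truncate the load, e.g. at the end of
    mapped memory or at a page boundary.  Returns the data actually
    loaded and the (possibly reduced) effective vector length.
    """
    if base >= len(memory):
        raise MemoryError("fault on element 0 is a real fault")
    # Hardware is free to stop at the end of mapped memory...
    end = min(base + vl, len(memory))
    # ...or even earlier, e.g. exactly on a page boundary,
    # as long as at least one element is returned.
    next_page = (base // PAGE_SIZE + 1) * PAGE_SIZE
    end = min(end, max(base + 1, next_page))
    data = memory[base:end]
    return data, len(data)


memory = list(range(40))  # 40 mapped elements
data, vl = fault_first_load(memory, base=12, vl=8)
# The load was truncated at the page boundary (offset 16), not faulted:
print(vl, data)  # 4 [12, 13, 14, 15]
```

The caller never sees a fault for elements beyond the first; it simply receives a shorter effective vector length, exactly the property the subsequent loop iterations exploit.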
The important factor is that ''subsequent'' instructions are notified, or may determine, exactly how many loads actually succeeded, using that quantity to carry out work only on the data that has actually been loaded.

Contrast this situation with SIMD, which has a fixed (inflexible) load width and fixed data-processing width, and is unable to cope with loads that cross page boundaries; even if it could, it would be unable to adapt to what actually succeeded. Paradoxically, if a SIMD program were even to attempt to find out in advance (in each inner loop, every time) what might optimally succeed, those instructions would only serve to hinder performance because they would, by necessity, be part of the critical inner loop.

This begins to hint at the reason why {{Not a typo|ffirst}} is so innovative, and is best illustrated by memcpy or strcpy when implemented with standard 128-bit non-predicated {{Not a typo|non-ffirst}} SIMD. For IBM POWER9 the number of hand-optimised instructions needed to implement strncpy is in excess of 240.<ref>[https://patchwork.ozlabs.org/project/glibc/patch/20200904165653.16202-1-rzinsly@linux.ibm.com/ PATCH to libc6 to add optimised POWER9 strncpy]</ref> By contrast, the same strncpy routine in hand-optimised RVV assembler is a mere 22 instructions.<ref>[https://github.com/riscv/riscv-v-spec/blob/master/example/strncpy.s RVV strncpy example]</ref>

The above SIMD example could potentially fault and fail at the end of memory, due to attempts to read too many values; it could also cause significant numbers of page or misalignment faults by similarly crossing over boundaries. In contrast, by allowing the vector architecture the freedom to decide how many elements to load, the first part of a strncpy, if beginning on a sub-optimal memory boundary, may return just enough loads such that on ''subsequent'' iterations of the loop the batches of vectorised memory reads are optimally aligned with the underlying caches and virtual memory arrangements.
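The shape of such a strncpy loop can be sketched in scalar Python. This is a hedged illustration, not the cited RVV assembler: `fault_first_load` is a hypothetical stand-in for a hardware ffirst load that may return fewer elements than requested (here it stops at a simulated page boundary) but always returns at least one.

```python
# Sketch of a strncpy-style loop built on a fault-first load.
# `fault_first_load` and its page-boundary truncation are illustrative
# stand-ins for hardware behaviour, not any real ISA's definition.

PAGE = 16  # illustrative page size


def fault_first_load(src, base, vl):
    """Return between 1 and `vl` elements; may truncate at a page boundary."""
    end = min(base + vl, len(src))
    next_page = (base // PAGE + 1) * PAGE          # hardware may stop here
    end = min(end, max(base + 1, next_page))
    return src[base:end]


def vect_strncpy(src, n, vl_max=8):
    """Copy up to n elements from src, stopping after a NUL (0) element."""
    dst = []
    i = 0
    while i < n:
        chunk = fault_first_load(src, i, min(vl_max, n - i))
        if 0 in chunk:                             # NUL terminator found
            dst.extend(chunk[:chunk.index(0) + 1])
            return dst
        dst.extend(chunk)                          # no NUL: copy whole chunk
        i += len(chunk)                            # advance by the *actual* VL
    return dst


src = [ord(c) for c in "hello, vector world"] + [0]
assert vect_strncpy(src, 32) == src  # copies the string including the NUL
```

The key point is the last line of the loop: the program advances by however many elements the hardware chose to load, so alignment and fault avoidance are the hardware's problem, not the program's.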
Additionally, the hardware may choose to use the opportunity to end any given loop iteration's memory reads ''exactly'' on a page boundary (avoiding a costly second TLB lookup), with speculative execution preparing the next virtual memory page whilst data is still being processed in the current loop. All of this is determined by the hardware, not the program itself.<ref>[https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf ARM SVE2 paper by N. Stevens]</ref>
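Several of the vector-processor features listed earlier (masked operations, compress, iota, and reduction) can also be modelled as scalar Python functions. These are semantics sketches under the descriptions given above, not any particular ISA's definition.

```python
# Scalar models of four vector-processor features described in this
# section.  Illustrative only; no real ISA's encoding is implied.


def masked_add(dest, a, b, mask):
    """Predicated add: only lanes whose mask bit is set are updated."""
    return [x + y if m else d for d, x, y, m in zip(dest, a, b, mask)]


def compress(vec, mask):
    """Keep masked-in elements, preserving sequential order and never
    duplicating values (as in AVX-512 compress)."""
    return [x for x, m in zip(vec, mask) if m]


def iota(n, start=0):
    """Sequentially incrementing elements, usually starting from zero."""
    return list(range(start, start + n))


def reduce_sum(vec):
    """Reduction of the form x = y[0] + y[1] + ... + y[n-1]."""
    total = 0
    for y in vec:
        total += y
    return total


mask = [1, 0, 1, 1]
print(masked_add([0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40], mask))
# [11, 0, 33, 44] -- masked-out lane 1 keeps its old value
print(compress([1, 2, 3, 4], mask))  # [1, 3, 4]
print(iota(4))                       # [0, 1, 2, 3]
print(reduce_sum([1, 2, 3, 4]))      # 10
```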