Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Vector processor
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
=== Vector reduction example === This example starts with an algorithm which involves reduction. Just as with the previous example, it will be first shown in scalar instructions, then SIMD, and finally vector instructions, starting in [[C (programming language)|c]]: <syntaxhighlight lang=c> void (size_t n, int a, const int x[]) { int y = 0; for (size_t i = 0; i < n; i++) y += x[i]; return y; } </syntaxhighlight> Here, an accumulator (y) is used to sum up all the values in the array, x. ==== Scalar assembler ==== The scalar version of this would load each of x, add it to y, and loop: <syntaxhighlight lang=gas> set y, 0 ; y initialised to zero loop: load32 r1, x ; load one 32bit data add32 y, y, r1 ; y := y + r1 addl x, x, $4 ; x := x + 4 subl n, n, $1 ; n := n - 1 jgz n, loop ; loop back if n > 0 out: ret y ; returns result, y </syntaxhighlight> This is very straightforward. "y" starts at zero, 32 bit integers are loaded one at a time into r1, added to y, and the address of the array "x" moved on to the next element in the array. ==== SIMD reduction ==== This is where the problems start. SIMD by design is incapable of doing arithmetic operations "inter-element". Element 0 of one SIMD register may be added to Element 0 of another register, but Element 0 may '''not''' be added to anything '''other''' than another Element 0. This places some severe limitations on potential implementations. For simplicity it can be assumed that n is exactly 8: <syntaxhighlight lang=gas> addl r3, x, $16 ; for 2nd 4 of x load32x4 v1, x ; first 4 of x load32x4 v2, r3 ; 2nd 4 of x add32x4 v1, v2, v1 ; add 2 groups </syntaxhighlight> At this point four adds have been performed: * {{code|x[0]+x[4]}} - First SIMD ADD: element 0 of first group added to element 0 of second group * {{code|x[1]+x[5]}} - Second SIMD ADD: element 1 of first group added to element 1 of second group * {{code|x[2]+x[6]}} - Third SIMD ADD: element 2 of first group added to element 2 of second group * {{code|x[3]+x[7]}} - Fourth SIMD ADD: element 3 of first group added to element 3 of second group but with 4-wide SIMD being incapable '''by design''' of adding {{code|x[0]+x[1]}} for example, things go rapidly downhill just as they did with the general case of using SIMD for general-purpose IAXPY loops. To sum the four partial results, two-wide SIMD can be used, followed by a single scalar add, to finally produce the answer, but, frequently, the data must be transferred out of dedicated SIMD registers before the last scalar computation can be performed. Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements. Finally, the four partial results have to be summed. Other techniques involve shuffle: examples online can be found for [[AVX-512]] of how to do "Horizontal Sum"<ref>{{Cite web|url=https://stackoverflow.com/questions/52782940/1-to-4-broadcast-and-4-to-1-reduce-in-avx-512|title = Sse - 1-to-4 broadcast and 4-to-1 reduce in AVX-512}}</ref><ref>{{Cite web|url=https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026|title = Assembly - Fastest way to do horizontal SSE vector sum (Or other reduction)}}</ref> Aside from the size of the program and the complexity, an additional potential problem arises if floating-point computation is involved: the fact that the values are not being summed in strict order (four partial results) could result in rounding errors. ==== Vector ISA reduction ==== Vector instruction sets have arithmetic reduction operations ''built-in'' to the ISA. If it is assumed that n is less or equal to the maximum vector length, only three instructions are required: <syntaxhighlight lang=gas> setvl t0, n # VL=t0=min(MVL, n) vld32 v0, x # load vector x vredadd32 y, v0 # reduce-add into y </syntaxhighlight> The code when n is larger than the maximum vector length is not that much more complex, and is a similar pattern to the first example ("IAXPY"). <syntaxhighlight lang=gas> set y, 0 vloop: setvl t0, n # VL=t0=min(MVL, n) vld32 v0, x # load vector x vredadd32 y, y, v0 # add all x into y add x, t0*4 # advance x by VL*4 sub n, t0 # n -= VL (t0) bnez n, vloop # repeat if n != 0 ret y </syntaxhighlight> The simplicity of the algorithm is stark in comparison to SIMD. Again, just as with the IAXPY example, the algorithm is length-agnostic (even on Embedded implementations where maximum vector length could be only one). Implementations in hardware may, if they are certain that the right answer will be produced, perform the reduction in parallel. Some vector ISAs offer a parallel reduction mode as an explicit option, for when the programmer knows that any potential rounding errors do not matter, and low latency is critical.<ref>{{Cite web|url=https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-reduction-operations|title = Riscv-v-spec/V-spec.adoc at master Β· riscv/Riscv-v-spec|website = [[GitHub]]| date=19 November 2022 }}</ref> This example again highlights a key critical fundamental difference between true vector processors and those SIMD processors, including most commercial GPUs, which are inspired by features of vector processors.
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)