Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Vector processor
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
==== SIMD reduction ==== This is where the problems start. SIMD by design is incapable of doing arithmetic operations "inter-element". Element 0 of one SIMD register may be added to Element 0 of another register, but Element 0 may '''not''' be added to anything '''other''' than another Element 0. This places some severe limitations on potential implementations. For simplicity it can be assumed that n is exactly 8: <syntaxhighlight lang=gas> addl r3, x, $16 ; for 2nd 4 of x load32x4 v1, x ; first 4 of x load32x4 v2, r3 ; 2nd 4 of x add32x4 v1, v2, v1 ; add 2 groups </syntaxhighlight> At this point four adds have been performed: * {{code|x[0]+x[4]}} - First SIMD ADD: element 0 of first group added to element 0 of second group * {{code|x[1]+x[5]}} - Second SIMD ADD: element 1 of first group added to element 1 of second group * {{code|x[2]+x[6]}} - Third SIMD ADD: element 2 of first group added to element 2 of second group * {{code|x[3]+x[7]}} - Fourth SIMD ADD: element 3 of first group added to element 3 of second group but with 4-wide SIMD being incapable '''by design''' of adding {{code|x[0]+x[1]}} for example, things go rapidly downhill just as they did with the general case of using SIMD for general-purpose IAXPY loops. To sum the four partial results, two-wide SIMD can be used, followed by a single scalar add, to finally produce the answer, but, frequently, the data must be transferred out of dedicated SIMD registers before the last scalar computation can be performed. Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements. Finally, the four partial results have to be summed. Other techniques involve shuffle: examples online can be found for [[AVX-512]] of how to do "Horizontal Sum"<ref>{{Cite web|url=https://stackoverflow.com/questions/52782940/1-to-4-broadcast-and-4-to-1-reduce-in-avx-512|title = Sse - 1-to-4 broadcast and 4-to-1 reduce in AVX-512}}</ref><ref>{{Cite web|url=https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026|title = Assembly - Fastest way to do horizontal SSE vector sum (Or other reduction)}}</ref> Aside from the size of the program and the complexity, an additional potential problem arises if floating-point computation is involved: the fact that the values are not being summed in strict order (four partial results) could result in rounding errors.
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)