Editing Vector processor (section)

==== SIMD reduction ====

This is where the problems start. SIMD by design is incapable of doing arithmetic operations "inter-element". Element 0 of one SIMD register may be added to Element 0 of another register, but Element 0 may '''not''' be added to anything '''other''' than another Element 0. This places some severe limitations on potential implementations. For simplicity it can be assumed that n is exactly 8:

<syntaxhighlight lang=gas>
  addl      r3, x, $16 ; for 2nd 4 of x
  load32x4  v1, x      ; first 4 of x
  load32x4  v2, r3     ; 2nd 4 of x
  add32x4   v1, v2, v1 ; add 2 groups
</syntaxhighlight>

At this point four adds have been performed:
* {{code|x[0]+x[4]}} - First SIMD ADD: element 0 of first group added to element 0 of second group
* {{code|x[1]+x[5]}} - Second SIMD ADD: element 1 of first group added to element 1 of second group
* {{code|x[2]+x[6]}} - Third SIMD ADD: element 2 of first group added to element 2 of second group
* {{code|x[3]+x[7]}} - Fourth SIMD ADD: element 3 of first group added to element 3 of second group
but with 4-wide SIMD being incapable '''by design''' of adding {{code|x[0]+x[1]}} for example, things go rapidly downhill just as they did with the general case of using SIMD for general-purpose IAXPY loops. To sum the four partial results, two-wide SIMD can be used, followed by a single scalar add, to finally produce the answer, but, frequently, the data must be transferred out of dedicated SIMD registers before the last scalar computation can be performed.

Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements. Finally, the four partial results have to be summed. Other techniques involve shuffle:
examples online can be found for [[AVX-512]] of how to do "Horizontal Sum"<ref>{{Cite web|url=https://stackoverflow.com/questions/52782940/1-to-4-broadcast-and-4-to-1-reduce-in-avx-512|title = Sse - 1-to-4 broadcast and 4-to-1 reduce in AVX-512}}</ref><ref>{{Cite web|url=https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026|title = Assembly - Fastest way to do horizontal SSE vector sum (Or other reduction)}}</ref>

Aside from the size of the program and the complexity, an additional potential problem arises if floating-point computation is involved: the fact that the values are not being summed in strict order (four partial results) could result in rounding errors.