{{Short description|CPU instruction set}}
{{Distinguish|SSSE3}}
'''SSE3''', '''Streaming SIMD Extensions 3''', also known by its [[Intel]] code name '''Prescott New Instructions''' ('''PNI'''),<ref name=":1">{{Cite web |last1=Shimpi |first1=Anand Lal |last2=Wilson |first2=Derek |title=Intel's Pentium 4 E: Prescott Arrives with Luggage |url=https://www.anandtech.com/show/1230 |access-date=2023-04-10 |website=www.anandtech.com}}</ref> is the third iteration of the [[Streaming SIMD Extensions|SSE]] instruction set for the [[IA-32]] (x86) architecture. Intel introduced SSE3 in early 2004 with the [[Pentium 4#Prescott|Prescott]] revision of their [[Pentium 4]] CPU.<ref name=":1" /> In April 2005, [[AMD]] introduced a subset of SSE3 in revision E (Venice and San Diego) of their [[Athlon 64]] CPUs.<ref>{{Cite web |last=Shimpi |first=Anand Lal |title=Industry Update - Q4-2004: AMD adds SSE3 Support, Intel's 925/915 not selling and more |url=https://www.anandtech.com/show/1532 |access-date=2023-04-10 |website=www.anandtech.com}}</ref> The earlier [[SIMD]] instruction sets on the [[x86]] platform, from oldest to newest, are [[MMX (instruction set)|MMX]], [[3DNow!]] (developed by AMD, no longer supported on newer CPUs), [[Streaming SIMD Extensions|SSE]], and [[SSE2]].

SSE3 contains 13 new instructions over [[SSE2]].<ref>{{Cite web |title=Intel Instruction Set Extensions Technology |url=https://www.intel.com/content/www/us/en/support/articles/000005779/processors.html |access-date=2023-04-10 |website=Intel |language=en}}</ref>

==Changes==
The most notable change is the capability to work horizontally in a register, as opposed to the more or less strictly vertical operation of all previous SSE instructions. Specifically, SSE3 adds instructions that add or subtract adjacent values stored within a single register.<ref name=":2">{{Cite web |last=Wright |first=Christopher |title=SSE3 Instruction Set |url=https://softpixel.com/~cwright/programming/simd/sse3.php |access-date=2023-04-10 |website=softpixel.com |language=en}}</ref> These instructions can be used to speed up the implementation of a number of [[Digital signal processing|DSP]] and [[3D computer graphics|3D]] operations. There is also a new x87 instruction, <code>FISTTP</code>, that converts floating-point values to integers without having to change the global rounding mode, thus avoiding costly [[Instruction pipeline|pipeline]] stalls. Finally, the extension adds <code>LDDQU</code>, an alternative misaligned integer vector load that performs better on [[NetBurst]]-based platforms for loads that cross cache-line boundaries.<ref>{{Cite web |title=LDDQU — Load Unaligned Integer 128 Bits |url=https://www.felixcloutier.com/x86/lddqu |access-date=2023-04-10 |website=www.felixcloutier.com}}</ref>
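
The horizontal operation described above can be illustrated with a minimal sketch in C. It assumes a GCC- or Clang-compatible compiler on x86 and uses the <code>_mm_hadd_ps</code> intrinsic from <code>&lt;pmmintrin.h&gt;</code>, which corresponds to <code>HADDPS</code>. The function name <code>hsum4</code> is hypothetical, and the <code>target</code> attribute is a compiler extension used here only so that the example builds without an extra command-line flag.

```c
#include <pmmintrin.h>  /* SSE3 intrinsics; also pulls in SSE and SSE2 */

/* Sum the four floats at p (no alignment required).
 * Assumption: GCC/Clang. The target attribute enables SSE3 for this
 * function only; other compilers need a different mechanism. */
__attribute__((target("sse3")))
float hsum4(const float *p)
{
    __m128 v = _mm_loadu_ps(p);  /* { p0, p1, p2, p3 } */
    v = _mm_hadd_ps(v, v);       /* { p0+p1, p2+p3, p0+p1, p2+p3 } */
    v = _mm_hadd_ps(v, v);       /* every lane holds p0+p1+p2+p3 */
    return _mm_cvtss_f32(v);     /* extract lane 0 */
}
```

Two horizontal adds reduce four lanes to one sum without any explicit shuffle; whether this is faster than a shuffle-based reduction depends on the processor.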

==CPUs with SSE3==
*[[AMD]]:
**[[Opteron]] (since Stepping E4<ref>{{Cite web |last=Wilson |first=Derek |title=AMD K8 E4 Stepping: SSE3 Performance |url=https://www.anandtech.com/show/1618 |access-date=2023-04-10 |website=www.anandtech.com}}</ref>)
**[[Sempron]] (since Palermo, Stepping E3)
**[[Athlon 64]] (since Venice Stepping E3 and San Diego Stepping E4)
**[[Athlon 64|Athlon 64 FX]] (since San Diego Stepping E4)
**[[Athlon 64 X2]]
**[[Phenom 64 X2]]
**[[AMD Turion|Turion]] family
**[[AMD 10h|K10]] family
**[[AMD Accelerated Processing Unit|APU]] family (including models without an integrated GPU)
**[[AMD FX|FX Series]]
**[[Zen (microarchitecture)|Zen]] family
*[[Intel]]:
**[[Celeron D]]
**[[Celeron]] (starting with Core microarchitecture)
**[[Pentium 4]] (since Prescott)
**[[Pentium D]]
**[[Pentium Extreme Edition]] (but not the Pentium 4 Extreme Edition)
**[[Pentium Dual-Core]]
**[[Pentium]] (starting with Core microarchitecture)
**[[Intel Core|Core]]
**[[Xeon]] (since Nocona<ref>{{Cite web |date=2004-08-18 |title=Intel Xeon 3.4GHz ['Nocona' core] |url=https://hexus.net/business/reviews/enterprise/822-intel-xeon-34ghz-nocona-core/ |access-date=2023-04-10 |website=HEXUS}}</ref>)
**[[Intel Atom|Atom]]
*[[VIA Technologies|VIA]]/[[Centaur Technology|Centaur]]:
**[[VIA C7|C7]]
**[[VIA Nano|Nano]]
*[[Transmeta Efficeon]] TM88xx with Code Morphing software update (not model numbers TM86xx)
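
Because support varies between the processors listed above, software usually checks for SSE3 at run time. The following is a minimal sketch, assuming a GCC- or Clang-compatible compiler that provides <code>&lt;cpuid.h&gt;</code>; the function name <code>cpu_has_sse3</code> is hypothetical. SSE3 support is reported in bit 0 of the ECX register returned by CPUID leaf 1.

```c
#include <cpuid.h>  /* GCC/Clang helper for the CPUID instruction (assumption) */

/* Returns 1 if the running CPU reports SSE3 support, 0 otherwise. */
int cpu_has_sse3(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is unavailable. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    return (ecx & 1u) != 0;  /* CPUID leaf 1, ECX bit 0: SSE3 */
}
```

A program would typically call such a check once at start-up and select an SSE3 or a fallback code path accordingly.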

==New instructions==
===Common instructions===
====Arithmetic====
;<code>ADDSUBPD</code>
:''Add-Subtract-Packed-Double''<ref name=":0">{{Cite web |title=SSE3 Instructions - x86 Assembly Language Reference Manual |url=https://docs.oracle.com/cd/E53394_01/html/E54851/gntby.html |access-date=2023-04-10 |website=docs.oracle.com}}</ref>
:* Input: { A0, A1 }, { B0, B1 }
:* Output: { A0 − B0, A1 + B1 }
;<code>ADDSUBPS</code>
:''Add-Subtract-Packed-Single''<ref name=":0" />
:* Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
:* Output: { A0 − B0, A1 + B1, A2 − B2, A3 + B3 }
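
The lane pattern of the two add-subtract instructions can be checked with a short C sketch. It assumes a GCC- or Clang-compatible compiler on x86; <code>_mm_addsub_ps</code> and <code>_mm_addsub_pd</code> are the intrinsics for <code>ADDSUBPS</code> and <code>ADDSUBPD</code>, while the wrapper names <code>addsub_ps</code> and <code>addsub_pd</code> are hypothetical.

```c
#include <pmmintrin.h>  /* SSE3 intrinsics; also pulls in SSE and SSE2 */

/* out = { a0 - b0, a1 + b1, a2 - b2, a3 + b3 }
 * Assumption: GCC/Clang target attribute enables SSE3 per function. */
__attribute__((target("sse3")))
void addsub_ps(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);  /* unaligned loads: no alignment needed */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_addsub_ps(va, vb));
}

/* out = { a0 - b0, a1 + b1 } */
__attribute__((target("sse3")))
void addsub_pd(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_addsub_pd(va, vb));
}
```

For example, the inputs { 10, 20, 30, 40 } and { 1, 2, 3, 4 } give { 9, 22, 27, 44 }: even-indexed lanes are subtracted and odd-indexed lanes are added.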

====AOS (array of structures)====
;<code>HADDPD</code>
:''Horizontal-Add-Packed-Double''<ref name=":0" />
:* Input: { A0, A1 }, { B0, B1 }
:* Output: { A0 + A1, B0 + B1 }
;<code>HADDPS</code>
:''Horizontal-Add-Packed-Single''<ref name=":0" />
:* Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
:* Output: { A0 + A1, A2 + A3, B0 + B1, B2 + B3 }
;<code>HSUBPD</code>
:''Horizontal-Subtract-Packed-Double''<ref name=":0" />
:* Input: { A0, A1 }, { B0, B1 }
:* Output: { A0 − A1, B0 − B1 }
;<code>HSUBPS</code>
:''Horizontal-Subtract-Packed-Single''<ref name=":0" />
:* Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
:* Output: { A0 − A1, A2 − A3, B0 − B1, B2 − B3 }
;<code>LDDQU</code>
:As stated above, this is an alternative misaligned integer vector load.<ref name=":0" /> It can be helpful for video compression tasks.
;<code>[[MOVDDUP]]</code>, <code>MOVSHDUP</code>, <code>MOVSLDUP</code><ref name=":2" />
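:These instructions duplicate elements of the source across the destination: <code>MOVDDUP</code> copies the low double-precision value into both lanes, <code>MOVSHDUP</code> duplicates the odd-indexed single-precision values, and <code>MOVSLDUP</code> duplicates the even-indexed ones. They are useful for complex-number arithmetic and for waveform calculations such as audio processing.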
;<code>FISTTP</code>
:Like the older x87 <code>FISTP</code> instruction, but ignores the rounding mode set in the floating-point control register and always uses the "chop" (truncate) mode instead.<ref name=":2" /> This makes it unnecessary to save, modify and restore the control register, an expensive sequence, in languages such as C, whose standard requires float-to-integer conversion to truncate.
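
The use of the duplicate-and-move instructions for complex numbers can be sketched in C by combining them with <code>ADDSUBPS</code>. The example assumes a GCC- or Clang-compatible compiler on x86 and an interleaved layout { re0, im0, re1, im1 }, so that each register holds two complex numbers; the function name <code>cmul2</code> and this layout are choices made for illustration.

```c
#include <pmmintrin.h>  /* SSE3 intrinsics; also pulls in SSE and SSE2 */

/* Multiply two pairs of complex numbers stored as { re0, im0, re1, im1 }.
 * For x = a + bi and y = c + di the product is (ac - bd) + (ad + bc)i.
 * Assumption: GCC/Clang target attribute enables SSE3 per function. */
__attribute__((target("sse3")))
void cmul2(const float *x, const float *y, float *out)
{
    __m128 vx = _mm_loadu_ps(x);      /* { a0, b0, a1, b1 } */
    __m128 vy = _mm_loadu_ps(y);      /* { c0, d0, c1, d1 } */

    __m128 re = _mm_moveldup_ps(vx);  /* MOVSLDUP: { a0, a0, a1, a1 } */
    __m128 im = _mm_movehdup_ps(vx);  /* MOVSHDUP: { b0, b0, b1, b1 } */

    /* Swap real and imaginary parts of y: { d0, c0, d1, c1 } */
    __m128 ys = _mm_shuffle_ps(vy, vy, _MM_SHUFFLE(2, 3, 0, 1));

    __m128 t1 = _mm_mul_ps(re, vy);   /* { a0c0, a0d0, a1c1, a1d1 } */
    __m128 t2 = _mm_mul_ps(im, ys);   /* { b0d0, b0c0, b1d1, b1c1 } */

    /* ADDSUBPS: subtract in even lanes (real), add in odd lanes (imaginary) */
    _mm_storeu_ps(out, _mm_addsub_ps(t1, t2));
}
```

With x = { 1, 2, 3, 4 } and y = { 5, 6, 7, 8 }, that is (1 + 2i)(5 + 6i) and (3 + 4i)(7 + 8i), the result is { −7, 16, −11, 52 }.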

===Other instructions===
;<code>MONITOR</code>, <code>MWAIT</code>
:The <code>MONITOR</code> instruction is used to specify a memory address for monitoring, while the <code>MWAIT</code> instruction puts the processor into a low-power state and waits for a write event to the monitored address.<ref name=":2" />

==References==
{{reflist}}

==External links==
*[https://web.archive.org/web/20060531094837/http://www.xbitlabs.com/articles/cpu/display/prescott_10.html X-bit Labs]

{{Multimedia extensions}}

{{DEFAULTSORT:Sse3}}
[[Category:X86 instructions]]
[[Category:SIMD computing]]