{{Short description|Computer chip instruction set extension}}
{{More citations needed|date=June 2014}}

In [[computing]], '''Streaming SIMD Extensions''' ('''SSE''') is a single instruction, multiple data ([[SIMD]]) [[instruction set]] extension to the [[x86]] architecture, designed by [[Intel]] and introduced in 1999 in its [[Pentium III]] series of [[central processing unit]]s (CPUs) shortly after the appearance of [[Advanced Micro Devices]]' (AMD's) [[3DNow!]]. SSE contains 70 new instructions (65 unique mnemonics<ref>{{cite web
| url=https://cdrdv2.intel.com/v1/dl/getContent/671200
| title=Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture
| date=April 2022
| publisher=Intel
| pages=((5{{hyphen}}16{{ndash}}5{{hyphen}}19))
| access-date=May 16, 2022
| archive-date=April 25, 2022
| archive-url=https://web.archive.org/web/20220425144301/https://cdrdv2.intel.com/v1/dl/getContent/671200
| url-status=live
}}</ref> using 70 encodings), most of which work on [[single precision]] [[floating-point]] data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are [[digital signal processing]] and [[graphics processing]].

Intel's first [[IA-32]] SIMD effort was the [[MMX (instruction set)|MMX]] instruction set. MMX had two main problems: it re-used the existing [[x87]] floating-point registers, making the CPUs unable to work on both floating-point and SIMD data at the same time, and it only worked on [[integers]]. SSE floating-point instructions operate on a new independent register set, the XMM registers, and SSE also adds a few integer instructions that work on MMX registers.

SSE was subsequently expanded by Intel to [[SSE2]], [[SSE3]], [[SSSE3]] and [[SSE4]]. Because it supported floating-point math, SSE had wider applications than MMX and became more popular. The addition of integer support in SSE2 made MMX largely redundant, though further performance increases can be attained in some situations{{when|date=November 2017}} by using MMX in parallel with SSE operations.

SSE was originally called '''Katmai New Instructions''' ('''KNI'''), [[Katmai (microprocessor)|Katmai]] being the code name for the first Pentium III core revision. During the Katmai project Intel sought to distinguish it from its earlier product line, particularly its flagship [[Pentium II]]. It was later renamed '''Internet Streaming SIMD Extensions''' ('''ISSE'''<ref name="MPR=1999-03-08"/>), then SSE. 

AMD added a subset of SSE, consisting of 19 instructions and called new MMX instructions,<ref name="extman">{{cite web|url=https://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf|title=AMD Extensions to the 3DNow and MMX Instruction Sets Manual|publisher=[[Advanced Micro Devices, Inc.]]|date=March 2000|access-date=2024-04-18|archive-date=2008-05-17|archive-url=https://web.archive.org/web/20080517014932/http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf|url-status=dead}}</ref> shortly afterwards with the release of the original [[Athlon]] in August 1999. The subset is also known under several variants and combinations of the names SSE and MMX; see [[3DNow!#3DNow!_extensions|3DNow! extensions]]. AMD eventually added full support for SSE instructions, starting with its [[Athlon XP]] and [[Duron]] ([[Duron#Morgan (Model 7, 180 nm)|Morgan core]]) processors.

== Registers ==
SSE originally added eight new 128-bit registers known as <code>XMM0</code> through <code>XMM7</code>. The [[AMD64]] extensions from AMD added a further eight registers <code>XMM8</code> through <code>XMM15</code>, and this extension is duplicated in the [[Intel 64]] architecture. There is also a new 32-bit control/status register, <code>MXCSR</code>. The registers <code>XMM8</code> through <code>XMM15</code> are accessible only in 64-bit operating mode.
 
[[Image:XMM registers.svg|right|220px]]

SSE used only a single data type for XMM registers:
* four 32-bit [[single-precision]] floating-point numbers

[[SSE2]] would later expand the usage of the XMM registers to include:
* two 64-bit [[double-precision]] floating-point numbers or
* two 64-bit integers or
* four 32-bit integers or
* eight 16-bit short integers or
* sixteen 8-bit bytes or characters.

Because these 128-bit registers are additional machine states that the [[operating system]] must preserve across [[context switch|task switches]], they are disabled by default until the operating system explicitly enables them. This means that the OS must know how to use the <code>FXSAVE</code> and <code>FXRSTOR</code> instructions, an extended pair of instructions that can save and restore all [[x87]], MMX, and SSE register states at once. This support was quickly added to all major IA-32 operating systems.

The first CPU to support SSE, the [[Pentium III]], shared execution resources between SSE and the [[floating-point unit]] (FPU).<ref name="MPR=1999-03-08">{{cite journal|url=http://docencia.ac.upc.edu/ETSETB/SEGPAR/microprocessors/pentium3%20(mpr).pdf|author=Diefendorff, Keith|date=March 8, 1999|title=Pentium III = Pentium II + SSE: Internet SSE Architecture Boosts Multimedia Performance|journal=[[Microprocessor Report]]|volume=13|issue=3|access-date=September 1, 2017|archive-date=April 17, 2018|archive-url=https://web.archive.org/web/20180417203519/http://docencia.ac.upc.edu/ETSETB/SEGPAR/microprocessors/pentium3%20%28mpr%29.pdf|url-status=live}}</ref> While a [[compiled]] application can interleave FPU and SSE instructions side-by-side, the Pentium III will not issue an FPU and an SSE instruction in the same [[clock cycle]]. This limitation reduces the effectiveness of [[instruction pipeline|pipelining]], but the separate XMM registers do allow SIMD and scalar floating-point operations to be mixed without the performance hit from explicit MMX/floating-point mode switching.

== SSE instructions ==
SSE introduced both [[Scalar (computing)|scalar]] and [[Data structure alignment|packed]] floating-point instructions.

=== Floating-point instructions ===
* Memory-to-register/register-to-memory/register-to-register data movement
** Scalar – <code>MOVSS</code>
** Packed – <code>MOVAPS, MOVUPS, MOVLPS, MOVHPS, MOVLHPS, MOVHLPS, MOVMSKPS</code>
* Arithmetic
** Scalar – <code>ADDSS, SUBSS, MULSS, DIVSS, RCPSS, SQRTSS, MAXSS, MINSS, RSQRTSS</code>
** Packed – <code>ADDPS, SUBPS, MULPS, DIVPS, RCPPS, SQRTPS, MAXPS, MINPS, RSQRTPS</code>
* [[Comparison (computer programming)|Compare]]
** Scalar – <code>CMPSS, COMISS, UCOMISS</code>
** Packed – <code>CMPPS</code>
* Data shuffle and unpacking
** Packed – <code>SHUFPS, UNPCKHPS, UNPCKLPS</code>
* [[Type conversion|Data-type conversion]]
** Scalar – <code>CVTSI2SS, CVTSS2SI, CVTTSS2SI</code>
** Packed – <code>CVTPI2PS, CVTPS2PI, CVTTPS2PI</code>
* [[Bitwise operation|Bitwise]] logical operations
** Packed – <code>ANDPS, ORPS, XORPS, ANDNPS</code>
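
The difference between the scalar and packed forms can be illustrated with C compiler intrinsics, each of which is typically compiled to one of the instructions listed above. The following is a sketch only, assuming a compiler that provides the <code>xmmintrin.h</code> header; the helper function names are hypothetical and chosen for illustration.

<syntaxhighlight lang="c">
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar form (ADDSS): only the lowest element is added;
   the upper three elements are copied unchanged from a. */
static void add_scalar(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Packed form (ADDPS): all four elements are added at once. */
static void add_packed(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Packed compare (CMPPS, less-than) followed by MOVMSKPS:
   bit i of the result is set when a[i] < b[i]. */
static int less_than_mask(const float a[4], const float b[4])
{
    return _mm_movemask_ps(_mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Sum of the four elements using SHUFPS and MOVHLPS. */
static float horizontal_sum(const float a[4])
{
    __m128 v       = _mm_loadu_ps(a);                                /* a0 a1 a2 a3 */
    __m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));  /* a1 a0 a3 a2 */
    __m128 pairs   = _mm_add_ps(v, swapped);     /* element 0 = a0+a1, element 2 = a2+a3 */
    __m128 high    = _mm_movehl_ps(pairs, pairs); /* element 0 = a2+a3 */
    return _mm_cvtss_f32(_mm_add_ss(pairs, high));
}
</syntaxhighlight>

With inputs <code>{1, 2, 3, 4}</code> and <code>{10, 20, 30, 40}</code>, <code>add_scalar</code> produces <code>{11, 2, 3, 4}</code>, whereas <code>add_packed</code> produces <code>{11, 22, 33, 44}</code>.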

=== Integer instructions ===
* Arithmetic
** <code>PMULHUW, PSADBW, PAVGB, PAVGW, PMAXUB, PMINUB, PMAXSW, PMINSW</code>
* Data movement
** <code>PEXTRW, PINSRW</code>
* Other
** <code>PMOVMSKB, PSHUFW</code>

=== Other instructions ===
* <code>MXCSR</code> management
** <code>LDMXCSR, STMXCSR</code>
* Cache and Memory management
** <code>MOVNTQ, MOVNTPS, MASKMOVQ, PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHNTA, SFENCE</code>
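
Several of these instructions are also reachable from C through compiler intrinsics. The following is a minimal sketch, assuming the <code>xmmintrin.h</code> header; the function names are hypothetical, and <code>stream_copy</code> assumes that the destination is 16-byte aligned and that the element count is a multiple of four.

<syntaxhighlight lang="c">
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Reads MXCSR (STMXCSR) and tests its flush-to-zero control bit. */
static int flush_to_zero_enabled(void)
{
    return (_mm_getcsr() & _MM_FLUSH_ZERO_MASK) != 0;
}

/* Rewrites MXCSR (LDMXCSR) with the flush-to-zero bit set or cleared. */
static void set_flush_to_zero(int enable)
{
    unsigned int csr = _mm_getcsr();
    if (enable)
        csr |= (unsigned int)_MM_FLUSH_ZERO_ON;
    else
        csr &= ~(unsigned int)_MM_FLUSH_ZERO_MASK;
    _mm_setcsr(csr);
}

/* Copies n floats using non-temporal stores (MOVNTPS).
   Assumption: dst is 16-byte aligned and n is a multiple of 4. */
static void stream_copy(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        if (i + 16 < n)  /* hint for data needed soon (PREFETCHNTA) */
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_NTA);
        _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));
    }
    _mm_sfence();  /* SFENCE: make the streamed stores globally visible */
}
</syntaxhighlight>

The prefetch distance of 16 elements is an arbitrary choice for illustration; suitable values depend on the processor and the access pattern.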

== Example ==
The following simple example demonstrates the advantage of using SSE. Consider an operation like vector addition, which is used very often in computer graphics applications. Adding two single-precision, four-component vectors with scalar x86 code requires four floating-point addition instructions.

<syntaxhighlight lang="c">
 vec_res.x = v1.x + v2.x;
 vec_res.y = v1.y + v2.y;
 vec_res.z = v1.z + v2.z;
 vec_res.w = v1.w + v2.w;
</syntaxhighlight>

This corresponds to four x86 <code>FADD</code> instructions in the object code. On the other hand, as the following assembly code shows, a single 128-bit 'packed-add' instruction can replace the four scalar addition instructions.
<syntaxhighlight lang="nasm">
 movaps xmm0, [v1]       ; xmm0 = v1.w | v1.z | v1.y | v1.x (v1 must be 16-byte aligned)
 addps  xmm0, [v2]       ; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
 movaps [vec_res], xmm0  ; vec_res = xmm0
</syntaxhighlight>
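
In C, the same operation is commonly written with compiler intrinsics rather than assembly. The following is a minimal sketch, assuming a compiler that provides the <code>xmmintrin.h</code> header; the function name <code>vec4_add</code> and the use of plain four-element arrays are illustrative choices, not part of SSE. It uses unaligned loads and stores (typically compiled to <code>MOVUPS</code>) so that the arrays need not be 16-byte aligned.

<syntaxhighlight lang="c">
#include <xmmintrin.h>  /* SSE intrinsics */

/* Adds two four-component single-precision vectors with one packed
   addition (ADDPS). Element 0 holds x, element 3 holds w. */
static void vec4_add(const float v1[4], const float v2[4], float vec_res[4])
{
    __m128 a = _mm_loadu_ps(v1);              /* a = v1.w | v1.z | v1.y | v1.x */
    __m128 b = _mm_loadu_ps(v2);              /* b = v2.w | v2.z | v2.y | v2.x */
    _mm_storeu_ps(vec_res, _mm_add_ps(a, b)); /* four sums stored at once */
}
</syntaxhighlight>

A call such as <code>vec4_add(v1, v2, vec_res)</code> leaves the four element-wise sums in <code>vec_res</code>.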

== Later versions ==
* [[SSE2]], Willamette New Instructions (WNI), introduced with the [[Pentium 4]], is a major enhancement to SSE. SSE2 adds two major features: [[double-precision]] (64-bit) floating-point for all SSE operations, and MMX integer operations on 128-bit XMM registers. In the original SSE instruction set, conversion to and from integers placed the integer data in the 64-bit MMX registers. SSE2 enables the programmer to perform SIMD math on any data type (from 8-bit integer to 64-bit float) entirely with the XMM vector-register file, without the need to use the legacy MMX or FPU registers. It offers an [[orthogonal instruction set|orthogonal set of instructions]] for dealing with common data types.
* [[SSE3]], also called Prescott New Instructions (PNI), is an incremental upgrade to SSE2, adding a handful of DSP-oriented mathematics instructions and some process (thread) management instructions. It also allowed addition or subtraction of two numbers that are stored in the same register, which was not possible in SSE2 and earlier. This capability, known as horizontal in Intel terminology, was the major addition to the SSE3 instruction set. AMD's [[3DNow!]] extension could do the latter too.
* [[SSSE3]], Merom New Instructions (MNI), is an upgrade to SSE3, adding 16 new instructions which include permuting the bytes in a word, multiplying 16-bit fixed-point numbers with correct rounding, and within-word accumulate instructions. SSSE3 is often mistaken for SSE4 as this term was used during the development of the Core [[microarchitecture]].
* [[SSE4]], Penryn New Instructions (PNI), is another major enhancement, adding a [[dot product]] instruction, additional integer instructions, a [[SSE4#POPCNT and LZCNT|<syntaxhighlight lang="asm" inline>popcnt</syntaxhighlight> instruction]] ([[Hamming weight|Population count]]: count number of bits set to 1, used extensively e.g. in [[cryptography]]), and more.
* [[XOP instruction set|XOP]], [[FMA instruction set|FMA4]] and [[CVT16 instruction set|CVT16]] are new iterations announced by [[AMD]] in August 2007<ref name=":0">{{cite web
| url=https://www.theregister.co.uk/2007/08/30/amd_sse5/
| title=AMD plots single thread boost with x86 extensions
| website=[[The Register]]
| first=Ashlee
| last=Vance
| author-link=Ashlee Vance
| date=August 30, 2007
| access-date=August 24, 2017
| archive-date=April 27, 2011
| archive-url=https://web.archive.org/web/20110427144442/http://www.theregister.co.uk/2007/08/30/amd_sse5/
| url-status=live
}}</ref><ref>{{cite web
| url=http://developer.amd.com/wordpress/media/2012/10/AMD64_128_Bit_SSE5_Instrs.pdf
| title=AMD64 Technology: 128-Bit SSE5 Instruction Set
| date=August 2007
| publisher=[[AMD]]
| access-date=August 24, 2017
| archive-date=August 25, 2017
| archive-url=https://web.archive.org/web/20170825103549/http://developer.amd.com/wordpress/media/2012/10/AMD64_128_Bit_SSE5_Instrs.pdf
| url-status=live
}}</ref> and revised in May 2009.<ref>{{cite web
| url=https://support.amd.com/TechDocs/43479.pdf
| title=AMD64 Technology AMD64 Architecture Programmer's Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions
| date=November 2009
| publisher=AMD
| access-date=August 24, 2017
| archive-date=January 31, 2017
| archive-url=https://web.archive.org/web/20170131212831/http://support.amd.com/TechDocs/43479.pdf
| url-status=live
}}</ref>
* [[Advanced Vector Extensions]] (AVX), Gesher New Instructions (GNI), is an advanced version of SSE announced by Intel featuring a widened data path from 128 bits to 256 bits and 3-operand instructions (up from 2). Intel released processors in early 2011 with AVX support.<ref>{{cite web
| last=Girkar
| first=Milind
| url=https://software.intel.com/en-us/isa-extensions/intel-avx
| title=Intel® Advanced Vector Extensions (Intel® AVX)
| publisher=[[Intel]]
| date=October 1, 2013
| access-date=August 24, 2017
| archive-date=August 25, 2017
| archive-url=https://web.archive.org/web/20170825102628/https://software.intel.com/en-us/isa-extensions/intel-avx
| url-status=live
}}</ref>
* [[Advanced Vector Extensions#Advanced Vector Extensions 2|AVX2]] is an expansion of the AVX instruction set.
* [[AVX-512]] (3.1 and 3.2) is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for the x86 instruction set architecture.

== Identifying ==
The following programs can be used to determine which, if any, versions of SSE are supported on a system:
* Intel Processor Identification Utility<ref>{{cite web
| url=https://www.intel.com/content/www/us/en/support/processors/000005651.html
| title=Download the Intel® Processor Identification Utility
| date=July 24, 2017
| publisher=Intel
| access-date=August 24, 2017
| archive-date=August 25, 2017
| archive-url=https://web.archive.org/web/20170825105030/https://www.intel.com/content/www/us/en/support/processors/000005651.html
| url-status=live
}}</ref>
* [[CPU-Z]] – CPU, motherboard, and memory identification utility.
* [[util-linux|lscpu]] – provided by the util-linux package in most Linux distributions.
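
Software can also query the processor directly with the <code>CPUID</code> instruction, whose leaf 1 reports the SSE feature flags. The following is a minimal sketch, assuming GCC or Clang, whose <code>cpuid.h</code> header supplies the <code>__get_cpuid</code> helper (other compilers expose <code>CPUID</code> differently). The <code>sse_support</code> structure and the function name are hypothetical. The flags describe what the processor advertises and do not by themselves show that the operating system has enabled the XMM registers.

<syntaxhighlight lang="c">
#include <cpuid.h>  /* GCC/Clang-specific CPUID helper (assumption) */

/* Hypothetical result structure: 1 if the feature is reported, else 0. */
typedef struct {
    int sse, sse2, sse3, ssse3, sse4_1, sse4_2;
} sse_support;

/* Fills *out from CPUID leaf 1; returns 0 if the leaf is unavailable. */
static int query_sse_support(sse_support *out)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    out->sse    = (edx >> 25) & 1;  /* EDX bit 25 */
    out->sse2   = (edx >> 26) & 1;  /* EDX bit 26 */
    out->sse3   = ecx & 1;          /* ECX bit 0  */
    out->ssse3  = (ecx >> 9) & 1;   /* ECX bit 9  */
    out->sse4_1 = (ecx >> 19) & 1;  /* ECX bit 19 */
    out->sse4_2 = (ecx >> 20) & 1;  /* ECX bit 20 */
    return 1;
}
</syntaxhighlight>

A caller would check the return value first and then inspect the individual fields, for example <code>s.sse4_2</code>, before selecting a code path.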

== See also ==
* [[AltiVec]] – equivalent on [[PowerPC]] architecture

== References ==
{{Reflist}}

== External links ==
* [https://software.intel.com/sites/landingpage/IntrinsicsGuide/ Intel Intrinsics Guide]

{{Multimedia extensions}}

{{Use mdy dates|date=October 2018}}

{{DEFAULTSORT:Streaming Simd Extensions}}
[[Category:SIMD computing]]
[[Category:X86 instructions]]