== Modern commercial implementations ==
The [[Intel]] [[Pentium 4]] was the first modern desktop processor to implement simultaneous multithreading, starting with the 3.06 GHz model released in 2002, and the feature has since been introduced into a number of Intel processors. Intel calls the functionality [[Hyper-Threading Technology]], and it provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement<ref>{{cite journal|last1=Marr|first1=Deborah|title=Hyper-Threading Technology Architecture and Microarchitecture|journal=Intel Technology Journal|date=February 14, 2002|volume=6|issue=1|page=4|doi=10.1535/itj|url=http://www.diku.dk/OLD/undervisning/2004f/303/Hyper-Thread.pdf|access-date=25 September 2015|archive-date=24 October 2016|archive-url=https://web.archive.org/web/20161024004724/http://www.diku.dk/OLD/undervisning/2004f/303/Hyper-Thread.pdf|url-status=dead}}</ref> compared with an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application-dependent; however, when running two programs that require the full attention of the processor, it can actually seem as though one or both of the programs slows down slightly when Hyper-Threading is turned on.<ref>{{cite web |title=CPU performance evaluation Pentium 4 2.8 and 3.0 |url=http://users.telenet.be/nicvroom/performanceP4.htm |access-date=2011-04-22 |archive-date=2021-02-24 |archive-url=https://web.archive.org/web/20210224131422/http://users.telenet.be/nicvroom/performanceP4.htm |url-status=dead }}</ref> This is due to the [[replay system]] of the Pentium 4 tying up valuable execution resources, increasing contention for resources such as bandwidth, caches, [[Translation Lookaside Buffer|TLBs]] and [[re-order buffer]] entries, and equalizing the processor resources between the two programs, which adds a varying amount of execution time. The Pentium 4 Prescott core gained a replay queue, which reduces the execution time needed by the replay system and was enough to completely overcome that performance hit.<ref>{{cite web|title=Replay: Unknown Features of the NetBurst Core. Page 15|url=http://www.xbitlabs.com/articles/cpu/display/replay_15.html#sect0|website=Replay: Unknown Features of the NetBurst Core.|publisher=xbitlabs.com|access-date=24 April 2011|url-status=dead|archive-url=https://web.archive.org/web/20110514180659/http://www.xbitlabs.com/articles/cpu/display/replay_15.html#sect0|archive-date=14 May 2011}}</ref>

The latest [[Imagination Technologies]] [[MIPS architecture]] designs include an SMT system known as "MIPS MT".<ref>{{cite web|title=MIPS MT ASE description|work=Imagination Technologies |url=https://www.imgtec.com/mips/architectures/multi-threading/}}</ref> MIPS MT provides for both heavyweight virtual processing elements and lighter-weight hardware microthreads. [[RMI Corporation|RMI]], a Cupertino-based startup, was the first MIPS vendor to provide a processor [[System-on-a-chip|SOC]] based on eight cores, each of which runs four threads. The threads can be run in fine-grained mode, where a different thread can be executed each cycle, and they can also be assigned priorities. [[Imagination Technologies]] MIPS CPUs have two SMT threads per core.

IBM's [[Blue Gene]]/Q has 4-way SMT.

The IBM [[POWER5]], announced in May 2004, comes as either a dual-core dual-chip module (DCM) or a quad-core or octa-core multi-chip module (MCM), with each core including a two-thread SMT engine.
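To the operating system, each SMT hardware thread appears as an additional logical processor. As a minimal sketch (not taken from any of the sources cited above), the following Python snippet groups logical CPUs by the physical core they share on a Linux system, assuming the kernel exposes the conventional <code>thread_siblings_list</code> sysfs files:

<syntaxhighlight lang="python">
# Minimal sketch: group logical CPUs by the physical core they share,
# using the Linux sysfs topology files. The path layout is assumed from
# the common sysfs ABI; newer kernels may expose core_cpus_list instead.
from pathlib import Path

def smt_siblings():
    groups = set()
    for f in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/topology/thread_siblings_list"):
        # Each file lists the logical CPUs sharing one physical core, e.g. "0,4" or "2-3".
        groups.add(f.read_text().strip())
    return sorted(groups)

if __name__ == "__main__":
    for group in smt_siblings():
        print("logical CPUs sharing a core:", group)
</syntaxhighlight>

On a two-thread SMT design such as a Hyper-Threading-enabled Pentium 4 or a POWER5 core, each group would contain two logical CPUs.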
[[IBM]]'s implementation is more sophisticated than the previous ones: it can assign a different priority to the various threads, it is more fine-grained, and the SMT engine can be turned on and off dynamically to better execute workloads where an SMT processor would not increase performance. This is IBM's second implementation of generally available hardware multithreading. In 2010, IBM released systems based on the POWER7 processor with eight cores, each having four Simultaneous Intelligent Threads. The threading mode switches between one, two, or four threads depending on the number of process threads scheduled at the time, optimizing the use of the core for minimum response time or maximum throughput. IBM [[POWER8]] has eight intelligent simultaneous threads per core (SMT8). [[IBM Z]], starting with the [[IBM z13 (microprocessor)|z13]] processor in 2015, has two threads per core (SMT-2).

Although many people reported that [[Sun Microsystems]]' UltraSPARC T1 (known as "Niagara" until its 14 November 2005 release) and the now defunct processor [[codename]]d "[[Rock processor|Rock]]" (originally announced in 2005, but after many delays cancelled in 2010) are implementations of [[SPARC]] focused almost entirely on exploiting SMT and CMP techniques, Niagara does not actually use SMT. Sun refers to these combined approaches as "CMT", and to the overall concept as "Throughput Computing". The Niagara has eight cores, but each core has only one pipeline, so it actually uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads share the issue window each cycle, the processor uses a round-robin policy to issue instructions from the next active thread each cycle, making it more similar to a [[barrel processor]]. Sun Microsystems' Rock processor is different: it has more complex cores that have more than one pipeline.

The [[Oracle Corporation]] SPARC T3 has eight fine-grained threads per core; the SPARC T4, SPARC T5, SPARC M5, M6 and M7 have eight fine-grained threads per core, of which two can be executed simultaneously. [[Fujitsu]] SPARC64 VI has coarse-grained Vertical Multithreading (VMT); SPARC64 VII and newer have 2-way SMT. Intel [[Itanium]] Montecito uses coarse-grained multithreading, while Tukwila and newer use 2-way SMT (with dual-domain multithreading). [[Intel]] [[Xeon Phi]] has 4-way SMT (with time-multiplexed multithreading) with hardware-based threads which cannot be disabled, unlike regular Hyper-Threading.<ref>{{cite web |first1=Michaela |last1=Barth |first2=Mikko |last2=Byckling |first3=Nevena |last3=Ilieva |first4=Sami |last4=Saarinen |first5=Michael |last5=Schliephake |editor-first=Volker |editor-last=Weinberg |title=Best Practice Guide Intel Xeon Phi v1.1 |date=18 February 2014 |publisher=Partnership for Advanced Computing in Europe |url=http://www.prace-ri.eu/best-practice-guide-intel-xeon-phi-html/ |access-date=22 November 2016 |archive-date=3 May 2017 |archive-url=https://web.archive.org/web/20170503073453/http://www.prace-ri.eu/best-practice-guide-intel-xeon-phi-html/ |url-status=dead }}</ref> The [[Intel Atom]], first released in 2008, is the first Intel product to feature 2-way SMT (marketed as Hyper-Threading) without supporting instruction reordering, speculative execution, or register renaming. Intel reintroduced Hyper-Threading with the [[Nehalem (microarchitecture)|Nehalem microarchitecture]], after its absence from the [[Core (microarchitecture)|Core microarchitecture]].
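The distinction drawn above between Niagara-style fine-grained multithreading and SMT can be made concrete with a toy issue model. The sketch below is illustrative only; the per-thread instruction counts and the issue width of two are assumptions, not parameters of any real processor:

<syntaxhighlight lang="python">
# Toy per-cycle issue model: fine-grained (barrel-like) selection vs. SMT issue.
def fine_grained(threads, cycles):
    """One thread issues per cycle, chosen round-robin among threads with work left."""
    n = len(threads)
    issued = [0] * n
    last = -1
    for _ in range(cycles):
        for step in range(1, n + 1):
            t = (last + step) % n
            if issued[t] < threads[t]:
                issued[t] += 1
                last = t
                break
    return issued

def smt(threads, cycles, width=2):
    """Up to `width` threads issue instructions in the same cycle."""
    issued = [0] * len(threads)
    for _ in range(cycles):
        ready = [t for t, total in enumerate(threads) if issued[t] < total]
        for t in ready[:width]:
            issued[t] += 1
    return issued

print(fine_grained([8, 3], cycles=8))  # [5, 3]: at most one instruction issues per cycle
print(smt([8, 3], cycles=8))           # [8, 3]: both threads can issue in the same cycle
</syntaxhighlight>

In the fine-grained case the single pipeline advances only one thread per cycle, whereas the SMT case lets both threads make progress in the same cycle when resources allow.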
In the AMD [[Bulldozer (microarchitecture)|Bulldozer microarchitecture]], the FlexFPU<!-- Don't put three different links next to each other, it's confusing for readers --> and shared L2 cache are multithreaded, but the integer cores in each module are single-threaded, so it is only a partial SMT implementation.<ref>{{cite web |title=AMD Bulldozer Family Module Multithreading |date=July 2013 |publisher=wccftech |url=http://cdn3.wccftech.com/wp-content/uploads/2013/07/AMD-Steamroller-vs-Bulldozer.jpg |access-date=2013-07-22 |archive-date=2013-10-17 |archive-url=https://web.archive.org/web/20131017014731/http://cdn3.wccftech.com/wp-content/uploads/2013/07/AMD-Steamroller-vs-Bulldozer.jpg |url-status=dead }}</ref><ref>{{cite web |first=Gareth |last=Halfacree |title=AMD unveils Flex FP |date=28 October 2010 |publisher=bit-tech |url=https://www.bit-tech.net/news/hardware/2010/10/28/amd-unveils-flex-fp/1}}</ref> The AMD [[Zen (microarchitecture)|Zen microarchitecture]] has 2-way SMT.

The [[VISC architecture]]<ref name="urlSoft Machines unveils VISC virtual chip architecture | bit-tech.net">{{cite web |url=https://bit-tech.net/news/tech/cpus/soft-machines-visc/1/ |title=Soft Machines unveils VISC virtual chip architecture |website=bit-tech.net}}</ref><ref>{{cite web |first=Ian |last=Cutress |title=Examining Soft Machines' Architecture: An Element of VISC to Improving IPC |date=12 February 2016 |publisher=AnandTech |url=http://www.anandtech.com/show/10025/examining-soft-machines-architecture-visc-ipc}}</ref><ref>{{cite web|title=Next Gen Processor Performance Revealed|work=VR World |date=February 4, 2016|url=https://vrworld.com/2016/02/04/next-gen-processor-performance-revealed/|archive-url=https://web.archive.org/web/20170113044935/https://vrworld.com/2016/02/04/next-gen-processor-performance-revealed/|archive-date=2017-01-13}}</ref><ref>{{cite web|title=Architectural Waves|year=2017|publisher=Soft Machines|url=http://www.softmachines.com/technology/|url-status=dead|archive-url=https://web.archive.org/web/20170329105223/http://www.softmachines.com/technology/|archive-date=2017-03-29}}</ref> uses a ''Virtual Software Layer'' (translation layer) to dispatch a single thread of instructions to the ''Global Front End'', which splits instructions into ''virtual hardware threadlets'' that are then dispatched to separate virtual cores. These virtual cores can then send them to the available resources on any of the physical cores. Multiple virtual cores can push threadlets into the reorder buffer of a single physical core, which can split partial instructions and data from multiple threadlets through the execution ports at the same time. Each virtual core keeps track of the position of the relative output. This form of multithreading can increase single-threaded performance by allowing a single thread to use all resources of the CPU. The allocation of resources is dynamic, with near-single-cycle latency (1–4 cycles depending on the change in allocation), according to individual application needs. If two virtual cores compete for resources, appropriate algorithms determine which resources are allocated where.
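As a rough illustration of why the Bulldozer module design described above counts as only partial SMT, the toy model below (with assumed workloads and a simplified one-operation-per-cycle shared FPU, not AMD's actual pipeline) shows integer work on the two cores of a module proceeding independently while floating-point work serializes on the shared FlexFPU:

<syntaxhighlight lang="python">
# Toy sketch of a Bulldozer-style module: two single-threaded integer cores
# sharing one FP unit. The workloads and the one-op-per-cycle FPU are
# illustrative assumptions, not AMD's actual microarchitecture.
def run_module(thread_a, thread_b, cycles):
    threads = [list(thread_a), list(thread_b)]
    done = [0, 0]
    for cycle in range(cycles):
        fpu_busy = False
        # Alternate which core gets first claim on the shared FPU each cycle.
        for t in ([0, 1] if cycle % 2 == 0 else [1, 0]):
            if done[t] >= len(threads[t]):
                continue
            op = threads[t][done[t]]
            if op == "int":
                done[t] += 1          # private integer core: never blocked by the other thread
            elif not fpu_busy:
                done[t] += 1          # shared FlexFPU: only one thread's FP op issues per cycle
                fpu_busy = True
    return done

int_heavy = ["int"] * 6
fp_heavy = ["fp"] * 6
print(run_module(int_heavy, int_heavy, cycles=6))  # [6, 6]: integer work never contends
print(run_module(fp_heavy, fp_heavy, cycles=6))    # [3, 3]: FP work serializes on the shared unit
</syntaxhighlight>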