==Architecture==
[[File:Schema Cell.png|thumb]]
While the Cell chip can have a number of different configurations, the basic configuration is a [[multi-core (computing)|multi-core]] chip composed of one "Power Processor Element" ("PPE") (sometimes called "Processing Element", or "PE") and multiple "Synergistic Processing Elements" ("SPE").<ref name="cellbriefing">{{Cite news |date=February 7, 2005 |title=Cell Microprocessor Briefing |url=http://pc.watch.impress.co.jp/docs/2005/0208/kaigai153.htm |publisher=IBM, Sony Computer Entertainment Inc., Toshiba Corp.}}</ref> The PPE and SPEs are linked together by an internal high-speed bus dubbed the "Element Interconnect Bus" ("EIB").

===Power Processor Element (PPE)===
{{Main|Power Processing Element}}
[[File:PPE (Cell).png|thumb|PPE]]
The ''PPE''<ref name="cc.gatech.edu">{{Cite web |last=Kim |first=Hyesoon |author-link=Hyesoon Kim |date=Spring 2011 |title=CS4803DGC Design and Programming of Game Console |url=https://faculty.cc.gatech.edu/~hyesoon/spr11/lec_cell.pdf}}</ref><ref>{{Cite book |last=Koranne |first=Sandeep |url=https://books.google.com/books?id=f9FxS-mdF8UC&pg=PA19 |title=Practical Computing on the Cell Broadband Engine |date=2009 |publisher=Springer Science+Business Media |isbn=9781441903082 |page=19}}</ref><ref>{{Cite web |last=Hofstee |first=H. Peter |date=2005 |title=All About the Cell Processor |url=http://www.research.ibm.com/people/a/ashwini/E3%202005%20Cell%20Blade%20reports/All_About_Cell_Cool_Chips_Final.pdf |url-status=dead |archive-url=https://web.archive.org/web/20110906154333/http://www.research.ibm.com/people/a/ashwini/E3%202005%20Cell%20Blade%20reports/All_About_Cell_Cool_Chips_Final.pdf |archive-date=September 6, 2011}}</ref> is a [[PowerPC]]-based, dual-issue, in-order, two-way [[Simultaneous multithreading|simultaneous-multithreaded]] [[CPU]] core with a 23-stage pipeline, acting as the controller for the eight SPEs, which handle most of the computational workload.
The PPE has limited out-of-order execution capabilities: it can perform loads out of order and has delayed execution pipelines. The PPE can run conventional operating systems due to its similarity to other 64-bit PowerPC processors, while the SPEs are designed for vectorized floating-point code execution. The PPE contains a 32 [[KiB]] level 1 instruction [[CPU cache|cache]], a 32 KiB level 1 data cache, and a 512 KiB level 2 cache. The cache line size is 128 bytes in all caches.<ref name="cbe-programming-handbok" />{{rp|pages=136–137,141}} Additionally, IBM has included an [[AltiVec]] (VMX) unit<ref name="seminar">{{Cite news |date=February 16, 2005 |title=Power Efficient Processor Design and the Cell Processor |url=http://www.cerc.utexas.edu/vlsi-seminar/spring05/slides/2005.02.16.hph.pdf |url-status=dead |archive-url=https://web.archive.org/web/20050426183838/http://www.cerc.utexas.edu/vlsi-seminar/spring05/slides/2005.02.16.hph.pdf |archive-date=April 26, 2005 |access-date=June 12, 2005 |publisher=IBM}}</ref> which is fully pipelined for [[single precision]] floating point (AltiVec 1 does not support [[double precision]] floating-point vectors), a 32-bit [[Arithmetic logic unit|Fixed-Point Unit (FXU)]] with a 64-bit register file per thread, a [[Load–store unit|Load and Store Unit (LSU)]], a 64-bit [[Floating-point unit|Floating-Point Unit (FPU)]], a [[Branch predictor|Branch Unit (BRU)]], and a Branch Execution Unit (BXU).<ref name="cc.gatech.edu" /> The PPE consists of three main units: the Instruction Unit (IU), the Execution Unit (XU), and the Vector/Scalar Execution Unit (VSU). The IU contains the L1 instruction cache, branch prediction hardware, instruction buffers, and dependency-checking logic. The XU contains the integer execution units (FXU) and the load-store unit (LSU). The VSU contains all of the execution resources for the FPU and VMX.
Each PPE can complete two double-precision operations per clock cycle using a scalar fused multiply-add instruction, which translates to 6.4 [[GFLOPS]] at 3.2 GHz, or eight single-precision operations per clock cycle with a vector fused multiply-add instruction, which translates to 25.6 GFLOPS at 3.2 GHz.<ref name="pacellperf">{{Cite web |last=Chen |first=Thomas |last2=Raghavan |first2=Ram |last3=Dale |first3=Jason |last4=Iwata |first4=Eiji |date=November 29, 2005 |title=Cell Broadband Engine Architecture and its first implementation |url=http://www.ibm.com/developerworks/power/library/pa-cellperf/ |url-status=dead |archive-url=https://web.archive.org/web/20121027092540/http://www.ibm.com/developerworks/power/library/pa-cellperf/ |archive-date=October 27, 2012 |access-date=September 9, 2012 |website=IBM developerWorks}}</ref><!-- use of KiB is intentional, please do not modify -->

====Xenon in Xbox 360====
The PPE was designed specifically for the Cell processor, but during development [[Microsoft]] approached IBM wanting a high-performance processor core for its [[Xbox 360]]. IBM complied and made the tri-core [[Xenon (processor)|Xenon processor]], based on a slightly modified version of the PPE with added VMX128 extensions.<ref>{{Cite web |last=Alexander |first=Leigh |date=January 16, 2009 |title=Processing The Truth: An Interview With David Shippy |url=https://www.gamedeveloper.com/business/processing-the-truth-an-interview-with-david-shippy |website=[[Gamasutra]]}}</ref><ref>{{Cite news |last=Last |first=Jonathan V. |date=December 30, 2008 |title=Playing the Fool |url=https://www.wsj.com/articles/SB123069467545545011 |work=[[Wall Street Journal]]}}</ref>

===Synergistic Processing Element (SPE){{anchor|SPE}}===
{{hatnote|Not to be confused with Signal Processing Engine (SPE), an extension found on [[PowerPC e500]].}}
[[File:SPE (cell).png|thumb|SPE]]
Each SPE is a dual-issue, in-order processor composed of a "Synergistic Processing Unit",<ref>{{Cite book |url=https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/02E544E65760B0BF87257060006F8F20/$file/SPU_ABI-Specification_1.9.pdf |title=SPU Application Binary Interface Specification |date=July 18, 2008 |access-date=January 24, 2015 |archive-url=https://web.archive.org/web/20141118214923/https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/02E544E65760B0BF87257060006F8F20/$file/SPU_ABI-Specification_1.9.pdf |archive-date=November 18, 2014 |url-status=dead}}</ref> SPU, and a "Memory Flow Controller", MFC ([[Direct memory access|DMA]], [[Memory management unit|MMU]], and [[Bus (computing)|bus]] interface).
SPEs do not have any [[branch prediction]] hardware (hence there is a heavy burden on the compiler).<ref name="ibmresearch">{{Cite web |title=IBM Research - Cell |url=http://www.research.ibm.com/cell/ |url-status=dead |archive-url=https://web.archive.org/web/20050614003851/http://www.research.ibm.com/cell/ |archive-date=June 14, 2005 |access-date=June 11, 2005 |website=IBM}}</ref> Each SPE has six execution units divided among odd and even pipelines. The SPU runs a specially developed [[instruction set]] (ISA) with a [[128-bit]] [[SIMD]] organization<ref name="seminar" /><ref name="ibmrpaper" /><ref name="spearch">{{Cite web |date=August 15, 2005 |title=A novel SIMD architecture for the Cell heterogeneous chip-multiprocessor |url=http://www.hotchips.org/archives/hc17/2_Mon/HC17.S1/HC17.S1T1.pdf |url-status=dead |archive-url=https://web.archive.org/web/20080709051040/http://www.hotchips.org/archives/hc17/2_Mon/HC17.S1/HC17.S1T1.pdf |archive-date=July 9, 2008 |access-date=January 1, 2006 |publisher=Hot Chips 17 |df=mdy-all}}</ref> for single- and double-precision instructions. With the current generation of the Cell, each SPE contains 256 [[KiB]] of [[1T-SRAM|embedded SRAM]] for instructions and data, called [[Scratchpad memory|"Local Storage"]] (not to be confused with the "Local Memory" in Sony's documents, which refers to the VRAM), which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 [[GiB]] of local store memory. The local store does not operate like a conventional [[CPU cache]]: it is neither transparent to software nor does it contain hardware structures that predict which data to load. Each SPE contains a 128-bit, 128-entry [[register file]] and measures 14.5 mm<sup>2</sup> on a 90 nm process. An SPE can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers in a single clock cycle, as well as perform a memory operation.
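The per-cycle SIMD widths above imply the commonly quoted single-precision peak per SPE. A back-of-envelope check (this assumes one four-wide fused multiply-add retires per cycle, with each FMA counted as two floating-point operations per lane; the variable names are illustrative):

```python
# Back-of-envelope check of the 25.6 GFLOPS single-precision peak per SPE.
# Assumption: one 4-wide fused multiply-add per cycle, each FMA counted as
# two floating-point operations per lane.
clock_hz = 3_200_000_000   # 3.2 GHz
simd_lanes = 4             # four single-precision floats per 128-bit register
flops_per_fma = 2          # multiply + add

peak_flops = clock_hz * simd_lanes * flops_per_fma
print(peak_flops / 1e9)    # 25.6 (GFLOPS)
```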
Note that the SPU cannot directly access system memory; the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space. <!-- Far from perfect but trending toward accuracy. Could not find either the virtual address range limit or the physical address range limit. Note that a "system address" on the SPU is an address passed to the SPU DMA controller; the LS has only 2^14 addressable locations (256K/16B) ~~~~ -->

In one typical usage scenario, the system will load the SPEs with small programs (similar to [[thread (computing)|threads]]), chaining the SPEs together to handle each step in a complex operation. For instance, a [[set-top box]] might load programs for reading a DVD, decoding video and audio, and displaying the output, with the data passed from SPE to SPE until finally ending up on the TV. Another possibility is to partition the input data set and have several SPEs performing the same kind of operation in parallel. At 3.2 GHz, each SPE gives a theoretical 25.6 [[GFLOPS]] of single-precision performance.

Compared to its [[personal computer]] contemporaries, the relatively high overall floating-point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in CPUs like the [[Pentium 4]] and the [[Athlon 64]]. However, comparing only the floating-point abilities of a system is a one-dimensional and application-specific metric. Unlike a Cell processor, such desktop CPUs are more suited to the general-purpose software usually run on personal computers. In addition to executing multiple instructions per clock, processors from Intel and AMD feature [[branch predictor]]s. The Cell is designed to compensate for this with compiler assistance, in which prepare-to-branch instructions are created.
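The chained usage scenario above can be sketched abstractly. This toy model is plain Python, not Cell SDK code; the stage names and data are invented for illustration. Each generator "stage" stands in for a small program loaded onto one SPE, consuming the previous stage's output stream:

```python
# Toy model of SPE chaining: each "stage" stands in for a small program
# loaded onto one SPE, transforming a stream of data blocks and passing
# results downstream. Stage names and block format are invented.
def read_disc(n_blocks):
    for i in range(n_blocks):
        yield {"block": i, "decoded": False}    # stand-in for "read DVD"

def decode(stream):
    for blk in stream:
        blk["decoded"] = True                   # stand-in for A/V decoding
        yield blk

def display(stream):
    # stand-in for the final display stage
    return [f"frame {b['block']}" for b in stream if b["decoded"]]

frames = display(decode(read_disc(3)))
print(frames)   # ['frame 0', 'frame 1', 'frame 2']
```

Because each stage only pulls what the previous stage yields, the data "flows" through the chain one block at a time, loosely mirroring how blocks would be handed from SPE to SPE.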
For double-precision floating-point operations, as sometimes used in personal computers and often used in scientific computing, Cell performance drops by an order of magnitude but still reaches 20.8 GFLOPS (1.8 GFLOPS per SPE, 6.4 GFLOPS per PPE). The PowerXCell 8i variant, which was specifically designed for double precision, reaches 102.4 GFLOPS in double-precision calculations.<ref name="ppcnuxpowerxcell">{{Cite web |date=November 2007 |title=Cell successor with turbo mode - PowerXCell 8i |url=http://www.ppcnux.com/?q=node/7144 |url-status=dead |archive-url=https://web.archive.org/web/20090110230213/http://www.ppcnux.com/?q=node/7144 |archive-date=January 10, 2009 |access-date=June 10, 2008 |publisher=PPCNux}}</ref> Tests by IBM show that the SPEs can reach 98% of their theoretical peak performance running optimized parallel matrix multiplication.<ref name="pacellperf" />

[[Toshiba]] has developed a [[co-processor]] powered by four SPEs, but no PPE, called the [[SpursEngine]], designed to accelerate 3D and movie effects in consumer electronics. Each SPE has a local memory of 256 KB.<ref>{{Cite web |title=Supporting OpenMP on Cell |url=http://researcher.watson.ibm.com/researcher/files/us-zsura/iwomp07_cellOMP.pdf |url-status=dead |archive-url=https://web.archive.org/web/20190108125436/https://researcher.watson.ibm.com/researcher/files/us-zsura/iwomp07_cellOMP.pdf |archive-date=January 8, 2019 |website=[[Thomas J. Watson Research Center|IBM T. J Watson Research]]}}</ref> In total, the SPEs have 2 MB of local memory.

===Element Interconnect Bus (EIB)===
The EIB is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE coprocessors, and two off-chip I/O interfaces, for a total of 12 participants in the PS3 (the number of SPUs can vary in industrial applications). The EIB also includes an arbitration unit which functions as a set of traffic lights.
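The double-precision figures quoted earlier are internally consistent, as a quick arithmetic check shows (per-unit rates taken from the text; the 12.8 GFLOPS per-SPE figure for the PowerXCell 8i is implied by the quoted total rather than stated directly):

```python
# Arithmetic check of the quoted double-precision totals (rates from the text).
spe_dp_gflops = 1.8      # per SPE, original Cell, double precision
ppe_dp_gflops = 6.4      # PPE contribution
total = 8 * spe_dp_gflops + ppe_dp_gflops
print(round(total, 1))       # 20.8 (GFLOPS)

# PowerXCell 8i: 102.4 GFLOPS over eight SPEs implies 12.8 GFLOPS per SPE.
print(round(102.4 / 8, 1))   # 12.8
```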
In some documents, IBM refers to EIB participants as "units". The EIB is implemented as a circular ring consisting of four 16-byte-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate, the effective channel rate is 16 bytes every two system clocks. At maximum [[Concurrency (computer science)|concurrency]], with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96 bytes per clock (12 concurrent transactions × 16 bytes wide / 2 system clocks per transfer). While this figure is often quoted in IBM literature, it is unrealistic to simply scale this number by processor clock speed: the arbitration unit [[#Bandwidth assessment|imposes additional constraints]].

IBM Senior Engineer [[David Krolak]], the EIB's lead designer, explains the concurrency model:

{{blockquote|A ring can start a new op every three cycles. Each transfer always takes eight beats. That was one of the simplifications we made, it's optimized for streaming a lot of data. If you do small ops, it does not work quite as well. If you think of eight-car trains running around this track, as long as the trains aren't running into each other, they can coexist on the track.<ref name="Krolak">{{Cite web |date=2005-12-06 |title=Meet the experts: David Krolak on the Cell Broadband Engine EIB bus |url=http://www.ibm.com/developerworks/power/library/pa-expert9/ |access-date=2007-03-18 |publisher=IBM}}</ref>}}

Each participant on the EIB has one 16-byte read port and one 16-byte write port. The limit for a single participant is to read and write at a rate of 16 bytes per EIB clock (for simplicity often regarded as 8 bytes per system clock).
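The peak-instantaneous figure above follows directly from the channel parameters given in the text:

```python
# Peak instantaneous EIB bandwidth, from the parameters in the text.
rings = 4                    # four 16-byte-wide unidirectional channels
per_ring = 3                 # up to three concurrent transactions per ring
width_bytes = 16
clocks_per_transfer = 2      # the EIB runs at half the system clock

peak = rings * per_ring * width_bytes // clocks_per_transfer
print(peak)                  # 96 (bytes per system clock)

# Per-channel flow rate at a 3.2 GHz system clock:
system_clock_hz = 3_200_000_000
per_channel = width_bytes * system_clock_hz // clocks_per_transfer
print(per_channel)           # 25600000000 B/s, i.e. 25.6 GB/s
```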
Each SPU processor contains a dedicated [[Direct memory access|DMA]] management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model.

Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances are detrimental to the overall performance of the EIB as they reduce available concurrency. <!-- thinking about the Krolak interview, I have no justification for using the term hops, they could be opening the circuit end to end for the transaction; still, it seems more likely that it functions in hops and I do not feel like rewriting this passage right now; changed to steps and stepwise after seeing a comment by HappyVR using this term instead ~~~~ -->

Despite IBM's original desire to implement the EIB as a more powerful cross-bar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns where the EIB is able to function at high concurrency levels.
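The six-step routing rule described above is simply the shortest distance between two nodes on a 12-node ring; a minimal model (the participant numbering here is hypothetical, used only to illustrate the distance function):

```python
# Shortest path between two participants on the 12-participant EIB ring:
# data may travel in either direction, so no transfer needs more than 6 steps.
PARTICIPANTS = 12

def ring_steps(src, dst):
    d = (dst - src) % PARTICIPANTS
    return min(d, PARTICIPANTS - d)   # take the shorter way around

print(ring_steps(0, 6))                                     # 6, the worst case
print(ring_steps(0, 11))                                    # 1, shorter route the other way
print(max(ring_steps(0, d) for d in range(PARTICIPANTS)))   # 6
```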
David Krolak explained:

{{blockquote|Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is designed, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just was not enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.<ref name="Krolak" />}}

====Bandwidth assessment====
At 3.2 GHz, each channel flows at a rate of 25.6 GB/s. Viewing the EIB in isolation from the system elements it connects, achieving twelve concurrent transactions at this flow rate works out to an abstract EIB bandwidth of 307.2 GB/s. Based on this view, many IBM publications depict available EIB bandwidth as "greater than 300 GB/s". This number reflects the peak instantaneous EIB bandwidth scaled by processor frequency.<ref>{{Cite web |title=Cell Multiprocessor Communication Network: Built for Speed |url=http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf |url-status=dead |archive-url=https://web.archive.org/web/20070107202021/http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf |archive-date=January 7, 2007 |access-date=March 22, 2007 |publisher=IEEE}}</ref> However, other technical restrictions are involved in the arbitration mechanism for packets accepted onto the bus. The IBM Systems Performance group explained: {{blockquote|Each unit on the EIB can simultaneously send and receive 16 bytes of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle.
Since each snooped address request can potentially transfer up to 128 bytes, the theoretical peak data bandwidth on the EIB at 3.2 GHz is 128 bytes × 1.6 GHz {{=}} 204.8 GB/s.<ref name="pacellperf" />}}

This quote apparently represents the full extent of IBM's public disclosure of this mechanism and its impact. The EIB arbitration unit, the snooping mechanism, and interrupt generation on segment or page translation faults are not well described in the documentation set as yet made public by IBM.{{Citation needed|date=June 2009}}

In practice, effective EIB bandwidth can also be limited by the ring participants involved. While each of the nine processing cores can sustain 25.6 GB/s read and write concurrently, the memory interface controller (MIC) is tied to a pair of XDR memory channels permitting a maximum flow of 25.6 GB/s for reads and writes combined, and the two I/O controllers are documented as supporting a peak combined input speed of 25.6 GB/s and a peak combined output speed of 35 GB/s.

To add further to the confusion, some older publications cite EIB bandwidth assuming a 4 GHz system clock. This reference frame results in an instantaneous EIB bandwidth figure of 384 GB/s and an arbitration-limited bandwidth figure of 256 GB/s. All things considered, the theoretical 204.8 GB/s number most often cited is the best one to bear in mind. The ''IBM Systems Performance'' group has demonstrated SPU-centric data flows achieving 197 GB/s on a Cell processor running at 3.2 GHz, so this number is a fair reflection of practice as well.<ref name="pacellperf" />

===Memory and I/O controllers===
Cell contains a dual-channel [[Rambus]] XIO macro which interfaces to Rambus [[XDR DRAM|XDR memory]]. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin. Two 32-bit channels can provide a theoretical maximum of 25.6 GB/s. The I/O interface, also a Rambus design, is known as [[FlexIO]].
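The snoop-limited figure quoted above, and the alternative numbers from older publications assuming a 4 GHz clock, all fall out of the same two formulas (the function name is illustrative):

```python
# The two headline EIB bandwidth figures as functions of processor clock:
# the instantaneous peak (96 bytes per processor clock) and the snoop-
# limited peak (one 128-byte request per bus cycle, bus at half the clock).
def eib_limits_gbs(cpu_hz):
    instantaneous = 96 * cpu_hz / 1e9
    snoop_limited = 128 * (cpu_hz / 2) / 1e9
    return instantaneous, snoop_limited

print(eib_limits_gbs(3.2e9))   # (307.2, 204.8) -- the usual 3.2 GHz frame
print(eib_limits_gbs(4.0e9))   # (384.0, 256.0) -- older 4 GHz publications
```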
The FlexIO interface is organized into 12 lanes, each lane being a unidirectional, 8-bit-wide, point-to-point path. Five of these lanes are inbound to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. Four inbound and four outbound lanes support memory coherency.
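The quoted FlexIO totals are mutually consistent at 5.2 GB/s per byte-wide lane; reaching that rate from a 2.6 GHz clock implies two transfers per clock (double-data-rate signaling, which is an inference from the numbers rather than a figure stated above):

```python
# Consistency check of the FlexIO figures, working in integer MB/s.
clock_mhz = 2600             # quoted FlexIO clock
transfers_per_clock = 2      # double data rate (inferred, not stated)
lane_mb_s = clock_mhz * transfers_per_clock   # 5200 MB/s per 8-bit lane

outbound_mb_s = 7 * lane_mb_s    # 36400 MB/s = 36.4 GB/s
inbound_mb_s = 5 * lane_mb_s     # 26000 MB/s = 26.0 GB/s
print(outbound_mb_s + inbound_mb_s)   # 62400 MB/s = 62.4 GB/s total
```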