==Design==
The SSS project started after a [[Supercomputing Research Center]] (SRC) engineer, Ken Iobst, noticed a novel way to implement a parallel computer. Previous massively parallel SIMD designs, like the [[Connection Machine]]s, consisted of a large number of individual processing elements, each made up of a simple processor and some local memory. Results that needed to be passed from element to element travelled along networking links at relatively slow speeds. This was a serious bottleneck in most parallel designs, limiting their use to roles where these interdependencies could be reduced.

Iobst's idea was to use the super-fast scatter/gather hardware from the Cray-3 to move the data around instead of using a separate network. This would offer at least an order of magnitude better performance than systems based on "commodity" hardware. Better yet, the machine would still include a complete Cray-3 CPU, allowing the machine as a whole to use either SIMD or vector instructions depending on the particulars of the problem.

Now all that remained was the selection of a processor. Since the Cray-3 already had a [[vector processor]] for heavy computing, the SIMD processors themselves could be considerably simpler, handling only the most basic instructions. This is where the SSS concept was truly unique; since the problem with most SIMD machines was moving data around, Iobst suggested that the processors be built into the [[Static random access memory|SRAM]] chips themselves.

Memory is normally organized within the RAM chips in a row/column format, with a controller on the chip reading requested data from the chip in parallel across the rows, then assembling the results into 32- or [[64-bit]] words for processing by the [[Central processing unit|CPU]]. In the SSS concept, the chips would also be equipped with a series of single-bit computers operating on a particular column of all the rows at once. This meant that the processors could access data at very high speeds, about 100 times as fast as normal. Add to this the speed of the "network" implemented by the scatter/gather hardware, and the system could be scaled to sizes considerably greater than existing SIMD systems.
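
The in-memory arrangement described above can be illustrated with a minimal Python sketch of bit-serial SIMD addition. It is only a model under stated assumptions, not the actual SSS hardware or instruction set: it assumes one single-bit processor per memory column and operands stored one bit per row, least significant bit first. The names <code>bit_serial_add</code>, <code>store</code>, <code>load</code> and <code>demo</code> are hypothetical.

```python
def bit_serial_add(mem, a_base, b_base, out_base, width):
    """Add two width-bit operands held in every column of mem.

    Assumption: mem is a list of rows, each row a list of bits, and
    each column is served by its own single-bit processor.
    """
    n_cols = len(mem[0])
    carry = [0] * n_cols  # one carry bit kept per column processor
    for i in range(width):  # one memory row per step, LSB first
        row_a = mem[a_base + i]
        row_b = mem[b_base + i]
        row_out = mem[out_base + i]
        # In hardware every column would do this in the same step;
        # the loop below only simulates that.
        for c in range(n_cols):
            a, b = row_a[c], row_b[c]
            row_out[c] = a ^ b ^ carry[c]                # sum bit
            carry[c] = (a & b) | (carry[c] & (a ^ b))    # carry out
    return carry


def store(mem, base, width, values):
    """Write one integer per column, one bit per row, LSB first."""
    for c, v in enumerate(values):
        for i in range(width):
            mem[base + i][c] = (v >> i) & 1


def load(mem, base, width, n_cols):
    """Read one integer per column back out of the bit rows."""
    return [sum(mem[base + i][c] << i for i in range(width))
            for c in range(n_cols)]


def demo():
    width, n_cols = 8, 4
    mem = [[0] * n_cols for _ in range(3 * width)]
    store(mem, 0, width, [1, 2, 3, 200])
    store(mem, width, width, [10, 20, 30, 100])
    bit_serial_add(mem, 0, width, 2 * width, width)
    return load(mem, 2 * width, width, n_cols)
```

In this model an 8-bit addition takes eight row steps regardless of how many columns there are, which is why such a design gains throughput from width rather than from per-processor speed. The last column in <code>demo</code> overflows 8 bits and wraps around.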

Each processor could accept two commands every 200 nanoseconds, for an effective cycle time of 100&nbsp;ns (10&nbsp;MHz). A fully equipped system with 1,024,000 processors would have an aggregate processing capability of 32&nbsp;TFlops.<ref>Ken Iobst et al., [http://portal.acm.org/citation.cfm?id=620191 "Processing in Memory: The Terasys Massively Parallel PIM Array"], ''Computer'', Volume 28, Issue 4 (April 1995)</ref>
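
The timing figures above can be checked with a short calculation. The sketch below uses only the numbers given in the text; the constant and function names are hypothetical. It counts single-bit commands, not floating-point operations, and does not attempt to derive the aggregate figure quoted above.

```python
COMMANDS_PER_WINDOW = 2      # commands accepted per window (from the text)
WINDOW_NS = 200              # window length in nanoseconds (from the text)
PROCESSORS = 1_024_000       # processors in a fully equipped system


def effective_cycle_ns():
    """Time per command in nanoseconds."""
    return WINDOW_NS // COMMANDS_PER_WINDOW


def command_rate_hz():
    """Commands per second accepted by one processor."""
    return 1_000_000_000 // effective_cycle_ns()


def aggregate_commands_per_second():
    """Single-bit commands per second across all processors."""
    return PROCESSORS * command_rate_hz()
```

Under these figures each processor accepts 10 million commands per second, and the full array accepts about 1.02 × 10<sup>13</sup> single-bit commands per second.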