==Background==
{{Main|Vector processor#Supercomputers}}
Typical scientific workloads consist of reading in large data sets, transforming them in some way and then writing them back out again. Normally the transformations being applied are identical across all of the data points in the set. For instance, the program might add 5 to every number in a set of a million numbers. In simple computers the program would loop over all million numbers, adding five, thereby executing a million instructions saying <code>a = add b, c</code>. Internally the computer solves this instruction in several steps. First it reads the instruction from memory and decodes it, then it collects any additional information it needs, in this case the numbers b and c, and finally it runs the operation and stores the result. The end result is that the computer requires tens or hundreds of millions of cycles to carry out these operations.

===Vector machines===
In the STAR, new instructions essentially wrote the loops for the user. The user told the machine where in memory the list of numbers was stored, then fed in a single instruction <code>a(1..1000000) = addv b(1..1000000), c(1..1000000)</code>. At first glance the savings appear limited: the machine fetches and decodes only a single instruction instead of 1,000,000, saving 1,000,000 fetches and decodes, perhaps one-fourth of the overall time. The real savings are less obvious. Internally, the [[central processing unit|CPU]] of the computer is built up from a number of separate parts, each dedicated to a single task, for instance, adding a number or fetching from memory. Normally, as the instruction flows through the machine, only one part is active at any given time. This means that each sequential step of the entire process must complete before a result can be saved. The addition of an [[instruction pipeline]] changes this.
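The contrast between the per-element scalar loop and the single STAR-style vector instruction can be sketched as follows. This is an illustrative Python sketch only, not actual STAR code; the function names are invented for this example.

```python
# Illustrative sketch: Python stand-ins for the scalar loop and the
# STAR-style vector instruction described above (not actual STAR code).

def scalar_add(b, c):
    # Simple-computer style: one "a = add b, c" step per element, so an
    # instruction is fetched and decoded once for every data point.
    a = []
    for i in range(len(b)):
        a.append(b[i] + c[i])
    return a

def vector_add(b, c):
    # STAR-style "addv": a single instruction describes the whole
    # element-wise addition, and hardware streams the operands through.
    return [x + y for x, y in zip(b, c)]

print(scalar_add([1, 2, 3], [5, 5, 5]))  # [6, 7, 8]
print(vector_add([1, 2, 3], [5, 5, 5]))  # [6, 7, 8]
```

Both produce the same result; the difference lies in how many instructions the machine must fetch and decode to produce it.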
In such machines the CPU will "look ahead" and begin fetching succeeding instructions while the current instruction is still being processed. In this [[assembly line]] fashion any one instruction still takes just as long to complete, but as soon as it finishes executing, the next instruction is right behind it, with most of the steps required for its execution already completed. [[Vector processor]]s use this technique with one additional trick. Because the data layout is in a known format, a set of numbers arranged sequentially in memory, the pipelines can be tuned to improve the performance of fetches. On receipt of a vector instruction, special hardware sets up the memory access for the arrays and feeds the data into the processor as fast as possible.

CDC's approach in the STAR used what is today known as a ''memory-memory architecture''. This referred to the way the machine gathered data: it set up its pipeline to read from and write to memory directly. This allowed the STAR to use vectors of any length, not limited by the length of registers, making it highly flexible. Unfortunately, the pipeline had to be very long in order to keep enough instructions in flight to make up for the slow memory. That meant the machine incurred a high cost when switching from processing vectors to performing operations on non-vector operands. Additionally, the low scalar performance of the machine meant that after the switch had taken place and the machine was running scalar instructions, the performance was quite poor{{citation needed|date=September 2012}}. The result was rather disappointing real-world performance, something that could, perhaps, have been forecast by [[Amdahl's law]].

===Cray's approach===
Cray studied the failure of the STAR and learned from it{{citation needed|date=November 2020}}. He decided that in addition to fast vector processing, his design would also require excellent all-around scalar performance.
That way, when the machine switched modes, it would still provide superior performance. Additionally, he noticed that the workloads could be dramatically improved in most cases through the use of [[processor register|registers]]. Just as earlier machines had ignored the fact that most operations were being applied to many data points, the STAR ignored the fact that those same data points would be repeatedly operated on. Whereas the STAR would read and process the same memory five times to apply five vector operations to a set of data, it would be much faster to read the data into the CPU's registers once and then apply the five operations.

However, there were limitations with this approach. Registers were significantly more expensive in terms of circuitry, so only a limited number could be provided. This implied that Cray's design would have less flexibility in terms of vector sizes. Instead of reading a vector of any size several times as in the STAR, the Cray-1 would have to read only a portion of the vector at a time, but it could then run several operations on that data prior to writing the results back to memory. Given typical workloads, Cray felt that the small cost incurred by breaking large sequential memory accesses into segments was well worth paying.

Since the typical vector operation would involve loading a small set of data into the vector registers and then running several operations on it, the vector system of the new design had its own separate pipeline. For instance, the multiplication and addition units were implemented as separate hardware, so the results of one could be internally pipelined into the next, the instruction decode having already been handled in the machine's main pipeline. Cray referred to this concept as ''chaining'', as it allowed programmers to "chain together" several instructions and extract higher performance.
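The register-based segmenting described above can be sketched as follows. This is an illustrative Python model, not Cray assembly; the segment length of 64 reflects the Cray-1's 64-element vector registers, and the function names are invented for this example.

```python
# Illustrative sketch: re-reading memory once per operation (STAR's
# memory-memory style) versus loading one register-sized segment and
# running several chained operations on it (Cray-1 style).

SEGMENT = 64  # elements held by one Cray-1 vector register

def star_style(data):
    # Memory-memory style: each vector operation makes a full pass
    # over the data in memory.
    data = [x * 2.0 for x in data]   # pass 1: multiply
    data = [x + 1.0 for x in data]   # pass 2: add
    return data

def cray_style(data):
    # Register style: load one 64-element segment, run both operations
    # on it while it sits in the register ("chaining"), then store it.
    out = []
    for start in range(0, len(data), SEGMENT):
        segment = data[start:start + SEGMENT]       # one load from memory
        segment = [x * 2.0 + 1.0 for x in segment]  # multiply chained into add
        out.extend(segment)                         # one store back
    return out

values = [float(i) for i in range(200)]
assert star_style(values) == cray_style(values)
```

Both approaches compute the same result, but the segmented version touches memory once per segment rather than once per operation, which is the trade-off the text describes.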
===Performance===
In 1978, a team from the [[Argonne National Laboratory]] tested a variety of typical workloads on a Cray-1 as part of a proposal to purchase one for their use, replacing their [[IBM System/370|IBM 370/195]]. They also planned to test the [[CDC STAR-100]] and [[Burroughs Scientific Computer]], but such tests, if they were performed, were not published. The tests were run on the Cray-1 at the [[National Center for Atmospheric Research]] (NCAR) in [[Boulder, Colorado]]; the only other Cray available at the time was the one at Los Alamos, and accessing that machine required [[Q clearance]].<ref name=eval>{{cite tech report |first1=Larry |last1=Rudsinski |first2=Gail |last2=Pieper |title=Evaluating Computer Performance on the Cray-1 |date=January 1979 |publisher=Argonne National Laboratory |url=https://inis.iaea.org/collection/NCLCollectionStore/_Public/11/523/11523500.pdf}}</ref>

The tests were reported in two ways. The first used the minimum conversion needed to get the program running without errors, making no attempt to take advantage of the Cray's vectorization. The second included a moderate set of updates to the code, often unrolling loops so they could be vectorized. Generally, the minimal conversions ran anywhere from roughly the same speed as the 370 to about twice its performance (mostly due to a larger exponent range on the Cray), while vectorization led to further increases of between 2.5 and 10 times. In one example program, which performed an internal [[fast Fourier transform]], performance improved from the IBM's 47 milliseconds to 3 milliseconds.<ref name=eval/>