===Vector machines===
In the STAR, new instructions essentially wrote the loops for the user. The user told the machine where in memory the lists of numbers were stored, then fed in a single instruction such as <code>a(1..1000000) = addv b(1..1000000), c(1..1000000)</code>. At first glance the savings appear limited: the machine fetches and decodes a single instruction instead of 1,000,000, saving nearly 1,000,000 fetches and decodes, perhaps one-fourth of the overall time.
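The difference in instruction traffic can be illustrated with a toy model. The following Python sketch is not STAR code; the function names and the counting scheme (one fetch and decode per add instruction) are assumptions made purely for illustration.

```python
def scalar_add(b, c):
    """Element-by-element loop: one add instruction is fetched and decoded per element."""
    a = [0] * len(b)
    fetches = 0
    for i in range(len(b)):
        fetches += 1              # fetch + decode of the add for element i
        a[i] = b[i] + c[i]
    return a, fetches


def vector_add(b, c):
    """One vector instruction: fetched and decoded once for the whole array."""
    fetches = 1                   # the single 'addv' covers every element
    a = [x + y for x, y in zip(b, c)]
    return a, fetches


if __name__ == "__main__":
    b = list(range(1000))
    c = list(range(1000, 2000))
    _, scalar_fetches = scalar_add(b, c)
    _, vector_fetches = vector_add(b, c)
    print(scalar_fetches, vector_fetches)
```

Both functions produce the same sums; only the number of instruction fetches differs, which is the saving described above.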

The real savings are not so obvious. Internally, the [[central processing unit|CPU]] of the computer is built up from a number of separate parts, each dedicated to a single task, for instance adding numbers or fetching from memory. Normally, as an instruction flows through the machine, only one part is active at any given time. This means that each sequential step of the entire process must complete before a result can be saved. The addition of an [[instruction pipeline]] changes this. In such machines the CPU will "look ahead" and begin fetching succeeding instructions while the current instruction is still being processed. In this [[assembly line]] fashion any one instruction still takes just as long to complete, but as soon as it finishes executing, the next instruction is right behind it, with most of the steps required for its execution already completed.
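The assembly-line effect can be sketched with an idealized cycle count. The model below assumes every stage takes exactly one cycle and that the pipeline never stalls; the function names and the four-stage example are illustrative assumptions, not a description of any particular machine.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction fills the pipeline; each later one finishes one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, unpipelined_cycles(n, 4), pipelined_cycles(n, 4))
```

Under these assumptions a single instruction takes four cycles either way, but 1,000 instructions take 4,000 cycles without pipelining and 1,003 with it: the latency of each instruction is unchanged while the throughput approaches one instruction per cycle.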

[[Vector processor]]s use this technique with one additional trick. Because the data layout is in a known format&nbsp;— a set of numbers arranged sequentially in memory&nbsp;— the pipelines can be tuned to improve the performance of fetches. On the receipt of a vector instruction, special hardware sets up the memory access for the arrays and stuffs the data into the processor as fast as possible.

CDC's approach in the STAR used what is today known as a ''memory-memory architecture'', a term describing how the machine gathered data: it set up its pipelines to read from and write to memory directly. This allowed the STAR to use vectors of any length, not limited by the size of registers, making it highly flexible. Unfortunately, the pipeline had to be very long in order to have enough instructions in flight to make up for the slow memory. That meant the machine incurred a high cost when switching from processing vectors to performing operations on non-vector operands. Additionally, the low scalar performance of the machine meant that once the switch had taken place and the machine was running scalar instructions, performance was quite poor.{{citation needed|date=September 2012}} The result was rather disappointing real-world performance, something that could, perhaps, have been forecast by [[Amdahl's law]].
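Amdahl's law gives the overall speedup as 1 / ((1 − ''p'') + ''p''/''s''), where ''p'' is the fraction of the work that benefits from an enhancement and ''s'' is the speedup of that fraction. The sketch below evaluates it for a few vectorizable fractions; the values chosen are hypothetical and are not measurements of the STAR.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work is accelerated by a factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


if __name__ == "__main__":
    # Hypothetical workloads: the vector unit is assumed to be 10x faster than scalar code.
    for p in (0.5, 0.9, 0.99):
        print(f"vector fraction {p:.2f}: overall speedup {amdahl_speedup(p, 10):.2f}")
```

Even with a vector unit assumed to be ten times faster, a program that is 90% vectorizable speeds up only by a factor of about 5.3, because the remaining scalar portion dominates the run time. This is why weak scalar performance can undermine a fast vector unit.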