==Comparison with register machines==
{{More citations needed section|date=May 2023}}

Stack machines are often compared to register machines, which hold values in an array of [[Processor register|registers]]. Register machines may store stack-like structures in this array, but a register machine has instructions which circumvent the stack interface. Register machines routinely outperform stack machines,<ref name="Shi-Gregg-Beatty-Ertl_2005"/> and stack machines have remained a niche player in hardware systems. But stack machines are often used in implementing [[virtual machine]]s because of their simplicity and ease of implementation.<ref name="Randall_2004"/>

=== Instructions ===
Stack machines have higher [[Instruction set architecture#Code density|code density]]. In contrast to common stack machine instructions, which can easily fit in 6 bits or less, register machines require two or three register-number fields per ALU instruction to select operands; the densest register machines average about 16 bits per instruction plus the operands. Register machines also use a wider offset field for load-store opcodes. A stack machine's compact code naturally fits more instructions in cache, and therefore could achieve better [[CPU cache|cache]] efficiency, reducing memory costs or permitting faster memory systems for a given cost. In addition, most stack-machine instructions are very simple, made from only one opcode field or one operand field. Thus, stack machines require very little electronic resources to decode each instruction.

A program has to execute more instructions when compiled to a stack machine than when compiled to a register machine or memory-to-memory machine. Every variable load or constant requires its own separate load instruction, instead of being bundled within the instruction which uses that value. The separated instructions may be simple and faster running, but the total instruction count is still higher.

Most register interpreters specify their registers by number. But a host machine's registers cannot be accessed as an indexed array, so a memory array is allotted for virtual registers. Therefore, the instructions of a register interpreter must use memory for passing generated data to the next instruction. This forces register interpreters to be much slower on microprocessors made with a fine process rule (i.e. faster transistors without improving circuit speeds, such as the Haswell x86). Such chips require several clocks for memory access, but only one clock for register access. In the case of a stack machine with a data forwarding circuit instead of a register file, stack interpreters can allot the host machine's registers for the top several operands of the stack instead of the host machine's memory.

In a stack machine, the operands used in the instructions are always at a known offset (set in the stack pointer), from a fixed location (the bottom of the stack, which in a hardware design might always be at memory location zero), saving precious in-[[Cache (computing)|cache]] or in-[[CPU]] storage from being used to store quite so many [[memory address]]es or index numbers. This may preserve such registers and cache for use in non-flow computation.{{citation needed|date=September 2020}}
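This difference can be sketched in C. The fragment below is a minimal illustration only (all names are invented, and the surrounding dispatch machinery is omitted): an interpreted register machine must keep its virtual registers in a memory array, while an interpreted stack machine can thread the top of stack through a local variable that the C compiler will normally assign to a host register.

<syntaxhighlight lang="c">
#include <stdint.h>

enum { NUM_VREGS = 16 };

/* Register-style ADD: r[dst] = r[src1] + r[src2].  The virtual registers
 * live in a memory array, so a single ADD makes two memory reads and one
 * memory write before the next opcode is even fetched. */
static void vreg_add(int32_t vregs[NUM_VREGS], int dst, int src1, int src2)
{
    vregs[dst] = vregs[src1] + vregs[src2];
}

/* Stack-style ADD: the dispatch loop holds the top of stack in the local
 * 'tos' (typically a host register); only popping the second operand
 * from the in-memory stack touches memory. */
static int32_t stack_add(int32_t tos, int32_t **sp)
{
    return tos + *(*sp)--;   /* pop second operand, fold it into the top */
}
</syntaxhighlight>

The point of the sketch is only that the stack interpreter's hottest value can stay out of memory, whereas the register interpreter's operands cannot.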
=== Temporary / local values ===
Some in the industry believe that stack machines execute more [[data cache]] cycles for temporary values and local variables than do register machines.<ref name="Hennessy-Patterson"/>

On stack machines, temporary values often get spilled into memory, whereas on machines with many registers these temporaries usually remain in registers. (However, these values often need to be spilled into "activation frames" at the end of a procedure's definition or basic block, or at the very least into a memory buffer during interrupt processing.) Values spilled to memory add more cache cycles. This spilling effect depends on the number of hidden registers used to buffer top-of-stack values, on the frequency of nested procedure calls, and on host computer interrupt processing rates.

On register machines using optimizing compilers, it is very common for the most-used local variables to remain in registers rather than in stack frame memory cells. This eliminates most data cache cycles for reading and writing those values. The development of "stack scheduling" for performing [[live-variable analysis]], and thus retaining key variables on the stack for extended periods, mitigates this concern.<ref name="Koopman_1994" /><ref name="Bailey_2000" /><ref name="Shannon-Bailey_2006" />

On the other hand, register machines must spill many of their registers to memory across nested procedure calls. The decision of which registers to spill, and when, is made statically at compile time rather than dynamically, based on the depth of calls at run time. This can lead to more data cache traffic than in an advanced stack machine implementation.

===Common subexpressions===
In register machines, a [[common subexpression]] (a subexpression which is used multiple times with the same result value) can be evaluated just once and its result saved in a fast register. The subsequent reuses have no time or code cost, just a register reference. This optimization speeds simple expressions (for example, loading variable X or pointer P) as well as less-common complex expressions.

With stack machines, in contrast, results can be stored in one of two ways. Firstly, results can be stored using a temporary variable in memory. Storing and subsequent retrievals cost additional instructions and additional data cache cycles. Doing this is only a win if the subexpression computation costs more in time than fetching from memory, which in most stack CPUs is almost always the case. It is never worthwhile for simple variables and pointer fetches, because those already have the same cost of one data cache cycle per access. It is only marginally worthwhile for expressions such as {{code|X+1}}. These simpler expressions make up the majority of redundant, optimizable expressions in programs written in languages other than [[concatenative programming language|concatenative languages]]. An optimizing compiler can only win on redundancies that the programmer could have avoided in the source code.{{cn|date=May 2023}}

The second way leaves a computed value on the data stack, duplicating it as needed. This uses operations to copy stack entries. The value must stay shallow enough in the stack for the CPU's available copy instructions to reach it. Hand-written stack code often uses this approach, and achieves speeds like general-purpose register machines.<ref name="Koopman_1989"/><ref name="LaForest_2007"/> Unfortunately, algorithms for optimal "stack scheduling" are not in wide use by programming languages.
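As a minimal runnable sketch of the second approach (the variable names are invented), the common subexpression {{code|X+1}} below is computed once and then duplicated on the data stack, so each reuse is a plain pop rather than a recomputation or a reload from a memory temporary:

<syntaxhighlight lang="c">
#include <stdio.h>

int main(void)
{
    int stack[8], *sp = stack;        /* empty ascending stack          */
    int x = 41;

    *sp++ = x;                        /* LOAD X                         */
    *sp++ = 1;                        /* PUSH 1                         */
    sp[-2] += sp[-1]; sp--;           /* ADD: X+1 is now on top         */
    *sp = sp[-1]; sp++;               /* DUP: second copy for the reuse */

    int first  = *--sp;               /* first use of X+1               */
    int second = *--sp;               /* second use of X+1              */
    printf("%d %d\n", first, second); /* prints "42 42"                 */
    return 0;
}
</syntaxhighlight>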
===Pipelining===
In modern machines, the time to fetch a variable from the data cache is often several times longer than the time needed for basic ALU operations. A program runs faster without stalls if its memory loads can be started several cycles before the instruction that needs that variable. Complex machines can do this with a deep pipeline and "out-of-order execution" that examines and runs many instructions at once. Register machines can even do this with much simpler "in-order" hardware, a shallow pipeline, and slightly smarter compilers. The load step becomes a separate instruction, and that instruction is statically scheduled much earlier in the code sequence. The compiler puts independent steps in between.

Scheduling memory accesses requires explicit, spare registers. It is not possible on stack machines without exposing some aspect of the micro-architecture to the programmer. For the expression {{code|A B -}}, B must be evaluated and pushed immediately prior to the minus step. Without stack permutation or hardware multithreading, relatively little useful code can be put in between while waiting for the load of B to finish. Stack machines can work around the memory delay by having a deep out-of-order execution pipeline covering many instructions at once; or, more likely, they can permute the stack so that they can work on other workloads while the load completes; or they can interlace the execution of different program threads, as in the Unisys A9 system.<ref name="Burroughs_1986"/> Today's increasingly parallel computational loads suggest, however, that this might not be the disadvantage it has been made out to be in the past.

Stack machines can omit the operand fetching stage of a register machine.<ref name="Koopman_1989"/> For example, in the [[Java Optimized Processor]] (JOP) microprocessor the top 2 operands of the stack directly enter a data forwarding circuit that is faster than the register file.<ref name="Jopdesign"/>
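A rough model of that idea in C (all names are invented; spill and fill of deeper stack entries and the real hardware's pipelining are omitted): the two topmost stack entries live in dedicated buffers that feed the ALU directly, so an add needs no separate operand-fetch stage, and only the refill of the second buffer reads the in-memory stack.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Top two stack entries held in dedicated buffers, JOP-style. */
typedef struct {
    int32_t  a, b;   /* a = top of stack, b = second entry; both feed the ALU */
    int32_t *sp;     /* the rest of the stack, in memory                      */
} Datapath;

static void dp_push(Datapath *d, int32_t v)
{
    *++d->sp = d->b; /* spill the old second entry into memory */
    d->b = d->a;
    d->a = v;
}

static void dp_add(Datapath *d)
{
    d->a += d->b;    /* ALU reads a and b directly: no operand fetch */
    d->b = *d->sp--; /* refill b from the in-memory stack            */
}
</syntaxhighlight>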
===Out-of-order execution===
The [[Tomasulo algorithm]] finds [[instruction-level parallelism]] by issuing instructions as their data becomes available. Conceptually, the addresses of positions in a stack are no different from the register indexes of a register file. This view permits the [[out-of-order execution]] of the Tomasulo algorithm to be used with stack machines.

Out-of-order execution in stack machines seems to reduce or avoid many theoretical and practical difficulties.<ref name="Chatterji-Ravindran"/> The cited research shows that such a stack machine can exploit instruction-level parallelism, and the resulting hardware must cache data for the instructions. Such machines effectively bypass most memory accesses to the stack. The result achieves throughput (instructions per [[Clock rate|clock]]) comparable to [[load–store architecture]] machines, with much higher code densities (because operand addresses are implicit).

One issue brought up in the research was that it takes about 1.88 stack-machine instructions to do the work of one instruction on a load-store architecture machine.{{cn|date=May 2023}} Competitive out-of-order stack machines therefore require about twice as many electronic resources to track instructions ("issue stations"). This might be compensated by savings in instruction cache and memory and instruction decoding circuits.

===Hides a faster register machine inside===
Some simple stack machines have a chip design which is fully customized all the way down to the level of individual registers. The top of stack address register and the N top of stack data buffers are built from separate individual register circuits, with separate adders and ad hoc connections.

However, most stack machines are built from larger circuit components where the N data buffers are stored together within a register file and share read/write buses. The decoded stack instructions are mapped into one or more sequential actions on that hidden register file. Loads and ALU ops act on a few topmost registers, and implicit spills and fills act on the bottommost registers. The decoder allows the instruction stream to be compact. But if the code stream instead had explicit register-select fields which directly manipulated the underlying register file, the compiler could make better use of all registers and the program would run faster.

[[Microcode|Microprogrammed]] stack machines are an example of this. The inner microcode engine is some kind of RISC-like register machine or a [[VLIW]]-like machine using multiple register files. When controlled directly by task-specific microcode, that engine gets much more work completed per cycle than when controlled indirectly by equivalent stack code for that same task.

The object code translators for the [[HP 3000]] and [[Tandem Computers|Tandem]] T/16 are another example.<ref name="Bergh-Keilman-Magenheimer-Miller_1987"/><ref name="Andrews-Sand_1992"/> They translated stack code sequences into equivalent sequences of RISC code. Minor 'local' optimizations removed much of the overhead of a stack architecture. Spare registers were used to factor out repeated address calculations. The translated code still retained plenty of emulation overhead from the mismatch between original and target machines. Despite that burden, the cycle efficiency of the translated code matched the cycle efficiency of the original stack code. And when the source code was recompiled directly to the register machine via optimizing compilers, the efficiency doubled. This shows that the stack architecture and its non-optimizing compilers were wasting over half of the power of the underlying hardware.

Register files are good tools for computing because they have high bandwidth and very low latency, compared to memory references via data caches. In a simple machine, the register file allows reading two independent registers and writing of a third, all in one ALU cycle with one-cycle or less latency. The corresponding data cache, by contrast, can start only one read or one write (not both) per cycle, and the read typically has a latency of two ALU cycles. That is one third of the throughput at twice the pipeline delay. In a complex machine like [[Athlon]] that completes two or more instructions per cycle, the register file allows reading of four or more independent registers and writing of two others, all in one ALU cycle with one-cycle latency. The corresponding dual-ported data cache, by contrast, can start only two reads or writes per cycle, with multiple cycles of latency. Again, that is one third of the throughput of registers. It is very expensive to build a cache with additional ports.
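The mapping described earlier in this section can be modelled in a few lines of C (a sketch only: the names are invented, and spills and fills of deeper entries to memory are omitted). Each stack opcode decodes into reads and writes of a hidden register file indexed by a top-of-stack counter:

<syntaxhighlight lang="c">
#include <stdint.h>

enum { NBUF = 8 };       /* N top-of-stack data buffers in the register file */

typedef struct {
    int32_t  regs[NBUF]; /* the hidden register file          */
    unsigned tos;        /* index of the current top-of-stack */
} HiddenFile;

/* PUSH decodes into one register-file write at the next index. */
static void op_push(HiddenFile *h, int32_t v)
{
    h->tos = (h->tos + 1) % NBUF;
    h->regs[h->tos] = v;
}

/* ADD decodes into two register-file reads and one write. */
static void op_add(HiddenFile *h)
{
    int32_t top = h->regs[h->tos];
    h->tos = (h->tos + NBUF - 1) % NBUF;
    h->regs[h->tos] += top;   /* result replaces the second operand */
}
</syntaxhighlight>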
Since a stack is a component of most software programs, even when the software used is not strictly a stack machine, a hardware stack machine might more closely mimic the inner workings of its programs. Processor registers have a high thermal cost, and a stack machine might claim higher energy efficiency.<ref name="GreenArrays_1"/>

=== Interrupts ===
Responding to an interrupt involves saving the registers to a stack, and then branching to the interrupt handler code. Often stack machines respond more quickly to interrupts, because most parameters are already on a stack and there is no need to push them there. Some register machines deal with this by having multiple register files that can be instantly swapped,<ref name="Intel_1980"/> but this increases costs and slows down the register file.

=== Interpreters ===
Interpreters for virtual stack machines are easier to build than interpreters for register machines; the logic for handling memory address modes is in just one place rather than repeated in many instructions. Stack machines also tend to have fewer variations of an opcode; one generalized opcode will handle both frequent cases and obscure corner cases of memory references or function call setup. (But code density is often improved by adding short and long forms for the same operation.)

Interpreters for virtual stack machines are often slower than interpreters for other styles of virtual machine.<ref name="Shi-Gregg-Beatty-Ertle_2"/> This slowdown is worst when running on host machines with deep execution pipelines, such as current x86 chips. In some interpreters, the interpreter must execute an N-way switch jump to decode the next opcode and branch to its steps for that particular opcode. Another method for selecting opcodes is [[threaded code]]. The host machine's prefetch mechanisms are unable to predict and fetch the target of that indexed or indirect jump. So the host machine's execution pipeline must restart each time the hosted interpreter decodes another virtual instruction. This happens more often for virtual stack machines than for other styles of virtual machine.<ref name="Davis-Beatty-Casey-Gregg-Waldron_2005"/>

One example is the [[Java (programming language)|Java]] programming language. Its canonical [[virtual machine]] is specified as an 8-bit stack machine. However, the [[Dalvik (software)|Dalvik]] virtual machine for Java used on [[Android (operating system)|Android]] [[smartphone]]s is a 16-bit virtual-register machine, a choice made for efficiency reasons. Arithmetic instructions directly fetch or store local variables via 4-bit (or larger) instruction fields.<ref name="Bornstein_2008"/> Similarly, version 5.0 of Lua replaced its virtual stack machine with a faster virtual register machine.<ref name="Lua5"/><ref name="Lua5_VM"/>

Since the Java virtual machine became popular, microprocessors have employed advanced [[branch predictor]]s for indirect jumps.<ref name="inria"/> This advance avoids most of the pipeline restarts from N-way jumps and eliminates much of the instruction count costs that affect stack interpreters.
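To make the dispatch cost concrete, here is a minimal switch-dispatched stack interpreter in C (the opcode names are invented). The switch compiles to the indexed indirect jump described above, and it is that jump, taken once per virtual instruction, that deep host pipelines mispredict:

<syntaxhighlight lang="c">
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_HALT };   /* a toy 8-bit opcode set */

static int32_t run(const uint8_t *code)
{
    int32_t stack[64], *sp = stack;
    for (;;) {
        switch (*code++) {               /* the N-way dispatch jump    */
        case OP_PUSH: *sp++ = *code++;   /* next byte is the immediate */
                      break;
        case OP_ADD:  sp[-2] += sp[-1];
                      sp--;
                      break;
        case OP_HALT: return sp[-1];
        }
    }
}
</syntaxhighlight>

For example, {{code|run((const uint8_t[]){OP_PUSH, 2, OP_PUSH, 40, OP_ADD, OP_HALT})}} returns 42. A threaded-code interpreter instead ends each opcode handler with its own indirect jump (using the non-standard computed-goto extension of GCC and Clang), giving the host branch predictor one history per handler rather than one for the whole loop.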