Stack machine
== Design ==

Most or all stack machine instructions assume that operands are taken from the stack, and results placed on the stack. The stack easily holds more than two inputs or more than one result, so a rich set of operations can be computed. In stack machine code (sometimes called [[p-code]]), instructions frequently have only an [[opcode]] commanding an operation, with no additional fields identifying a constant, register or memory cell, known as a '''zero address format'''.<ref name="Beard_1997"/> A computer that operates in such a way that the majority of its instructions do not include explicit addresses is said to use zero-address instructions.<ref>{{cite book |title=Computer Architecture and Organization |last=Hayes |first=John P. |isbn=0-07-027363-4 |date=1978 |publisher=McGraw-Hill International Book Company |page=164}}</ref> This greatly simplifies instruction decoding. Branches, load immediates, and load/store instructions require an argument field, but stack machines often arrange that the frequent cases of these still fit together with the opcode into a compact group of [[Bit|bits]].

The selection of operands from prior results is done implicitly by ordering the instructions. Some stack machine instruction sets are intended for interpretive execution of a virtual machine, rather than driving hardware directly.

Integer constant operands are pushed by {{code|Push}} or {{code|Load Immediate}} instructions. Memory is often accessed by separate {{code|Load}} or {{code|Store}} instructions containing a memory address or calculating the address from values on the stack. All practical stack machines have variants of the load–store opcodes for accessing [[local variable]]s and [[formal parameter]]s without explicit address calculations. This can be by offsets from the current top-of-stack address, or by offsets from a stable frame-base register.
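A zero-address format can be illustrated with a minimal sketch of an interpreter loop. The opcode values, the {{code|run}} helper, and the single-byte encoding here are invented for illustration, not taken from any real machine: only {{code|PUSH}} carries an argument field, while the arithmetic opcodes name no operands at all.

```python
# Hypothetical single-byte encoding for a tiny zero-address stack machine.
# The opcode values are made up for illustration; only PUSH has an
# argument field (one immediate byte following the opcode).
PUSH, ADD, MUL = 0x01, 0x02, 0x03

def run(bytecode):
    """Decode and execute a zero-address instruction stream."""
    stack, pc = [], 0
    while pc < len(bytecode):
        op = bytecode[pc]
        pc += 1
        if op == PUSH:                  # the only case with an operand byte
            stack.append(bytecode[pc])
            pc += 1
        elif op == ADD:                 # pure opcode: operands are implicit
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack[-1]

# (2 + 3) * 4 as a zero-address program.
result = run(bytes([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL]))
```

Because {{code|ADD}} and {{code|MUL}} carry no address fields, decoding reduces to a single comparison per opcode, which is the simplification the text describes.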
The [[instruction set]] carries out most ALU actions with postfix ([[reverse Polish notation]]) operations that work only on the expression stack, not on data registers or main memory cells. This can be very convenient for executing high-level languages, because most arithmetic expressions can be easily translated into postfix notation.

[[File:AST binary tree arith variables.svg|thumb|Binary syntax tree for the expression ''A''*(''B''-''C'') + (''D''+''E'')]]

For example, consider the expression ''A''*(''B''-''C'')+(''D''+''E''), written in reverse Polish notation as ''A'' ''B'' ''C'' - * ''D'' ''E'' + +. Compiling and running this on a simple imaginary stack machine would take the form:

 # stack contents (leftmost = top = most recent):
 push A    # A
 push B    # B A
 push C    # C B A
 subtract  # B-C A
 multiply  # A*(B-C)
 push D    # D A*(B-C)
 push E    # E D A*(B-C)
 add       # D+E A*(B-C)
 add       # A*(B-C)+(D+E)

The arithmetic operations 'subtract', 'multiply', and 'add' act on the two topmost operands of the stack. The computer takes both operands from the topmost (most recent) values of the stack and replaces those two values with the calculated difference, sum, or product. In other words, the instruction's operands are "popped" off the stack, and its result(s) are then "pushed" back onto the stack, ready for the next instruction.

Stack machines may have their expression stack and their [[call stack|call-return stack]] separated or as one integrated structure. If they are separated, the instructions of the stack machine can be [[instruction pipelining|pipelined]] with fewer interactions and less design complexity, so it will usually run faster.

Optimisation of compiled stack code is quite possible.
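The walkthrough above can be sketched as a small interpreter. The {{code|eval_rpn}} helper, its tuple-based instruction format, and the sample variable values are all hypothetical, chosen only to make the postfix evaluation concrete:

```python
def eval_rpn(program, env):
    """Evaluate postfix code on an expression stack.

    `program` is a list of ("push", name) or ("subtract"/"multiply"/"add",)
    tuples; `env` maps variable names to values. Both are illustrative
    conventions, not any particular machine's encoding.
    """
    stack = []
    for instr in program:
        if instr[0] == "push":
            stack.append(env[instr[1]])
        else:
            b, a = stack.pop(), stack.pop()   # two topmost operands
            if instr[0] == "subtract":
                stack.append(a - b)
            elif instr[0] == "multiply":
                stack.append(a * b)
            elif instr[0] == "add":
                stack.append(a + b)
    return stack[0]

# A B C - * D E + +  ==  A*(B-C) + (D+E)
code = [("push", "A"), ("push", "B"), ("push", "C"),
        ("subtract",), ("multiply",),
        ("push", "D"), ("push", "E"), ("add",), ("add",)]
env = {"A": 2, "B": 5, "C": 3, "D": 1, "E": 4}
value = eval_rpn(code, env)
```

Note that each operator pops its operands and pushes its result, so ordering the instructions is the only operand-selection mechanism needed.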
Back-end optimisation of compiler output has been demonstrated to significantly improve code,<ref name="Koopman_1994"/><ref name="Bailey_2000"/> and potentially performance, whilst global optimisation within the compiler itself achieves further gains.<ref name="Shannon-Bailey_2006"/>

=== Stack storage ===

Some stack machines have a stack of limited size, implemented as a register file which the ALU accesses by index. A large register file uses many transistors, so this method is only suitable for small systems. A few machines have both an expression stack in memory and a separate register stack; in this case, software, or an interrupt, may move data between them. Some machines have a stack of unlimited size, implemented as an array in RAM which is cached by some number of "top of stack" address registers to reduce memory accesses. Except for explicit "load from memory" instructions, the order of operand usage is identical with the order of the operands in the data stack, so excellent prefetching can be accomplished easily.

Consider {{code|X+1}}. It compiles to {{code|Load X}}; {{code|Load 1}}; {{code|Add}}. With a stack stored entirely in RAM, this does implicit writes and reads of the in-memory stack:

* Load X, push to memory
* Load 1, push to memory
* Pop 2 values from memory, add, and push result to memory

for a total of 5 data cache references.

The next step up from this is a stack machine or interpreter with a single top-of-stack register. The above code then does:

* Load X into empty TOS register (if hardware machine) '''or''' push TOS register to memory, load X into TOS register (if interpreter)
* Push TOS register to memory, load 1 into TOS register
* Pop left operand from memory, add to TOS register and leave it there

for a total of 3 data cache references, worst-case.
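These two tallies can be checked with a small counting sketch. The function names are hypothetical, and following the text's accounting, only traffic to the in-memory expression stack is counted; the read of the variable X itself is left out of both tallies.

```python
# Count in-memory stack references for X+1 (Load X; Load 1; Add)
# under the two schemes described above.

def pure_memory_stack():
    """Stack kept entirely in RAM: every push and pop touches memory."""
    refs, stack = 0, []
    stack.append("X"); refs += 1          # Load X, push to memory
    stack.append(1);   refs += 1          # Load 1, push to memory
    stack.pop(); stack.pop(); refs += 2   # pop both operands
    stack.append("X+1"); refs += 1        # push result to memory
    return refs

def one_tos_register():
    """Interpreter worst case: the TOS register already holds live data,
    so each load must first spill it to the in-memory stack."""
    refs, stack, tos = 0, [], "old"
    stack.append(tos); refs += 1; tos = "X"   # spill, load X into TOS
    stack.append(tos); refs += 1; tos = 1     # spill, load 1 into TOS
    left = stack.pop(); refs += 1             # pop left operand from memory
    tos = f"{left}+{tos}"                     # add; result stays in TOS
    return refs
```

Caching a single top-of-stack entry in a register thus cuts the stack traffic for this expression from 5 references to 3.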
Generally, interpreters don't track emptiness, because they don't have to: anything below the stack pointer is a non-empty value, and the TOS cache register is always kept hot. Typical Java interpreters do not buffer the top of stack this way, however, because the program and stack have a mix of short and wide data values. If the hardwired stack machine has two or more top-of-stack registers, or a register file, then all memory access is avoided in this example and there is only one data cache cycle.
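Extending the same counting sketch to the two-register case makes the final claim concrete. The function name and register names are hypothetical; here the only remaining data cache reference is the read of the variable X itself, since the expression stack never touches memory.

```python
def two_tos_registers(x):
    """X+1 on a machine with two hardware top-of-stack registers:
    no in-memory stack traffic at all for this expression."""
    stack_refs = 0          # traffic to the in-memory expression stack
    r0 = x                  # Load X -> first stack register (one read of X)
    r1 = 1                  # Load 1 -> second stack register (immediate)
    r0 = r0 + r1            # Add: operands and result stay in registers
    data_cache_refs = stack_refs + 1   # the single read of X
    return r0, data_cache_refs
```

With both operands held in registers, the 5 (pure memory stack) and 3 (single TOS register) stack references of the earlier schemes drop to zero, leaving only the unavoidable variable fetch.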