==Threading models==
Practically all executable threaded code uses one or another of these methods for invoking subroutines (each method is called a "threading model").

===Direct threading===
Addresses in the thread are the addresses of machine language. This form is simple, but may have overheads because the thread consists only of machine addresses, so all further parameters must be loaded indirectly from memory. Some Forth systems produce direct-threaded code. On many machines direct threading is faster than subroutine threading (see reference below).

For example, a [[stack machine]] might execute the sequence "push A, push B, add". That might be translated to the following thread and routines, where <code>ip</code> is initialized to the address labeled <code>thread</code> (i.e., the address where <code>&pushA</code> is stored).

<syntaxhighlight lang="c">
#define PUSH(x) (*sp++ = (x))
#define POP()   (*--sp)

start:
  ip = &thread  // ip points to &pushA (which points to the first instruction of pushA)
  jump *ip++    // send control to first instruction of pushA and advance ip to &pushB

thread:
  &pushA
  &pushB
  &add
  ...

pushA:
  PUSH(A)
  jump *ip++    // send control where ip says to (i.e. to pushB) and advance ip
pushB:
  PUSH(B)
  jump *ip++
add:
  result = POP() + POP()
  PUSH(result)
  jump *ip++
</syntaxhighlight>

Alternatively, operands may be included in the thread. This can remove some of the indirection needed above, but makes the thread larger:

<syntaxhighlight lang="c">
#define PUSH(x) (*sp++ = (x))
#define POP()   (*--sp)

start:
  ip = &thread
  jump *ip++

thread:
  &push
  &A    // address where A is stored, not literal A
  &push
  &B
  &add
  ...

push:
  variable_address = *ip++ // must move ip past operand address, since it is not a subroutine address
  PUSH(*variable_address)  // Read value from variable and push on stack
  jump *ip++
add:
  result = POP() + POP()
  PUSH(result)
  jump *ip++
</syntaxhighlight>

===Indirect threading===
Indirect threading uses pointers to locations that in turn point to machine code. The indirect pointer may be followed by operands, which are stored in the indirect "block" rather than being stored repeatedly in the thread. Thus, indirect code is often more compact than direct-threaded code. The indirection typically makes it slower, though usually still faster than bytecode interpreters. Where the handler operands include both values and types, the space savings over direct-threaded code may be significant. Older Forth systems typically produce indirect-threaded code.

For example, if the goal is to execute "push A, push B, add", the following might be used. Here, <code>ip</code> is initialized to the address <code>&thread</code>, each code fragment (<code>push</code>, <code>add</code>) is found by double-indirecting through <code>ip</code> and an indirect block, and any operands to the fragment are found in the indirect block following the fragment's address. This requires keeping the ''current'' subroutine in <code>ip</code>, unlike all previous examples where it contained the ''next'' subroutine to be called.

<syntaxhighlight lang="c">
start:
  ip = &thread     // points to '&i_pushA'
  jump *(*ip)      // follow pointers to 1st instruction of 'push', DO NOT advance ip yet

thread:
  &i_pushA
  &i_pushB
  &i_add
  ...

i_pushA:
  &push
  &A
i_pushB:
  &push
  &B
i_add:
  &add

push:
  *sp++ = *(*ip + 1) // look 1 past start of indirect block for operand address
  jump *(*++ip)      // advance ip in thread, jump through next indirect block to next subroutine
add:
  addend1 = *--sp
  addend2 = *--sp
  *sp++ = addend1 + addend2
  jump *(*++ip)
</syntaxhighlight>

===Subroutine threading===
So-called "subroutine-threaded code" (also "call-threaded code") consists of a series of machine-language "call" instructions (or addresses of functions to "call"), in contrast to direct threading's use of "jump". Early compilers for [[ALGOL]], [[Fortran]], and [[COBOL]], as well as some Forth systems, often produced subroutine-threaded code. The code in many of these systems operated on a last-in-first-out (LIFO) stack of operands, for which compiler theory was well developed. Most modern processors have special hardware support for subroutine "call" and "return" instructions, so the overhead of one extra machine instruction per dispatch is somewhat diminished.

Anton Ertl, the [[Gforth]] compiler's co-creator, stated that "in contrast to popular myths, subroutine threading is usually slower than direct threading".<ref>{{cite web |url=http://www.complang.tuwien.ac.at/forth/threaded-code.html#what |title=What is Threaded Code? |first=Anton |last=Ertl}}</ref> However, Ertl's most recent tests<ref name="tuwien1"/> show that subroutine threading is faster than direct threading in 15 out of 25 test cases. More specifically, he found that direct threading is the fastest threading model on Xeon, Opteron, and Athlon processors; indirect threading is fastest on Pentium M processors; and subroutine threading is fastest on Pentium 4, Pentium III, and PPC processors.

As an example of call threading for "push A, push B, add":

<syntaxhighlight lang="c">
thread:
  call pushA
  call pushB
  call add
  ret

pushA:
  *sp++ = A
  ret
pushB:
  *sp++ = B
  ret
add:
  addend1 = *--sp
  addend2 = *--sp
  *sp++ = addend1 + addend2
  ret
</syntaxhighlight>

===Token threading===
Token-threaded code implements the thread as a list of indices into a table of operations; the index width is naturally chosen to be as small as possible for density and efficiency. One byte (8 bits) is the natural choice for ease of programming, but smaller sizes such as 4 bits, or larger ones such as 12 or 16 bits, can be used depending on the number of operations supported. As long as the index width is narrower than a machine pointer, token-threaded code will naturally be more compact than the other threading types without much special effort by the programmer. It is usually half to three-fourths the size of other threadings, which are themselves a quarter to an eighth the size of non-threaded code. The table's pointers can be either indirect or direct. Some Forth compilers produce token-threaded code. Some programmers consider the "[[p-code machine|p-code]]" generated by some [[Pascal (programming language)|Pascal]] compilers, as well as the [[bytecode]]s used by [[.NET Framework|.NET]], [[Java (programming language)|Java]], BASIC and some [[C (programming language)|C]] compilers, to be token threading.

A common approach, historically, is [[bytecode]], which typically uses 8-bit opcodes with a stack-based virtual machine.
The archetypal bytecode [[Interpreter (computing)|interpreter]] is known as a "decode and dispatch interpreter", and follows the form:

<syntaxhighlight lang="c">
start:
  vpc = &thread

dispatch:
  addr = decode(&vpc) // Convert the next bytecode operation to a pointer to machine code that implements it
  // Any inter-instruction operations are performed here (e.g. updating global state, event processing, etc.)
  jump addr

CODE_PTR decode(BYTE_CODE **p) {
  // In a more complex encoding, there may be multiple tables to choose between or control/mode flags
  return table[*(*p)++];
}

thread: /* Contains bytecode, not machine addresses. Hence it is more compact. */
  1 /*pushA*/
  2 /*pushB*/
  0 /*add*/

table:
  &add   /* table[0] = address of machine code that implements bytecode 0 */
  &pushA /* table[1] ... */
  &pushB /* table[2] ... */

pushA:
  *sp++ = A
  jump dispatch
pushB:
  *sp++ = B
  jump dispatch
add:
  addend1 = *--sp
  addend2 = *--sp
  *sp++ = addend1 + addend2
  jump dispatch
</syntaxhighlight>

If the virtual machine uses only byte-size instructions, <code>decode()</code> is simply a fetch from <code>thread</code>, but often there are commonly used 1-byte instructions plus some less common multibyte instructions (see [[complex instruction set computer]]), in which case <code>decode()</code> is more complex. Decoding single-byte opcodes can be handled very simply and efficiently by a branch table, using the opcode directly as an index.

For instructions where the individual operations are simple, such as "push" and "add", the [[Computational overhead|overhead]] involved in deciding what to execute is larger than the cost of actually executing it, so such interpreters are often much slower than machine code. However, for more complex ("compound") instructions, the overhead percentage is proportionally less significant.

Token-threaded code can sometimes run faster than the equivalent machine code when that machine code is too large to fit in the physical CPU's L1 instruction cache. The higher [[code density]] of threaded code, especially token-threaded code, can allow it to fit entirely in the L1 cache when it otherwise would not, thereby avoiding cache thrashing. However, threaded code consumes both instruction cache (for the implementation of each operation) and data cache (for the bytecode and tables), unlike machine code, which consumes only instruction cache; this means threaded code will eat into the budget for the amount of data that can be held for processing by the CPU at any given time. In any case, if the problem being computed involves applying a large number of operations to a small amount of data, then using threaded code may be an ideal optimization.<ref name="heller"/>

===Huffman threading===
Huffman-threaded code consists of lists of tokens stored as [[Huffman code]]s. A Huffman code is a variable-length string of bits that identifies a unique token. A Huffman-threaded interpreter locates subroutines using an index table or a tree of pointers that can be navigated by the Huffman code. Huffman-threaded code is one of the most compact representations known for a computer program. The index and codes are chosen by measuring the frequency of calls to each subroutine in the code. Frequent calls are given the shortest codes. Operations with approximately equal frequencies are given codes with nearly equal bit-lengths.
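To make the tree-of-pointers dispatch concrete, here is a minimal sketch in C. It is illustrative only, not drawn from any cited system: the names (<code>HuffNode</code>, <code>next_bit</code>, <code>exec_next</code>) and the MSB-first bit-packing are assumptions. Internal tree nodes consume one bit of the thread per step, and leaves hold handler addresses.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stddef.h>

typedef void (*handler_t)(void);

/* A node of the Huffman code tree: internal nodes have two children,
   leaves carry the handler for one operation. (Hypothetical layout.) */
typedef struct HuffNode {
    const struct HuffNode *child[2]; /* non-NULL only at internal nodes */
    handler_t handler;               /* non-NULL only at leaves */
} HuffNode;

static const uint8_t *thread_bits; /* the thread: Huffman codes packed MSB-first */
static size_t bit_pos;             /* index of the next bit to consume */

/* Fetch the next bit of the thread. */
static int next_bit(void) {
    int bit = (thread_bits[bit_pos >> 3] >> (7 - (bit_pos & 7))) & 1;
    bit_pos++;
    return bit;
}

/* Decode one token and execute it: walk from the root, one bit per step,
   until a leaf is reached. Frequent operations sit near the root, so
   they are decoded from the fewest bits. */
static void exec_next(const HuffNode *root) {
    const HuffNode *n = root;
    while (n->handler == NULL)
        n = n->child[next_bit()];
    n->handler();
}
</syntaxhighlight>

If measured frequencies made <code>add</code> the most common operation, the codes might be <code>add</code> = <code>0</code>, <code>pushA</code> = <code>10</code>, <code>pushB</code> = <code>11</code>, so the running example "push A, push B, add" would pack into just five bits (<code>10 11 0</code>).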
Most Huffman-threaded systems have been implemented as direct-threaded Forth systems, and used to pack large amounts of slow-running code into small, cheap [[microcontroller]]s. Most published<ref name=huffman>{{cite book |last1=Latendresse |first1=Mario |last2=Feeley |first2=Marc |title=Generation of Fast Interpreters for Huffman-Compressed Bytecode |citeseerx=10.1.1.156.2546 |publisher=Elsevier}}</ref> uses have been in smart cards, toys, calculators, and watches. The bit-oriented tokenized code used in [[PBASIC]] can be seen as a kind of Huffman-threaded code.

==={{anchor|String threading}}Lesser-used threading===
An example is string threading, in which operations are identified by strings, usually looked up via a hash table. This was used in Charles H. Moore's earliest Forth implementations and in the [[University of Illinois at Urbana–Champaign|University of Illinois]]'s experimental hardware-interpreted computer language. It is also used in [[Bashforth]].

===RPL===
[[Hewlett-Packard|HP]]'s [[RPL (programming language)|RPL]], first introduced in the [[HP-18C]] calculator in 1986, is a type of proprietary hybrid (direct-threaded and indirect-threaded) ''threaded interpretive language'' (TIL)<ref name="Loelinger_1981"/> that, unlike other TILs, allows embedding of RPL "objects" into the "runstream", i.e. the stream of addresses through which the interpreter pointer advances. An RPL "object" can be thought of as a special data type whose in-memory structure contains an address to an "object prolog" at the start of the object, followed by data or executable code. The object prolog determines how the object's body should be executed or processed. Using the "RPL inner loop",<ref name="Busby_2018"/> which was invented and patented<ref name="Wickes_1986"/> by William C. Wickes in 1986 and published in 1988, execution proceeds as follows:<ref name="Wickes_1988"/>

# Dereference the IP (interpreter pointer) and store it in O (the current object pointer)
# Increment the IP by the length of one address pointer
# Dereference O and store its address in O_1 (this is the second level of indirection)
# Transfer control to the next pointer or embedded object by setting the PC (program counter) to O_1 plus one address pointer
# Go back to step 1

This can be represented more precisely as follows (a C sketch of the loop appears at the end of this section):

<pre>
    O  = [I]
    I  = I + Δ
    PC = [O] + Δ
</pre>

where O is the current object pointer, I is the interpreter pointer, Δ is the length of one address word, and the "[]" operator stands for "dereference".

When control is transferred to an object pointer or an embedded object, execution continues as follows:

<pre>
PROLOG -> PROLOG (The prolog address at the start of the prolog code points to itself)
    IF O + Δ =/= PC THEN GOTO INDIRECT (Test for direct execution)
        O = I - Δ (Correct O to point to start of embedded object)
        I = I + α (Correct I to point after embedded object, where α is the length of the object)
    INDIRECT (Rest of prolog)
</pre>

On HP's [[HP Saturn|Saturn]] microprocessors that use RPL, there is a third level of indirection, made possible by an architectural/programming trick which allows faster execution.<ref name="Busby_2018"/>
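As a rough illustration, the three-step inner loop above might be rendered in C as below. This is a hedged sketch, not HP's implementation: the names (<code>rpl_object</code>, <code>prolog_fn</code>, <code>rpl_inner_loop</code>) are hypothetical, a function call stands in for the hardware jump <code>PC = [O] + Δ</code>, and the prolog's test for direct execution of embedded objects is omitted.

<syntaxhighlight lang="c">
/* Hypothetical rendering of the RPL inner loop. Every object is assumed
   to begin with a pointer to its prolog, the code that knows how to run
   the object's body. */
typedef struct rpl_object rpl_object;
typedef void (*prolog_fn)(rpl_object *self);

struct rpl_object {
    prolog_fn prolog;  /* [O]: how to execute/process this object's body */
    /* object body (data or executable code) follows in memory */
};

static rpl_object **I; /* interpreter pointer: walks the runstream */
static rpl_object  *O; /* current object pointer */

/* Fetch, advance, dispatch. On the real machine the dispatch step is the
   jump "PC = [O] + Δ", entering the prolog one word past its self-pointer;
   a C function call stands in for that jump here. */
static void rpl_inner_loop(void) {
    for (;;) {
        O = *I;        /* O = [I]   */
        I = I + 1;     /* I = I + Δ */
        O->prolog(O);  /* PC = [O] + Δ (simplified to a call) */
    }
}
</syntaxhighlight>

Because I is advanced before the prolog runs, O still identifies the current object while I already points at the next runstream entry.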