===Token threading===
Token-threaded code implements the thread as a list of indices into a table of operations; the index width is naturally chosen to be as small as possible for density and efficiency. One byte (8 bits) is the natural choice for ease of programming, but smaller sizes such as 4 bits, or larger ones such as 12 or 16 bits, can be used depending on the number of operations supported. As long as the index width is narrower than a machine pointer, token threading will be more compact than the other threading types without much special effort by the programmer. It is usually half to three-fourths the size of other threadings, which are themselves a quarter to an eighth the size of non-threaded code. The table's pointers can be either indirect or direct.

Some Forth compilers produce token-threaded code. Some programmers consider the "[[p-code machine|p-code]]" generated by some [[Pascal (programming language)|Pascal]] compilers, as well as the [[bytecode]]s used by [[.NET Framework|.NET]], [[Java (programming language)|Java]], BASIC and some [[C (programming language)|C]] compilers, to be token threading.

Historically, a common approach is [[bytecode]], which typically uses 8-bit opcodes with a stack-based virtual machine. The archetypal bytecode [[Interpreter (computing)|interpreter]] is known as a "decode and dispatch interpreter" and follows the form:

<syntaxhighlight lang="c">
start:
    vpc = &thread
dispatch:
    addr = decode(&vpc) // Convert the next bytecode operation to a pointer to machine code that implements it
    // Any inter-instruction operations are performed here (e.g. updating global state, event processing, etc.)
    jump addr

CODE_PTR decode(BYTE_CODE **p) {
    // In a more complex encoding, there may be multiple tables to choose between or control/mode flags
    return table[*(*p)++];
}

thread:  /* Contains bytecode, not machine addresses. Hence it is more compact. */
    1 /*pushA*/
    2 /*pushB*/
    0 /*add*/

table:
    &add    /* table[0] = address of machine code that implements bytecode 0 */
    &pushA  /* table[1] ... */
    &pushB  /* table[2] ... */

pushA:
    *sp++ = A
    jump dispatch
pushB:
    *sp++ = B
    jump dispatch
add:
    addend1 = *--sp
    addend2 = *--sp
    *sp++ = addend1 + addend2
    jump dispatch
</syntaxhighlight>

If the virtual machine uses only byte-size instructions, <code>decode()</code> is simply a fetch from <code>thread</code>, but often there are commonly used 1-byte instructions plus some less common multibyte instructions (see [[complex instruction set computer]]), in which case <code>decode()</code> is more complex. The decoding of single-byte opcodes can be handled simply and efficiently by a branch table, using the opcode directly as an index.

For instructions where the individual operations are simple, such as "push" and "add", the [[Computational overhead|overhead]] involved in deciding what to execute is larger than the cost of actually executing it, so such interpreters are often much slower than machine code. However, for more complex ("compound") instructions, the overhead percentage is proportionally less significant.

Token-threaded code can sometimes run faster than the equivalent machine code when that machine code is too large to fit in the physical CPU's L1 instruction cache. The higher [[code density]] of threaded code, especially token-threaded code, can allow it to fit entirely in the L1 cache when it otherwise would not, thereby avoiding cache thrashing. However, threaded code consumes both instruction cache (for the implementation of each operation) and data cache (for the bytecode and tables), whereas machine code consumes only instruction cache; this means threaded code eats into the budget for the amount of data that the CPU can hold for processing at any given time.
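The pseudocode above can be made concrete in portable C by replacing the machine-code labels with ordinary functions and the address table with a function-pointer table. This is a sketch, not the article's canonical form: the names (<code>op_pushA</code>, <code>op_halt</code>), the operand values, and the extra halt token (needed so the dispatch loop can terminate) are all illustrative.

<syntaxhighlight lang="c">
#include <stdio.h>

typedef unsigned char byte_code;

static int stack[16], *sp = stack;   /* operand stack for the virtual machine */
static const int A = 2, B = 3;       /* illustrative operand values */

/* One C function per virtual operation; each returns 0 to halt, 1 to continue. */
static int op_add(void)   { int b = *--sp, a = *--sp; *sp++ = a + b; return 1; }
static int op_pushA(void) { *sp++ = A; return 1; }
static int op_pushB(void) { *sp++ = B; return 1; }
static int op_halt(void)  { return 0; }

/* table[token] = implementation of that token, mirroring "table:" above */
static int (*const table[])(void) = { op_add, op_pushA, op_pushB, op_halt };

int main(void) {
    /* The thread: one byte per operation, as in the article's example,
       plus a trailing halt token (3). */
    static const byte_code thread[] = { 1 /*pushA*/, 2 /*pushB*/, 0 /*add*/, 3 /*halt*/ };
    const byte_code *vpc = thread;

    while (table[*vpc++]())          /* decode and dispatch */
        ;
    printf("%d\n", sp[-1]);          /* top of stack: 2 + 3 = 5 */
    return 0;
}
</syntaxhighlight>

Note that each token costs one array index and one indirect call here; a real Forth-style system would instead jump directly between operation bodies, avoiding the call/return overhead.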
In any case, if the problem being computed involves applying a large number of operations to a small amount of data, then using threaded code may be an ideal optimization.<ref name="heller"/>
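The density claims above can be checked with quick arithmetic: with 1-byte tokens, a thread of ''n'' operations occupies ''n'' bytes, versus ''n'' times the pointer size for a direct- or indirect-threaded version. A sketch, assuming a hypothetical 1000-operation program; pointer size is platform-dependent (typically 8 bytes on 64-bit machines):

<syntaxhighlight lang="c">
#include <stdio.h>

int main(void) {
    const size_t n_ops = 1000;  /* hypothetical program of 1000 operations */
    size_t token_thread  = n_ops * sizeof(unsigned char);  /* 1-byte tokens */
    size_t direct_thread = n_ops * sizeof(void *);         /* pointer-wide entries */
    printf("token-threaded:  %zu bytes\n", token_thread);
    printf("direct-threaded: %zu bytes\n", direct_thread);
    return 0;
}
</syntaxhighlight>

On a 64-bit machine this gives 1000 bytes versus 8000 bytes, consistent with the token thread being a small fraction of the size of a pointer-wide thread (though the operation table and implementations still occupy additional memory in both cases).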