{{Short description|CPU optimization unit}}
{{Use American English|date = March 2019}}
Fetching the instruction [[opcode]]s from program [[Computer memory|memory]] well in advance is known as [[Instruction prefetch|prefetching]], and it is served by using a '''prefetch input queue''' (PIQ). The prefetched instructions are stored in a [[FIFO (computing and electronics)|queue]]. Fetching opcodes well in advance, prior to their need for execution, increases the overall efficiency of the [[Microprocessor|processor]] and boosts its speed: the processor no longer has to wait for the memory access operations for the subsequent instruction opcode to complete. This architecture was prominently used in the [[Intel 8086|Intel 8086 microprocessor]].

==Introduction==
[[Pipeline (computing)|Pipelining]] was brought to the forefront of computing architecture design during the 1960s due to the need for faster and more efficient computing. Pipelining is the broader concept, and most modern processors load their instructions some [[clock cycle]]s before they execute them. This is achieved by pre-loading [[machine code]] from memory into a ''prefetch input queue''.

The visible effects of prefetching described below apply only to [[von Neumann computer]]s (that is, not [[Harvard architecture]] computers) that can run [[self-modifying code]] and have some sort of [[instruction pipelining]]. Nearly all modern high-performance computers fulfill these three requirements.<ref>{{cite web|title=ARM Information Center|url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3839.html|work=ARM Technical Support Knowledge Articles}}</ref>

Usually, the prefetching behavior of the PIQ is invisible to the [[programming model]] of the CPU. However, there are some circumstances where the behavior of the PIQ is visible and needs to be taken into account by the programmer.

When an [[x86]] processor changes mode from [[real mode]] to [[protected mode]] and vice versa, the PIQ has to be flushed, or else the CPU will continue to decode [[machine code]] as if it had been written for the previous mode and may generate an invalid instruction [[Exception handling|exception]].
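The following [[NASM (computer program)|NASM]]-syntax sketch illustrates the common way this flush is performed when entering protected mode on 80386-class processors. It is a minimal illustration, not a complete mode switch: the GDT descriptor, the segment selector value and the label names are placeholders assumed for the example. Setting the PE bit alone does not discard the bytes already fetched in real mode; the far jump immediately afterwards reloads CS and flushes the prefetch queue.

<syntaxhighlight lang=nasm>
bits 16
switch_to_protected_mode:
    cli                        ; no interrupts while the descriptor tables are inconsistent
    lgdt [gdt_descriptor]      ; load a previously prepared GDT (placeholder label)
    mov  eax, cr0
    or   eax, 1                ; set the PE (protection enable) bit
    mov  cr0, eax
    ; the CPU is now in protected mode, but bytes fetched in real mode may
    ; still sit in the prefetch queue; a far jump reloads CS and flushes it
    jmp  dword 0x08:pm_entry   ; 0x08 = code-segment selector in the placeholder GDT

bits 32
pm_entry:
    ; 32-bit protected-mode execution continues here
</syntaxhighlight>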
When executing [[self-modifying code]], a change to the code immediately in front of the current location of execution might not change how the processor interprets it, as the code is already loaded into the PIQ. The processor simply executes the old copy already in the PIQ instead of the new, altered version in [[random access memory|RAM]] and/or [[CPU cache|cache]].

This behavior of the PIQ can be used to determine if code is being executed inside an [[emulator]] or directly on the hardware of a real CPU.{{fact|date=May 2013}} Most emulators will ''probably'' never simulate this behavior. If the PIQ size is zero (changes in the code ''always'' affect the state of the processor immediately), it can be deduced that either the code is being executed in an emulator or the processor invalidates the PIQ upon writes to addresses loaded in the PIQ.

==Performance evaluation based on queuing theory==
It was [[Agner Krarup Erlang|A. K. Erlang]] (1878–1929) who first conceived of a queue as a solution to congestion in telephone traffic. Different [[Queueing models|queueing model]]s have been proposed to approximately simulate real-time queuing systems so that they can be analyzed mathematically for different performance specifications.

Queuing models can be represented using [[Kendall's notation]]:
:A1/A2/A3/A4
where:
* A1 is the distribution of time between two arrivals
* A2 is the service time distribution
* A3 is the total number of servers
* A4 is the capacity of the system

# '''M/M/1 model''' (single queue, single server, [[Markov process|Markovian]]): In this model, elements of the queue are served on a first-come, first-served basis. The mean arrival and service rates are given; the actual rates vary randomly around these averages and hence have to be determined using a [[Cumulative distribution function|cumulative probability distribution function]].<ref>{{cite book|last=Hayes|first=John|title=Computer Architecture and Organization|year=1998|publisher=McGraw-Hill|edition=Second}}</ref>
# '''M/M/r model''': This model is a generalization of the basic M/M/1 model, with multiple servers operating in parallel. It can also model scenarios with impatient users who leave the queue immediately if they are not receiving service; such scenarios can be modeled using a [[Bernoulli process]] having only two states, success and failure. A familiar example of this model is an ordinary land-line telephone system.<ref>{{cite book|last=Feller|first=William|title=An Introduction to Probability Theory and Its Applications|year=1968|publisher=John Wiley and Sons|edition=Second}}</ref>
# '''M/G/1 model''' (Takács' finite input model): This model is used to analyze more advanced cases, in which the service time distribution is no longer a [[Markov process]]. It covers, for example, more than one failed machine being repaired by a single repairman, in which case the service time for any user increases.<ref>{{cite book|last=Papoulis|first=Athanasios|title=Probability, Random Variables and Stochastic Processes|year=2008|publisher=McGraw-Hill|pages=784–800|edition=Fourth|author2=S. Unnikrishna Pillai}}</ref>

In applications such as the prefetch input queue, the M/M/1 model is commonly used because only basic queue features are needed. In this mapping, the execution unit takes the role of the customer and the bus interface unit acts as the server.
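For orientation, the standard steady-state results for an M/M/1 queue can be stated in terms of the arrival rate <math>\lambda</math> (requests issued by the execution unit) and the service rate <math>\mu</math> (opcodes delivered by the bus interface unit), assuming <math>\lambda < \mu</math>:

:<math>\rho = \frac{\lambda}{\mu}, \qquad L = \frac{\rho}{1-\rho}, \qquad W = \frac{1}{\mu - \lambda},</math>

where <math>\rho</math> is the utilization of the server (here, the bus interface unit), <math>L</math> is the mean number of requests in the system and <math>W</math> is the mean time a request spends in the system; the two are related by [[Little's law]], <math>L = \lambda W</math>. These are generic queueing-theory results rather than processor-specific figures, but they give a first-order estimate of how heavily loaded the bus interface unit is.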
==Instruction queue==
The processor executes a program by fetching instructions from memory and executing them. Usually the processor's execution speed is much faster than its memory access speed. An instruction queue is used to prefetch the next instructions into a separate buffer while the processor is executing the current instruction. With a [[Instruction pipeline|four-stage pipeline]], the rate at which instructions are executed can be up to four times that of sequential execution.<ref>{{cite book|last=Zaky|first=Safwat|title=Computer Organization|year=1996|publisher=McGraw-Hill|isbn=0-07-114309-2|pages=[https://archive.org/details/isbn_9780071143097/page/310 310–329]|edition=Fourth|author2=V. Carl Hamacher|author3=Zvonko G. Vranesic|url-access=registration|url=https://archive.org/details/isbn_9780071143097/page/310}}</ref>

The processor usually has two separate units for fetching the instructions and for executing them.<ref>{{cite web|title=Block diagram of 8086 CPU|url=http://www.compeng.dit.ie/staff/tscarff/BIU/bus_interface_unit.htm}}</ref><ref>{{cite book|last=Hall|first=Douglas|title=Microprocessors and Interfacing|year=2006|publisher=Tata McGraw-Hill|isbn=0-07-060167-4|pages=2.12}}</ref> The implementation of a [[Pipeline (computing)|pipelined]] architecture is possible only if the bus interface unit and the execution unit are independent. While the execution unit is decoding or executing an instruction that does not require the use of the [[System bus|data]] and [[address bus]]es, the bus interface unit fetches [[Opcode|instruction opcodes]] from memory. This is much faster than sending out an address, reading the opcode, and then decoding and executing it for each instruction in turn. Fetching the next instruction while the current instruction is being decoded or executed is called pipelining.<ref>{{cite book|last=Hall|first=Douglas|title=Microprocessors and Interfacing|year=2006|publisher=Tata McGraw-Hill|location=New Delhi|isbn=0-07-060167-4|pages=2.13–2.14}}</ref>

The [[Intel 8086|8086]] processor has a six-byte prefetch instruction queue, while the [[Intel 8088|8088]] has a four-byte one. As the execution unit is executing the current instruction, the bus interface unit reads up to six (or four) bytes of opcodes in advance from memory. The queue lengths were chosen based on simulation studies.<ref>{{cite journal |last1=McKevitt |first1=James |last2=Bayliss |first2=John |title=New options from big chips |journal=IEEE Spectrum |date=March 1979 |volume=16 |issue=3 |pages=28–34|doi=10.1109/MSPEC.1979.6367944 |s2cid=25154920 }}</ref>

The scheme breaks down when the execution unit encounters a [[Branch (computer science)|branch]] instruction, i.e. either a jump or a call instruction. In this case, the entire queue must be dumped and the contents pointed to by the instruction pointer must be fetched from memory.

==Drawbacks==
Processors implementing the instruction queue prefetch algorithm are rather technically advanced. The [[CPU design]]-level complexity of such processors is much higher than for regular processors, primarily because two separate units, the [[Bus (computing)|BIU]] and the [[Execution unit|EU]], must operate independently. As the complexity of these chips increases, so does their cost: they are more expensive than their counterparts without a prefetch input queue. However, these disadvantages are greatly offset by the improvement in processor execution time. After the introduction of the prefetch instruction queue in the 8086 processor, successive processors have incorporated this feature.

==x86 example code==
<syntaxhighlight lang=nasm>
code_starts_here:
    mov  bx, ahead
    mov  word [cs:bx], 9090h
ahead:
    jmp  short to_the_end
    ; Some other code

to_the_end:
</syntaxhighlight>

This [[self-modifying code|self-modifying]] program overwrites the jump at ''ahead'' with two [[NOP (code)|NOP]]s (encoded as ''0x90 0x90''). The short jump ''jmp short to_the_end'' assembles into exactly two bytes of machine code, so the two NOPs overwrite this jump and nothing else. (That is, the jump is replaced with do-nothing code.)
Because the machine code of the jump has already been read into the PIQ, and probably also already been executed by the processor ([[superscalar]] processors execute several instructions at once, but they "pretend" that they don't because of the need for [[backward compatibility]]), the change to the code does not alter the execution flow.

===Example program to detect size===
This is an example [[NASM (computer program)|NASM]]-[[syntax]] [[Self-modifying code|self-modifying]] [[x86]]-[[assembly language]] algorithm that determines the size of the PIQ:

<syntaxhighlight lang=nasm>
code_starts_here:
    xor  bx, bx                   ; zero register bx
    xor  ax, ax                   ; zero register ax
    mov  dx, cs
    mov  [code_segment], dx       ; "calculate" codeseg in the far jump below (edx here too)

around:
    cmp  ax, 1                    ; check if ax has been altered
    je   found_size

    ; 0x90 = opcode "nop" (NO oPeration)
    mov  byte [nop_field+bx], 0x90
    inc  bx

    db   0xEA                     ; 0xEA = opcode "far jump"
    dw   flush_queue              ; should be followed by offset (rm = "dw", pm = "dd")
code_segment:
    dw   0                        ; and then the code segment (calculated above)

flush_queue:
    ; 0x40 = opcode "inc ax" (INCrease ax)
    mov  byte [nop_field+bx], 0x40

nop_field:
    times 256 nop

    jmp  around

found_size:
    ;
    ; register bx now contains the size of the PIQ
    ;
    ; this code is for real mode and 16-bit protected mode, but it could easily
    ; be changed to run in 32-bit protected mode as well. just change the "dw" for
    ; the offset to "dd". you also need to change dx to edx at the top.
    ; (dw and dx = 16-bit addressing, dd and edx = 32-bit addressing)
    ;
</syntaxhighlight>

What this code does, essentially, is change the execution flow and determine by [[brute-force search|brute force]] how large the PIQ is: "How far ahead of the current instruction does the code have to be modified for the modification to take effect?" If the modified location is too near (it is already in the PIQ), the update has no effect. If it is far enough away, the change affects the program, and the program has then found the size of the processor's PIQ. Note that if this code is executed under a multitasking OS, a [[context switch]] may lead to a wrong value.

==See also==
* [[Sequence point]]

==References==
{{reflist}}

{{Processor technologies}}

{{DEFAULTSORT:Prefetch Input Queue}}
[[Category:Instruction processing]]