
==Instruction types==
In general, the features of the modern [[x86 instruction set]] are:

* A compact encoding
** Variable length and alignment independent (encoded as [[Endianness|little endian]], as is all data in the x86 architecture)
** Mainly one-address and two-address instructions, that is to say, the first [[operand]] is also the destination.
** Memory operands as both source and destination are supported (frequently used to read/write stack elements addressed using small immediate offsets).
** Both general and implicit [[register (computing)|register]] usage; although all seven (counting <code>ebp</code>) general registers in 32-bit mode, and all fifteen (counting <code>rbp</code>) general registers in 64-bit mode, can be freely used as [[accumulator (computing)|accumulator]]s or for addressing, most of them are also ''implicitly'' used by certain (more or less) special instructions; affected registers must therefore be temporarily preserved (normally stacked), if active during such instruction sequences.
* Produces conditional flags implicitly through most integer [[Arithmetic logic unit|ALU]] instructions.
* Supports various [[addressing mode]]s including immediate, offset, and scaled index. PC-relative addressing was originally limited to jumps and calls; PC-relative addressing of data (RIP-relative addressing) was introduced as an improvement in the [[x86-64]] architecture.
* Includes [[floating-point arithmetic|floating-point]] instructions that operate on a stack of registers.
* Contains special support for atomic [[read-modify-write]] instructions (<code>xchg</code>, <code>cmpxchg</code>/<code>cmpxchg8b</code>, <code>xadd</code>, and integer instructions which combine with the <code>lock</code> prefix)
* Includes [[SIMD]] instructions (instructions which perform the same operation in parallel on many operands packed into adjacent cells of wider registers).
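
The atomic read-modify-write support listed above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the routine names are hypothetical, and the code assumes 64-bit mode with the pointer in <code>rdi</code> and the value in <code>esi</code>.
<syntaxhighlight lang="nasm">
bits 64
; Hypothetical routine: atomically add esi to the dword at [rdi]
; and return the previous value in eax.
fetch_add32:
    mov     eax, esi
    lock xadd [rdi], eax        ; [rdi] += eax and eax = old [rdi], as one atomic step
    ret

; Hypothetical routine: atomically raise the dword at [rdi] to at least esi
; (signed comparison), using a compare-and-swap loop.
atomic_max32:
    mov     eax, [rdi]          ; eax = value we expect to find
.retry:
    cmp     eax, esi
    jge     .done               ; stored value is already >= esi
    lock cmpxchg [rdi], esi     ; if [rdi] == eax then [rdi] = esi, else eax = [rdi]
    jne     .retry              ; ZF clear: another writer intervened, eax was reloaded
.done:
    ret
</syntaxhighlight>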

===Stack instructions===
The x86 architecture has hardware support for an [[Call stack|execution stack]] mechanism. Instructions such as <code>push</code>, <code>pop</code>, <code>call</code> and <code>ret</code> are used with the properly set up stack to pass parameters, to allocate space for local data, and to save and restore call-return points. The <code>ret</code> ''size'' instruction, which pops the return address and then releases ''size'' further bytes of stack, is useful for implementing space-efficient (and fast) [[calling convention]]s where the callee is responsible for reclaiming stack space occupied by parameters.
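
A callee-cleanup call of this kind might look like the following 32-bit sketch. The labels <code>caller</code> and <code>add2</code> are hypothetical, and the argument order is an assumption made for illustration.
<syntaxhighlight lang="nasm">
bits 32
caller:
    push    dword 7         ; second argument
    push    dword 5         ; first argument
    call    add2            ; pushes the return address and jumps to add2
    ; eax = 12 here; no "add esp, 8" is needed because add2 released the arguments
    ret

add2:
    mov     eax, [esp+4]    ; first argument, just above the return address
    add     eax, [esp+8]    ; second argument
    ret     8               ; pop the return address, then discard 8 bytes of arguments
</syntaxhighlight>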

When setting up a [[stack frame]] to hold local data of a [[recursion (computer science)|recursive procedure]] there are several choices; the high level <code>enter</code> instruction (introduced with the 80186) takes a ''procedure-nesting-depth'' argument as well as a ''local size'' argument, and ''may'' be faster than more explicit manipulation of the registers (such as <code>push bp</code> ; <code>mov bp, sp</code> ; <code>sub sp, ''size''</code>).  Whether it is faster or slower depends on the particular x86-processor implementation as well as the calling convention used by the compiler, programmer or particular program code; most x86 code is intended to run on x86-processors from several manufacturers and on different technological generations of processors, which implies highly varying [[microarchitecture]]s and [[microcode]] solutions as well as varying [[logic gate|gate]]- and [[transistor]]-level design choices.

The full range of addressing modes (including ''immediate'' and ''base+offset''), available even for instructions such as <code>push</code> and <code>pop</code>, makes direct usage of the stack for [[integer]], [[floating-point arithmetic|floating point]] and [[memory address|address]] data simple, and keeps the [[Application binary interface|ABI]] specifications and mechanisms relatively simple compared to some RISC architectures (which require more explicit call stack details).

===Integer ALU instructions===
x86 assembly has the standard mathematical operations <code>add</code>, <code>sub</code>, <code>neg</code>, <code>imul</code> and <code>idiv</code> (for signed integers), with <code>mul</code> and <code>div</code> (for unsigned integers); the [[logical operator]]s <code>and</code>, <code>or</code>, <code>xor</code>, <code>not</code>; arithmetic and logical [[bitshift]]s, <code>sal</code>/<code>sar</code> (for signed integers) and <code>shl</code>/<code>shr</code> (for unsigned integers); rotates with and without carry, <code>rcl</code>/<code>rcr</code> and <code>rol</code>/<code>ror</code>; and a complement of [[Binary-coded decimal|BCD]] arithmetic instructions, <code>aaa</code>, <code>aad</code>, <code>daa</code> and others.
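
The difference between the signed and unsigned forms can be seen in a short 32-bit sketch. The label <code>alu_demo</code> and the constants are chosen only for illustration.
<syntaxhighlight lang="nasm">
bits 32
alu_demo:
    mov     eax, -20
    sar     eax, 2          ; arithmetic shift keeps the sign: eax = -5
    mov     ebx, 0xFFFFFFEC ; the same bit pattern as -20
    shr     ebx, 2          ; logical shift inserts zeros: ebx = 0x3FFFFFFB

    mov     eax, -7
    cdq                     ; sign-extend eax into edx:eax, as idiv expects
    mov     ecx, 2
    idiv    ecx             ; signed divide: eax = -3 (quotient), edx = -1 (remainder)
    ret
</syntaxhighlight>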

===Floating-point instructions===
x86 assembly language includes instructions for a stack-based floating-point unit (FPU). The FPU was an optional separate coprocessor for the 8086 through the 80386; it was an on-chip option for the 80486 series, and it has been a standard feature in every Intel x86 CPU since the Pentium. The FPU instructions include addition, subtraction, negation, multiplication, division, remainder, square roots, integer truncation, fraction truncation, and scale by power of two. The operations also include conversion instructions, which can load or store a value from memory in any of the following formats: binary-coded decimal, 32-bit integer, 64-bit integer, 32-bit floating-point, 64-bit floating-point or 80-bit floating-point (upon loading, the value is converted to the currently used floating-point mode). x86 also includes a number of [[transcendental function]]s, including sine, cosine, tangent, arctangent, exponentiation with the base 2 and logarithms to bases 2, 10, or [[E (mathematical constant)|''e'']].

The stack register to stack register format of the instructions is usually <code>f''op'' st, st(''n'')</code> or <code>f''op'' st(''n''), st</code>, where <code>st</code> is equivalent to <code>st(0)</code>, and <code>st(''n'')</code> is one of the 8 stack registers (<code>st(0)</code>, <code>st(1)</code>, ..., <code>st(7)</code>). As with the integer instructions, the first operand is both the first source operand and the destination operand. <code>fsubr</code> and <code>fdivr</code> should be singled out as first swapping the source operands before performing the subtraction or division. The addition, subtraction, multiplication, division, store and comparison instructions include instruction modes that pop the top of the stack after their operation is complete. So, for example, <code>faddp st(1), st</code> performs the calculation <code>st(1) = st(1) + st(0)</code>, then removes <code>st(0)</code> from the top of stack, thus making what was the result in <code>st(1)</code> the top of the stack in <code>st(0)</code>.
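
As a minimal sketch of the popping forms, the following 32-bit fragment evaluates ''a''·''b'' + ''c''. The variable names and values are hypothetical, and NASM spells the registers <code>st0</code>, <code>st1</code> rather than <code>st(0)</code>, <code>st(1)</code>.
<syntaxhighlight lang="nasm">
bits 32
section .data
val_a:   dq 1.5
val_b:   dq 2.0
val_c:   dq 0.25
result:  dq 0.0

section .text
fpu_demo:
    fld     qword [val_a]   ; st0 = a
    fld     qword [val_b]   ; st0 = b, st1 = a
    fmulp   st1, st0        ; st1 = a*b, then pop, so st0 = a*b
    fadd    qword [val_c]   ; st0 = a*b + c
    fstp    qword [result]  ; store 3.25 and pop, leaving the FPU stack empty
    ret
</syntaxhighlight>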

===SIMD instructions===
Modern x86 CPUs contain [[Single instruction, multiple data|SIMD]] instructions, which largely perform the same operation in parallel on many values encoded in a wide SIMD register. Various instruction technologies support different operations on different register sets, but taken as a complete whole (from [[MMX (instruction set)|MMX]] to [[SSE4#SSE4.2|SSE4.2]]) they include general integer and floating-point computations (addition, subtraction, multiplication, shift, minimization, maximization, comparison, division or square root). So for example, <code>paddw mm0, mm1</code> performs 4 parallel 16-bit (indicated by the <code>w</code>) integer adds (indicated by the <code>padd</code>) of the values in <code>mm1</code> to those in <code>mm0</code> and stores the result in <code>mm0</code>. [[Streaming SIMD Extensions]] or SSE also includes a floating-point mode in which only the very first value of the registers is actually modified (expanded in [[SSE2]]). Some other unusual instructions have been added including a [[sum of absolute differences]] (used for [[motion estimation]] in [[video compression]], such as is done in [[MPEG]]) and a 16-bit multiply accumulation instruction (useful for software-based alpha-blending and [[digital filter]]ing). SSE (since [[SSE3]]) and [[3DNow!]] extensions include addition and subtraction instructions for treating paired floating-point values like complex numbers.
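
The <code>paddw</code> example can be placed in context with a short 32-bit sketch. The data labels and values are hypothetical.
<syntaxhighlight lang="nasm">
bits 32
section .data
vec_a:   dw 1, 2, 3, 4
vec_b:   dw 10, 20, 30, 40
vec_sum: dw 0, 0, 0, 0

section .text
mmx_demo:
    movq    mm0, [vec_a]    ; mm0 = four packed 16-bit words
    movq    mm1, [vec_b]
    paddw   mm0, mm1        ; four independent 16-bit adds, wrapping on overflow
    movq    [vec_sum], mm0  ; vec_sum = 11, 22, 33, 44
    emms                    ; release the x87 registers that the MMX registers alias
    ret
</syntaxhighlight>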

These instruction sets also include numerous fixed sub-word instructions for shuffling, inserting and extracting values within the registers. In addition there are instructions for moving data between the integer registers and the XMM registers (used in SSE) or the FPU registers (used in MMX).

=== {{Anchor|Data manipulation instructions}}Memory instructions ===
The x86 processor also includes complex addressing modes for addressing memory with an immediate offset, a register, a register with an offset, a scaled register with or without an offset, and a register with an optional offset and another scaled register. So for example, one can encode <code>mov eax, [Table + ebx + esi*4]</code> as a single instruction which loads 32 bits of data from the address computed as <code>(Table + ebx + esi * 4)</code> offset from the <code>ds</code> selector, and stores it to the <code>eax</code> register. In general x86 processors can load and use memory matched to the size of any register they are operating on. (The SIMD instructions also include half-load instructions.)
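
The address arithmetic in that example can be traced with concrete values. This is a 32-bit sketch in which the table contents and register values are assumptions chosen for illustration.
<syntaxhighlight lang="nasm">
bits 32
section .data
Table:  dd 10, 20, 30, 40, 50, 60, 70, 80

section .text
lookup:
    mov     ebx, 8                      ; base: a byte offset into the table
    mov     esi, 3                      ; index, scaled by the 4-byte element size
    mov     eax, [Table + ebx + esi*4]  ; address = Table + 8 + 12 = Table + 20, so eax = 60
    lea     edx, [Table + ebx + esi*4]  ; lea computes the same address without accessing memory
    ret
</syntaxhighlight>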

Most 2-operand x86 instructions, including integer ALU instructions,
use a standard "[[addressing mode]] byte"<ref>
Curtis Meadow.
[http://aturing.umcs.maine.edu/~meadow/courses/cos335/8086-instformat.pdf "Encoding of 8086 Instructions"].
</ref>
often called the [[modR/M|MOD-REG-R/M byte]].<ref>
Igor Kholodov.
[http://www.c-jump.com/CIS77/CPU/x86/X77_0060_mod_reg_r_m_byte.htm "6. Encoding x86 Instruction Operands, MOD-REG-R/M Byte"].
</ref><ref>

[http://www.cs.loyola.edu/~binkley/371/Encoding_Real_x86_Instructions.html "Encoding x86 Instructions"].
</ref><ref>
Michael Abrash.
"Zen of Assembly Language: Volume I, Knowledge".
"Chapter 7: Memory Addressing".
Section [http://www.jagregory.com/abrash-zen-of-asm/ "mod-reg-rm Addressing"].
</ref>
Many 32-bit x86 instructions also have a [[ModR/M|SIB addressing mode byte]] that follows the MOD-REG-R/M byte.<ref>
Intel 80386 Programmer's Reference Manual.
[https://www.scs.stanford.edu/05au-cs240c/lab/i386/s17_02.htm "17.2.1 ModR/M and SIB Bytes"]
</ref><ref>
[https://wiki.osdev.org/X86-64_Instruction_Encoding#ModR.2FM_and_SIB_bytes "X86-64 Instruction Encoding: ModR/M and SIB bytes"]
</ref><ref>
[http://uglyduck.vajn.icu/PDF/Intel/OPx86.pdf "Figure 2-1. Intel 64 and IA-32 Architectures Instruction Format"].
</ref><ref>
[https://paul.bone.id.au/blog/2018/09/26/more-x86-addressing/ "x86 Addressing Under the Hood"].
</ref><ref name="mccamant" >
Stephen McCamant.
[https://www-users.cse.umn.edu/~smccaman/courses/8980/spring2020/lectures/01-x86-intro-arith-8up.pdf "Manual and Automated Binary Reverse Engineering"].
</ref>

In principle, because the instruction opcode is separate from the addressing mode byte, those instructions are [[orthogonal instruction set|orthogonal]]: any of those opcodes can be combined with any addressing mode.
However, the x86 instruction set is generally considered non-orthogonal because many other opcodes have a fixed addressing mode (they have no addressing mode byte), and every register is special.<ref name="mccamant" /><ref>
[https://locklessinc.com/articles/instruction_wishlist/ "X86 Instruction Wishlist"].
</ref>

The x86 instruction set includes string load, store, move, scan and compare instructions (<code>lods</code>, <code>stos</code>, <code>movs</code>, <code>scas</code> and <code>cmps</code>) which perform each operation on an element of a specified size (<code>b</code> for 8-bit byte, <code>w</code> for 16-bit word, <code>d</code> for 32-bit double word) and then increment or decrement (depending on DF, the direction flag) the implicit address register (<code>si</code> for <code>lods</code>, <code>di</code> for <code>stos</code> and <code>scas</code>, and both for <code>movs</code> and <code>cmps</code>). For the load, store and scan operations, the implicit target/source/comparison register is <code>al</code>, <code>ax</code> or <code>eax</code> (depending on size). The implicit segment registers used are <code>ds</code> for <code>si</code> and <code>es</code> for <code>di</code>. When one of these instructions carries a repeat prefix (<code>rep</code>, <code>repe</code> or <code>repne</code>), the <code>cx</code> or <code>ecx</code> register is used as a decrementing counter, and the operation stops when the counter reaches zero or (for scans and comparisons) when the condition tested by the prefix no longer holds (inequality for <code>repe</code>, equality for <code>repne</code>). Over the years the performance of some of these instructions was neglected, and in certain cases it is now possible to get faster results by writing out the equivalent loop explicitly. Intel and AMD have since improved some of the instructions, and a few now have very respectable performance, so programmers should consult recent benchmarks for their target processors before choosing a particular instruction from this group.
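
Two common uses of the repeated string instructions can be sketched as follows. The routine names are hypothetical, and the code assumes 32-bit mode with a flat memory model in which <code>ds</code> and <code>es</code> address the same memory.
<syntaxhighlight lang="nasm">
bits 32
; Hypothetical routine: copy ecx bytes from [esi] to [edi].
copy_bytes:
    cld                     ; DF = 0, so esi and edi advance upward
    rep movsb               ; repeat: [es:edi] = [ds:esi], esi++, edi++, ecx--, until ecx = 0
    ret

; Hypothetical routine: return in eax the length of the NUL-terminated string at [edi].
string_length:
    cld
    xor     eax, eax        ; al = 0, the byte to search for
    mov     ecx, -1         ; effectively unlimited count
    repne scasb             ; compare al with [es:edi], edi++, ecx--, stop on a match
    not     ecx             ; ecx = number of bytes scanned, including the NUL
    dec     ecx             ; exclude the terminator
    mov     eax, ecx
    ret
</syntaxhighlight>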

The stack is a region of memory with an associated ‘stack pointer’, which points to the top of the stack, that is, to the most recently stored item, which lies at the lowest address in use because the stack grows downward. The stack pointer is decremented when items are added (‘push’) and incremented after items are removed (‘pop’). In 16-bit mode, this implicit stack pointer is addressed as SS:[SP], in 32-bit mode it is SS:[ESP], and in 64-bit mode it is [RSP]. The size of each stored value is assumed to match the operating mode of the processor (i.e., 16, 32, or 64 bits), which is the default width of the <code>push</code>/<code>pop</code>/<code>call</code>/<code>ret</code> instructions. Also included are the instructions <code>enter</code> and <code>leave</code>, which reserve and release space at the top of the stack while setting up and tearing down a stack frame pointer in <code>bp</code>/<code>ebp</code>/<code>rbp</code>. However, direct setting of, or addition to and subtraction from, the <code>sp</code>/<code>esp</code>/<code>rsp</code> register is also supported, so the <code>enter</code>/<code>leave</code> instructions are generally unnecessary.

This code is the beginning of a function typical for a high-level language when compiler optimisation is turned off for ease of debugging:
<syntaxhighlight lang="nasm">
 push    rbp       ; Save the calling function’s stack frame pointer (rbp register)
 mov     rbp, rsp  ; Make a new stack frame below our caller’s stack
 sub     rsp, 32   ; Reserve 32 bytes of stack space for this function’s local variables.
                   ; Local variables will be below rbp and can be referenced relative to rbp,
                   ; again best for ease of debugging, but for best performance rbp will not
                   ; be used at all, and local variables would be referenced relative to rsp
                   ; because, apart from the code saving, rbp then is free for other uses.
  …       …        ; However, if rbp is altered here, its value should be preserved for the caller.
 mov [rbp-8], rdx  ; Example of writing to a local variable (by its memory location) from register rdx
</syntaxhighlight>
The first three instructions of this prologue are functionally equivalent to just:
<syntaxhighlight lang="nasm"> enter   32, 0</syntaxhighlight>
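
The matching epilogue can use <code>leave</code>, which undoes the frame set up by <code>enter</code>. A minimal sketch of a complete function, with a hypothetical name and an arbitrary local variable:
<syntaxhighlight lang="nasm">
bits 64
example_function:
    enter   32, 0           ; same effect as: push rbp / mov rbp, rsp / sub rsp, 32
    mov     [rbp-8], rdx    ; write a local variable
    mov     rax, [rbp-8]    ; read it back as the return value
    leave                   ; same effect as: mov rsp, rbp / pop rbp
    ret
</syntaxhighlight>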

Other instructions for manipulating the stack include <code>pushfd</code> (32-bit) / <code>pushfq</code> (64-bit) and <code>popfd</code> / <code>popfq</code> for storing and retrieving the EFLAGS (32-bit) / RFLAGS (64-bit) register.
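
A minimal 64-bit sketch of two typical uses, reading the flags into a general register and preserving them around code that changes them (the routine names are hypothetical):
<syntaxhighlight lang="nasm">
bits 64
read_rflags:
    pushfq                  ; push the 64-bit RFLAGS image
    pop     rax             ; rax = copy of RFLAGS
    ret

with_flags_preserved:
    pushfq                  ; save the caller's flags, including DF
    std                     ; code that needs DF = 1 would go here
    popfq                   ; restore the saved flags, so DF returns to its old value
    ret
</syntaxhighlight>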

Values for a SIMD load or store are assumed to be packed in adjacent memory positions and are placed into the SIMD register in sequential little-endian order. Some SSE load and store instructions require 16-byte alignment to function properly. The SIMD instruction sets also include "prefetch" instructions, which perform the load but do not target any register and are used to bring data into the cache. The SSE instruction sets also include non-temporal store instructions, which store straight to memory without allocating a cache line if the destination is not already cached (otherwise they behave like a regular store).
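
These load and store variants can be combined as in the following 64-bit sketch. The routine name is hypothetical, and the code assumes that <code>rdi</code> is 16-byte aligned while <code>rsi</code> may have any alignment.
<syntaxhighlight lang="nasm">
bits 64
; Hypothetical routine: copy 32 bytes from [rsi] to [rdi].
copy32_streaming:
    prefetchnta [rsi + 64]      ; hint only: loads no register and does not fault
    movups  xmm0, [rsi]         ; unaligned load, valid for any address
    movups  xmm1, [rsi + 16]    ; (movaps would fault unless rsi were 16-byte aligned)
    movntps [rdi], xmm0         ; non-temporal store, requires 16-byte alignment
    movntps [rdi + 16], xmm1
    sfence                      ; order the streaming stores before any later stores
    ret
</syntaxhighlight>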

Most generic integer and floating-point instructions can use a complex memory address as the second source operand; SIMD instructions generally can as well, although many legacy SSE instructions require such an operand to be 16-byte aligned. Integer instructions can also accept one memory operand as a destination operand.
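
A brief 32-bit sketch of memory operands used as source and as destination (the labels and data are hypothetical):
<syntaxhighlight lang="nasm">
bits 32
section .data
counter: dd 0
scale:   dq 2.0
table:   dd 1, 2, 3, 4

section .text
memory_operands:
    mov     esi, 2
    add     eax, [table + esi*4]    ; memory as second source: eax += table[2]
    add     [counter], eax          ; memory as destination: counter += eax
    inc     dword [counter]         ; size keyword needed when no register implies the width
    fld1                            ; st0 = 1.0
    fmul    qword [scale]           ; memory as second source: st0 = st0 * 2.0
    fstp    st0                     ; discard the result, leaving the FPU stack empty
    ret
</syntaxhighlight>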