===Assembly language===
A program written in assembly language consists of a series of [[mnemonic]] processor instructions and meta-statements (known variously as declarative operations, directives, pseudo-instructions, pseudo-operations and pseudo-ops), comments and data. Assembly language instructions usually consist of an [[opcode]] mnemonic followed by an [[Operand#Computer science|operand]], which might be a list of data, arguments or parameters.<ref name="Intel_1999"/> Some instructions may be "implied", which means the data upon which the instruction operates is implicitly defined by the instruction itself; such an instruction does not take an operand. Each resulting statement is translated by an [[assembly language assembler|assembler]] into [[machine language]] instructions that can be loaded into memory and executed.

For example, the instruction below tells an [[x86]]/[[IA-32]] processor to move an [[Constant (computer programming)|immediate 8-bit value]] into a [[processor register|register]]. The [[binary code]] for this instruction is 10110 followed by a 3-bit identifier for which register to use. The identifier for the ''AL'' register is 000, so the following [[machine code]] loads the ''AL'' register with the data 01100001.<ref name="Intel_1999"/>
 10110000 01100001
This binary computer code can be made more human-readable by expressing it in [[hexadecimal]] as follows.
 B0 61
Here, <code>B0</code> means "Move a copy of the following value into ''AL''", and <code>61</code> is a hexadecimal representation of the value 01100001, which is 97 in [[decimal]]. Assembly language for the 8086 family provides the [[mnemonic]] [[MOV (x86 instruction)|MOV]] (an abbreviation of ''move'') for instructions such as this, so the machine code above can be written as follows in assembly language, complete with an explanatory comment if required, after the semicolon. This is much easier to read and to remember.
<syntaxhighlight lang="nasm">MOV AL, 61h       ; Load AL with 97 decimal (61 hex)</syntaxhighlight>

In some assembly languages (including this one) the same mnemonic, such as MOV, may be used for a family of related instructions for loading, copying and moving data, whether these are immediate values, values in registers, or memory locations pointed to by values in registers or by immediate (a.k.a. direct) addresses.  Other assemblers may use separate opcode mnemonics such as L for "move memory to register", ST for "move register to memory", LR for "move register to register", MVI for "move immediate operand to memory", etc.

If the same mnemonic is used for different instructions, that means that the mnemonic corresponds to several different binary instruction codes, excluding data (e.g. the <code>61h</code> in this example), depending on the operands that follow the mnemonic.  For example, for the x86/IA-32 CPUs, the Intel assembly language syntax <code>MOV AL, AH</code> represents an instruction that moves the contents of register ''AH'' into register ''AL''. The<ref group="nb" name="NB3"/> hexadecimal form of this instruction is:
 88 E0
The first byte, 88h, identifies a move between a byte-sized register and either another register or memory, and the second byte, E0h, is encoded (with three bit-fields) to specify that both operands are registers, the source is ''AH'', and the destination is ''AL''.
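
The three bit-fields of the second byte can be separated with shifts and masks. The following Python sketch is illustrative only: the function name <code>decode_mov_88</code> is hypothetical, it handles only the register-to-register form described above, and the register table is the standard x86 numbering of the 8-bit registers.

```python
# Standard x86 numbering of the 8-bit registers (index = 3-bit register number).
REG8_NAMES = ["AL", "CL", "DL", "BL", "AH", "CH", "DH", "BH"]


def decode_mov_88(code: bytes) -> str:
    """Decode the two-byte form 88 /r when both operands are registers."""
    opcode, second = code[0], code[1]
    if opcode != 0x88:
        raise ValueError("this sketch only understands opcode 88h")
    mode = second >> 6               # top two bits: 11 means register operand
    source = (second >> 3) & 0b111   # middle three bits: source register
    dest = second & 0b111            # low three bits: destination register
    if mode != 0b11:
        raise NotImplementedError("memory operands are not handled here")
    return f"MOV {REG8_NAMES[dest]}, {REG8_NAMES[source]}"


print(decode_mov_88(bytes([0x88, 0xE0])))
```

For the bytes <code>88 E0</code>, the second byte is 11 100 000 in binary, so the fields select register mode, source register 4 (''AH'') and destination register 0 (''AL'').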

In a case like this where the same mnemonic can represent more than one binary instruction, the assembler determines which instruction to generate by examining the operands.  In the first example, the operand <code>61h</code> is a valid hexadecimal numeric constant and is not a valid register name, so only the <code>B0</code> instruction can be applicable.  In the second example, the operand <code>AH</code> is a valid register name and not a valid numeric constant (hexadecimal, decimal, octal, or binary), so only the <code>88</code> instruction can be applicable.

Assembly languages are designed so that their syntax rules out this kind of ambiguity. For example, in the Intel x86 assembly language, a hexadecimal constant must start with a decimal digit, so that the hexadecimal number 'A' (equal to decimal ten) would be written as <code>0Ah</code> or <code>0AH</code>, not <code>AH</code>, specifically so that it cannot appear to be the name of register ''AH''. (The same rule also prevents ambiguity with the names of registers ''BH'', ''CH'', and ''DH'', as well as with any user-defined symbol that ends with the letter ''H'' and otherwise contains only characters that are hexadecimal digits, such as the word "BEACH".)
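
How an assembler might use this rule to choose between the two opcodes can be sketched in Python. This is a simplified illustration, not the algorithm of any particular assembler: the names <code>classify_operand</code> and <code>select_mov_form</code> are hypothetical, only 8-bit registers are recognised, and only hexadecimal (<code>h</code> suffix) and unsuffixed decimal constants are accepted.

```python
import re

REGISTERS_8BIT = {"AL", "CL", "DL", "BL", "AH", "CH", "DH", "BH"}

# A numeric constant must begin with a decimal digit; a trailing H marks hexadecimal.
NUMERIC_CONSTANT = re.compile(r"[0-9][0-9A-F]*H|[0-9]+")


def classify_operand(token: str) -> str:
    """Return 'register', 'constant' or 'symbol' for a single operand token."""
    text = token.strip().upper()
    if text in REGISTERS_8BIT:
        return "register"
    if NUMERIC_CONSTANT.fullmatch(text):
        return "constant"
    return "symbol"


def select_mov_form(dest: str, src: str) -> str:
    """Pick the opcode family for MOV with an 8-bit register destination."""
    if classify_operand(dest) != "register":
        raise ValueError("this sketch only handles register destinations")
    kind = classify_operand(src)
    if kind == "constant":
        return "B0-B7"  # move immediate byte to register
    if kind == "register":
        return "88"     # move register to register
    raise ValueError(f"unsupported source operand: {src}")


for token in ["61h", "AH", "0AH", "BEACH"]:
    print(token, "->", classify_operand(token))
```

Because <code>AH</code> and <code>BEACH</code> do not begin with a digit, neither can be taken for a number; <code>0AH</code> can only be a number.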

Returning to the original example, while the x86 opcode 10110000 (<code>B0</code>) copies an 8-bit value into the ''AL'' register, 10110001 (<code>B1</code>) moves it into ''CL'' and 10110010 (<code>B2</code>) does so into ''DL''. Assembly language examples for these follow.<ref name="Intel_1999"/>
<syntaxhighlight lang="nasm">
MOV AL, 1h        ; Load AL with immediate value 1
MOV CL, 2h        ; Load CL with immediate value 2
MOV DL, 3h        ; Load DL with immediate value 3
</syntaxhighlight>
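The pattern of a fixed 5-bit prefix followed by a 3-bit register identifier can be expressed as a short Python sketch. The function name <code>encode_mov_r8_imm8</code> is hypothetical; the identifiers for ''AL'', ''CL'' and ''DL'' are those given above, and the other five entries are assumed from the standard x86 8-bit register numbering.

```python
# 3-bit register identifiers used in the low bits of opcodes B0-B7.
REG8_IDS = {"AL": 0, "CL": 1, "DL": 2, "BL": 3, "AH": 4, "CH": 5, "DH": 6, "BH": 7}


def encode_mov_r8_imm8(reg: str, value: int) -> bytes:
    """Encode MOV reg, imm8 as two bytes: 10110rrr followed by the immediate."""
    if not 0 <= value <= 0xFF:
        raise ValueError("immediate must fit in 8 bits")
    opcode = (0b10110 << 3) | REG8_IDS[reg.upper()]
    return bytes([opcode, value])


for reg, value in [("AL", 0x61), ("AL", 1), ("CL", 2), ("DL", 3)]:
    encoded = encode_mov_r8_imm8(reg, value)
    print(f"MOV {reg}, {value:02X}h -> {encoded.hex(' ').upper()}")
```

The first line printed corresponds to the original example, <code>B0 61</code>.
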
The syntax of MOV can also be more complex as the following examples show.<ref name="Evans_2006"/>
<syntaxhighlight lang="nasm">
MOV EAX, [EBX]    ; Move the 4 bytes in memory at the address contained in EBX into EAX
MOV [ESI+EAX], CL ; Move the contents of CL into the byte at address ESI+EAX
MOV DS, DX        ; Move the contents of DX into segment register DS
</syntaxhighlight>
<!-- The MOV to/from segment register opcodes are included below, so an example involving a segment register should be included. -->
In each case, the MOV mnemonic is translated directly into one of the opcodes 88-8C, 8E, A0-A3, B0-BF, C6 or C7 by an assembler, and the programmer normally does not have to know or remember which.<ref name="Intel_1999"/>

Transforming assembly language into machine code is the job of an assembler, and the reverse can at least partially be achieved by a [[disassembler]]. Unlike statements in [[high-level programming language|high-level languages]], many simple assembly statements have a [[bijection|one-to-one correspondence]] with machine language instructions. However, in some cases, an assembler may provide ''pseudoinstructions'' (essentially macros) which expand into several machine language instructions to provide commonly needed functionality. For example, for a machine that lacks a "branch if greater or equal" instruction, an assembler may provide a pseudoinstruction that expands to the machine's "set if less than" and "branch if zero (on the result of the set instruction)". Most full-featured assemblers also provide a rich [[macro (computer science)|macro]] language (discussed below) which is used by vendors and programmers to generate more complex code and data sequences. Since the information about pseudoinstructions and macros defined in the assembler environment is not present in the object program, a disassembler cannot reconstruct the macro and pseudoinstruction invocations but can only disassemble the actual machine instructions that the assembler generated from those abstract assembly-language entities. Likewise, since comments in the assembly language source file are ignored by the assembler and have no effect on the object code it generates, a disassembler cannot recover source comments.
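
The "branch if greater or equal" expansion can be sketched in Python as a simple text rewrite. The mnemonics <code>bge</code>, <code>slt</code> and <code>beq</code>, the scratch register <code>at</code> and the always-zero register <code>zero</code> are assumptions modelled loosely on MIPS-style assemblers, and the function name <code>expand</code> is hypothetical.

```python
SCRATCH = "at"   # assumed register reserved for the assembler's own use
ZERO = "zero"    # assumed register that always reads as zero


def expand(line: str) -> list:
    """Expand the pseudoinstruction 'bge rs, rt, label'; pass others through."""
    text = line.strip()
    mnemonic, _, operands = text.partition(" ")
    if mnemonic.lower() != "bge":
        return [text]
    rs, rt, label = [part.strip() for part in operands.split(",")]
    return [
        f"slt {SCRATCH}, {rs}, {rt}",        # scratch = 1 if rs < rt, else 0
        f"beq {SCRATCH}, {ZERO}, {label}",   # branch when scratch == 0, i.e. rs >= rt
    ]


for real_instruction in expand("bge t0, t1, done"):
    print(real_instruction)
```

Only the two expanded instructions reach the object program, which is why a disassembler would show a set-and-branch pair rather than the original <code>bge</code>.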

Each [[computer architecture]] has its own machine language.  Computers differ in the number and type of operations they support, in the different sizes and numbers of registers, and in the representations of data in storage. While most general-purpose computers are able to carry out essentially the same functionality, the ways they do so differ; the corresponding assembly languages reflect these differences.

Multiple sets of [[mnemonic]]s or assembly-language syntax may exist for a single instruction set, typically instantiated in different assembler programs. In these cases, the most popular one is usually that supplied by the CPU manufacturer and used in its documentation.

Two examples of CPUs that have two different sets of mnemonics are the Intel 8080 family and the Intel 8086/8088. Because Intel claimed copyright on its assembly language mnemonics (on each page of its documentation published in the 1970s and early 1980s, at least), some companies that independently produced CPUs compatible with Intel instruction sets invented their own mnemonics. The [[Zilog Z80]] CPU, an enhancement of the [[Intel 8080A]], supports all the 8080A instructions plus many more; Zilog invented an entirely new assembly language, not only for the new instructions but also for all of the 8080A instructions. For example, where Intel uses the mnemonics ''MOV'', ''MVI'', ''LDA'', ''STA'', ''LXI'', ''LDAX'', ''STAX'', ''LHLD'', and ''SHLD'' for various data transfer instructions, the Z80 assembly language uses the mnemonic ''LD'' for all of them. A similar case is the [[NEC V20]] and [[NEC V30|V30]] CPUs, enhanced copies of the Intel 8088 and 8086, respectively. Like Zilog with the Z80, NEC invented new mnemonics for all of the 8086 and 8088 instructions, to avoid accusations of infringement of Intel's copyright. (It is questionable whether such copyrights can be valid, and later CPU companies such as [[AMD]]<ref group="nb" name="NB1"/> and [[Cyrix]] republished Intel's x86/IA-32 instruction mnemonics exactly with neither permission nor legal penalty.) It is doubtful whether in practice many people who programmed the V20 and V30 actually wrote in NEC's assembly language rather than Intel's; since any two assembly languages for the same instruction set architecture are isomorphic (somewhat like English and [[Pig Latin]]), there is no requirement to use a manufacturer's own published assembly language with that manufacturer's products.
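
The correspondence between the two notations can be illustrated with a small lookup table in Python. The table covers one representative form of each Intel mnemonic listed above; the operand placeholders <code>n</code> (8-bit value) and <code>nn</code> (16-bit value or address) and the function names are notation of this sketch, not of either manufacturer.

```python
# (opcode byte, Intel 8080 notation, Zilog Z80 notation) for the same instruction.
SAME_INSTRUCTION = [
    (0x78, "MOV A,B",  "LD A,B"),
    (0x3E, "MVI A,n",  "LD A,n"),
    (0x3A, "LDA nn",   "LD A,(nn)"),
    (0x32, "STA nn",   "LD (nn),A"),
    (0x21, "LXI H,nn", "LD HL,nn"),
    (0x0A, "LDAX B",   "LD A,(BC)"),
    (0x02, "STAX B",   "LD (BC),A"),
    (0x2A, "LHLD nn",  "LD HL,(nn)"),
    (0x22, "SHLD nn",  "LD (nn),HL"),
]

INTEL_TO_ZILOG = {intel: zilog for _, intel, zilog in SAME_INSTRUCTION}
ZILOG_TO_INTEL = {zilog: intel for _, intel, zilog in SAME_INSTRUCTION}


def to_zilog(intel_statement: str) -> str:
    """Rewrite an Intel-notation statement from the table in Zilog notation."""
    return INTEL_TO_ZILOG[intel_statement]


def to_intel(zilog_statement: str) -> str:
    """Rewrite a Zilog-notation statement from the table in Intel notation."""
    return ZILOG_TO_INTEL[zilog_statement]


for opcode, intel, zilog in SAME_INSTRUCTION:
    print(f"{opcode:02X}  {intel:<10} {zilog}")
```

Each row names the same opcode byte twice, so a statement in either notation can be rewritten in the other without loss within this table.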