==Key concepts==
===Assembler===<!-- This section is linked from [[Computer software]] -->
An '''assembler''' program creates [[object code]] by [[translator (computing)|translating]] combinations of [[mnemonic]]s and [[Syntax (programming languages)|syntax]] for operations and addressing modes into their numerical equivalents. This representation typically includes an ''operation code'' ("[[opcode]]") as well as other control [[bit]]s and data. The assembler also calculates constant expressions and resolves [[identifier|symbolic names]] for memory locations and other entities.<ref name="Salomon_1992"/> The use of symbolic references is a key feature of assemblers, saving tedious calculations and manual address updates after program modifications. Most assemblers also include [[Macro (computer science)|macro]] facilities for performing textual substitution – e.g., to generate common short sequences of instructions as [[inline expansion|inline]], instead of ''called'' [[subroutine]]s.

Some assemblers can also perform simple [[instruction set architecture|instruction set]]-specific [[compiler optimization|optimization]]s. One concrete example is the widely used [[x86]] assemblers from various vendors: most of them can perform [[jump-sizing]],<ref name="Salomon_1992"/> replacing long jumps with short or relative jumps, in any number of passes, on request. Others may even do simple rearrangement or insertion of instructions, such as some assemblers for [[RISC architectures]] that can help optimize [[instruction scheduling]] to exploit the [[CPU pipeline]] as efficiently as possible.<ref>{{cite conference |url=https://www.researchgate.net/publication/262389375 |doi=10.1145/2465554.2465559 |title=Improving processor efficiency by statically pipelining instructions |book-title=Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems |year=2013 |last1=Finlayson |first1=Ian |last2=Davis |first2=Brandon |last3=Gavin |first3=Peter |last4=Uh |first4=Gang-Ryung |last5=Whalley |first5=David |last6=Själander |first6=Magnus |last7=Tyson |first7=Gary |pages=33–44 |isbn=9781450320856 |s2cid=8015812}}</ref>

Assemblers have been available since the 1950s, as the first step above machine language and before [[high-level programming language]]s such as [[Fortran]], [[ALGOL|Algol]], [[COBOL]] and [[Lisp (programming language)|Lisp]]. There have also been several classes of translators and semi-automatic [[code generation (compiler)|code generators]] with properties similar to both assembly and high-level languages, with [[Speedcode]] as perhaps one of the better-known examples.

There may be several assemblers with different [[Syntax (programming languages)|syntax]] for a particular [[Central processing unit|CPU]] or [[instruction set architecture]]. For instance, an instruction to add memory data to a register in an [[x86]]-family processor might be <code>add eax,[ebx]</code> in the original ''[[Intel syntax]]'', whereas this would be written <code>addl (%ebx),%eax</code> in the ''[[AT&T syntax]]'' used by the [[GNU Assembler]]. Despite different appearances, different syntactic forms generally generate the same numeric [[machine code]]. A single assembler may also have different modes in order to support variations in syntactic forms as well as their exact semantic interpretations (such as [[FASM]]-syntax, [[TASM]]-syntax, ideal mode, etc., in the special case of [[x86 assembly language|x86 assembly]] programming).

==== {{Anchor|Two-pass assembler}} Number of passes====
There are two types of assemblers based on how many passes through the source are needed (how many times the assembler reads the source) to produce the object file.
* '''One-pass assemblers''' process the source code once.  For symbols used before they are defined, the assembler will emit [[Erratum|"errata"]] after the eventual definition, telling the [[linker (computing)|linker]] or the loader to patch the locations where the as yet undefined symbols had been used.
* '''Multi-pass assemblers''' create a table with all symbols and their values in the first passes, then use the table in later passes to generate code.
In both cases, the assembler must be able to determine the size of each instruction on the initial passes in order to calculate the addresses of subsequent symbols. This means that if the size of an operation referring to an operand defined later depends on the type or distance of the operand, the assembler will make a pessimistic estimate when first encountering the operation, and if necessary, pad it with one or more "[[NOP (code)|no-operation]]" instructions in a later pass or the errata. In an assembler with [[peephole optimization]], addresses may be recalculated between passes to allow replacing pessimistic code with code tailored to the exact distance from the target.

The original reason for the use of one-pass assemblers was memory size and speed of assembly – often a second pass would require storing the symbol table in memory (to handle [[forward reference]]s), rewinding and rereading the program source on [[magnetic-tape data storage|tape]], or rereading a deck of [[punched card|cards]] or [[punched tape|punched paper tape]]. Later computers with much larger memories (especially disc storage) had the space to perform all necessary processing without such re-reading. The advantage of the multi-pass assembler is that the absence of errata makes the [[linker (computing)|linking process]] (or the [[loader (computing)|program load]] if the assembler directly produces executable code) faster.<ref name="Beck_1996"/>

'''Example:''' in the following code snippet, a one-pass assembler would be able to determine the address of the backward reference <var>BKWD</var> when assembling statement <var>S2</var>, but would not be able to determine the address of the forward reference <var>FWD</var> when assembling the branch statement <var>S1</var>; indeed, <var>FWD</var> may be undefined. A two-pass assembler would determine both addresses in pass 1, so they would be known when generating code in pass 2.
 {{var|S1}}   B    {{var|FWD}}
   ...
 {{var|FWD}}   EQU *
   ...
 {{var|BKWD}}  EQU *
   ...
 {{var|S2}}    B   {{var|BKWD}}
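
The difference between the two approaches can be sketched in Python. This is only a minimal illustration under stated assumptions, not the implementation of any real assembler: it assumes a hypothetical toy machine in which every instruction occupies one address unit, uses <code>NOP</code> statements in place of the elided lines, and the names <code>pass1</code>, <code>pass2</code> and <code>one_pass_unresolved</code> are invented for the example.
<syntaxhighlight lang="python">
# Toy two-pass assembler: pass 1 builds the symbol table, pass 2 resolves operands.
# Assumptions (not from any real assembler): each instruction occupies one address
# unit, and "EQU *" defines a label at the current location without emitting code.

SOURCE = [
    ("S1",   "B",   "FWD"),    # forward reference
    (None,   "NOP", None),
    ("FWD",  "EQU", "*"),
    (None,   "NOP", None),
    ("BKWD", "EQU", "*"),
    (None,   "NOP", None),
    ("S2",   "B",   "BKWD"),   # backward reference
]

def pass1(source):
    """Assign an address to every label."""
    symbols, location = {}, 0
    for label, op, _ in source:
        if label is not None:
            symbols[label] = location
        if op != "EQU":
            location += 1
    return symbols

def pass2(source, symbols):
    """Emit (address, operation, resolved target) using the finished symbol table."""
    code, location = [], 0
    for _, op, operand in source:
        if op == "EQU":
            continue
        target = symbols[operand] if operand is not None else None
        code.append((location, op, target))
        location += 1
    return code

def one_pass_unresolved(source):
    """Return the branch targets a single pass meets before their definition."""
    seen, unresolved = set(), []
    for label, op, operand in source:
        if label is not None:
            seen.add(label)
        if op == "B" and operand not in seen:
            unresolved.append(operand)
    return unresolved

symbols = pass1(SOURCE)
program = pass2(SOURCE, symbols)
print(symbols)
print(program)
print(one_pass_unresolved(SOURCE))
</syntaxhighlight>
In this sketch only <var>FWD</var> is reported as unresolved by the single pass; a one-pass assembler would have to record an erratum for it, whereas the two-pass version resolves both branches directly.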

====High-level assemblers====
More sophisticated [[high-level assembler]]s provide language abstractions such as:
* High-level procedure/function declarations and invocations
* Advanced control structures (IF/THEN/ELSE, SWITCH)
* High-level abstract data types, including structures/records, unions, classes, and sets
* Sophisticated macro processing (although available on ordinary assemblers since the late 1950s for, e.g., the [[IBM 700/7000 series|IBM 700 series]] and [[IBM 700/7000 series|IBM 7000 series]], and since the 1960s for [[IBM System/360]] (S/360), amongst other machines)
* [[Object-oriented programming]] features such as [[class (computer programming)|class]]es, [[Object (computer science)|object]]s, [[Abstraction (computer science)|abstraction]], [[Polymorphism (computer science)|polymorphism]], and [[inheritance (object-oriented programming)|inheritance]]<ref name="Hyde_2003"/>
See [[#Language design|Language design]] below for more details.

===Assembly language===
A program written in assembly language consists of a series of [[mnemonic]] processor instructions and meta-statements (known variously as declarative operations, directives, pseudo-instructions, pseudo-operations and pseudo-ops), comments and data. Assembly language instructions usually consist of an [[opcode]] mnemonic followed by an [[Operand#Computer science|operand]], which might be a list of data, arguments or parameters.<ref name="Intel_1999"/>  Some instructions may be "implied", which means the data upon which the instruction operates is implicitly defined by the instruction itself—such an instruction does not take an operand.  The resulting statement is translated by an [[assembly language assembler|assembler]] into [[machine language]] instructions that can be loaded into memory and executed.

For example, the instruction below tells an [[x86]]/[[IA-32]] processor to move an [[Constant (computer programming)|immediate 8-bit value]] into a [[processor register|register]]. The [[binary code]] for this instruction is 10110 followed by a 3-bit identifier for which register to use. The identifier for the ''AL'' register is 000, so the following [[machine code]] loads the ''AL'' register with the data 01100001.<ref name="Intel_1999"/>
 10110000 01100001
This binary computer code can be made more human-readable by expressing it in [[hexadecimal]] as follows.
 B0 61
Here, <code>B0</code> means "Move a copy of the following value into ''AL''", and <code>61</code> is a hexadecimal representation of the value 01100001, which is 97 in [[decimal]]. Assembly language for the 8086 family provides the [[mnemonic]] [[MOV (x86 instruction)|MOV]] (an abbreviation of ''move'') for instructions such as this, so the machine code above can be written as follows in assembly language, complete with an explanatory comment if required, after the semicolon. This is much easier to read and to remember.
<syntaxhighlight lang="nasm">MOV AL, 61h       ; Load AL with 97 decimal (61 hex)</syntaxhighlight>
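The translation an assembler performs for this one instruction form can be sketched in Python. The function name <code>encode_mov_r8_imm8</code> and the table <code>REG8</code> are hypothetical names chosen for illustration; the sketch covers only the "move immediate byte into 8-bit register" form described above.
<syntaxhighlight lang="python">
# 3-bit identifiers of the 8-bit x86 registers, placed in the low bits of the opcode.
REG8 = {"AL": 0, "CL": 1, "DL": 2, "BL": 3, "AH": 4, "CH": 5, "DH": 6, "BH": 7}

def encode_mov_r8_imm8(register, value):
    """Return the two machine-code bytes for MOV r8, imm8."""
    if not 0 <= value <= 0xFF:
        raise ValueError("immediate must fit in 8 bits")
    opcode = 0b10110000 | REG8[register]   # 10110 followed by the 3-bit register id
    return bytes([opcode, value])

print(encode_mov_r8_imm8("AL", 0x61).hex(" ").upper())
</syntaxhighlight>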

In some assembly languages (including this one) the same mnemonic, such as MOV, may be used for a family of related instructions for loading, copying and moving data, whether these are immediate values, values in registers, or memory locations pointed to by values in registers or by immediate (a.k.a. direct) addresses.  Other assemblers may use separate opcode mnemonics such as L for "move memory to register", ST for "move register to memory", LR for "move register to register", MVI for "move immediate operand to memory", etc.

If the same mnemonic is used for different instructions, that means that the mnemonic corresponds to several different binary instruction codes, excluding data (e.g. the <code>61h</code> in this example), depending on the operands that follow the mnemonic.  For example, for the x86/IA-32 CPUs, the Intel assembly language syntax <code>MOV AL, AH</code> represents an instruction that moves the contents of register ''AH'' into register ''AL''. The<ref group="nb" name="NB3"/> hexadecimal form of this instruction is:
 88 E0
The first byte, 88h, identifies a move between a byte-sized register and either another register or memory, and the second byte, E0h, is encoded (with three bit-fields) to specify that both operands are registers, the source is ''AH'', and the destination is ''AL''.
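
The three bit-fields of the second byte can be made explicit with a short Python sketch. It handles only the register-to-register case of opcode 88h, and the function name <code>encode_mov_r8_r8</code> is a hypothetical one used for illustration.
<syntaxhighlight lang="python">
# 3-bit identifiers of the 8-bit x86 registers.
REG8 = {"AL": 0, "CL": 1, "DL": 2, "BL": 3, "AH": 4, "CH": 5, "DH": 6, "BH": 7}

def encode_mov_r8_r8(dest, src):
    """Return the bytes for MOV dest, src using opcode 88h (register-to-register only)."""
    mode = 0b11                                        # both operands are registers
    second = (mode << 6) | (REG8[src] << 3) | REG8[dest]
    return bytes([0x88, second])

print(encode_mov_r8_r8("AL", "AH").hex(" ").upper())
</syntaxhighlight>
For <code>MOV AL, AH</code> the fields are 11 (register mode), 100 (source ''AH'') and 000 (destination ''AL''), which together give 11100000, or E0h.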

In a case like this where the same mnemonic can represent more than one binary instruction, the assembler determines which instruction to generate by examining the operands.  In the first example, the operand <code>61h</code> is a valid hexadecimal numeric constant and is not a valid register name, so only the <code>B0</code> instruction can be applicable.  In the second example, the operand <code>AH</code> is a valid register name and not a valid numeric constant (hexadecimal, decimal, octal, or binary), so only the <code>88</code> instruction can be applicable.

Assembly languages are designed so that their syntax rules out this kind of ambiguity.  For example, in the Intel x86 assembly language, a hexadecimal constant must start with a decimal digit, so that the hexadecimal number 'A' (equal to decimal ten) would be written as <code>0Ah</code> or <code>0AH</code>, not <code>AH</code>, specifically so that it cannot appear to be the name of register ''AH''.  (The same rule also prevents ambiguity with the names of registers ''BH'', ''CH'', and ''DH'', as well as with any user-defined symbol that ends with the letter ''H'' and otherwise contains only characters that are hexadecimal digits, such as the word "BEACH".)
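
The effect of the leading-digit rule can be sketched in Python. This is a simplified illustration, not the lexer of any real assembler: it recognises only the eight 8-bit register names and hexadecimal constants with an <code>h</code> suffix, and the function name <code>classify</code> is hypothetical.
<syntaxhighlight lang="python">
import re

REGISTERS = {"AL", "CL", "DL", "BL", "AH", "CH", "DH", "BH"}

# A hexadecimal constant must begin with a decimal digit and end with H.
HEX_CONSTANT = re.compile(r"[0-9][0-9A-F]*H")

def classify(operand):
    """Classify an operand as a register, a hexadecimal constant, or a symbol."""
    text = operand.upper()
    if text in REGISTERS:
        return "register"
    if HEX_CONSTANT.fullmatch(text):
        return "hex constant"
    return "symbol"

for operand in ("AH", "0Ah", "61h", "BEACH", "0BEACH"):
    print(operand, "->", classify(operand))
</syntaxhighlight>
Because the pattern for constants requires a leading digit, no register name or alphabetic symbol can match it, so the order of the two checks does not affect the result.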

Returning to the original example, while the x86 opcode 10110000 (<code>B0</code>) copies an 8-bit value into the ''AL'' register, 10110001 (<code>B1</code>) moves it into ''CL'' and 10110010 (<code>B2</code>) does so into ''DL''. Assembly language examples for these follow.<ref name="Intel_1999"/>
<syntaxhighlight lang="nasm">
MOV AL, 1h        ; Load AL with immediate value 1
MOV CL, 2h        ; Load CL with immediate value 2
MOV DL, 3h        ; Load DL with immediate value 3
</syntaxhighlight>
The syntax of MOV can also be more complex as the following examples show.<ref name="Evans_2006"/>
<syntaxhighlight lang="nasm">
MOV EAX, [EBX]    ; Move the 4 bytes in memory at the address contained in EBX into EAX
MOV [ESI+EAX], CL ; Move the contents of CL into the byte at address ESI+EAX
MOV DS, DX        ; Move the contents of DX into segment register DS
</syntaxhighlight>
<!-- The MOV to/from segment register opcodes are included below, so an example involving a segment register should be included. -->
In each case, the MOV mnemonic is translated directly into one of the opcodes 88-8C, 8E, A0-A3, B0-BF, C6 or C7 by an assembler, and the programmer normally does not have to know or remember which.<ref name="Intel_1999"/>

Transforming assembly language into machine code is the job of an assembler, and the reverse can at least partially be achieved by a [[disassembler]]. Unlike [[high-level programming language|high-level languages]], there is a [[bijection|one-to-one correspondence]] between many simple assembly statements and machine language instructions. However, in some cases, an assembler may provide ''pseudoinstructions'' (essentially macros) which expand into several machine language instructions to provide commonly needed functionality. For example, for a machine that lacks a "branch if greater or equal" instruction, an assembler may provide a pseudoinstruction that expands to the machine's "set if less than" and "branch if zero (on the result of the set instruction)". Most full-featured assemblers also provide a rich [[macro (computer science)|macro]] language (discussed below) which is used by vendors and programmers to generate more complex code and data sequences. Since the information about pseudoinstructions and macros defined in the assembler environment is not present in the object program, a disassembler cannot reconstruct the macro and pseudoinstruction invocations but can only disassemble the actual machine instructions that the assembler generated from those abstract assembly-language entities. Likewise, since comments in the assembly language source file are ignored by the assembler and have no effect on the object code it generates, a disassembler cannot recover source comments.
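
The pseudoinstruction expansion described above can be sketched in Python. The example assumes MIPS-style mnemonics (<code>bge</code>, <code>slt</code>, <code>beq</code>) and the scratch register <code>$at</code> purely for illustration; the function name <code>expand</code> is hypothetical, and a real assembler would work on parsed operands rather than strings.
<syntaxhighlight lang="python">
def expand(statement):
    """Expand a 'bge a, b, label' pseudoinstruction; pass other statements through."""
    mnemonic, _, rest = statement.partition(" ")
    if mnemonic != "bge":
        return [statement]
    a, b, label = [field.strip() for field in rest.split(",")]
    return [
        f"slt $at, {a}, {b}",         # $at = 1 if a < b, otherwise 0
        f"beq $at, $zero, {label}",   # branch if $at is zero, i.e. a >= b
    ]

for line in expand("bge $t0, $t1, done"):
    print(line)
</syntaxhighlight>
Only the two expanded instructions reach the object code, which is why a disassembler would show them rather than the original <code>bge</code> statement.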

Each [[computer architecture]] has its own machine language.  Computers differ in the number and type of operations they support, in the different sizes and numbers of registers, and in the representations of data in storage. While most general-purpose computers are able to carry out essentially the same functionality, the ways they do so differ; the corresponding assembly languages reflect these differences.

Multiple sets of [[mnemonic]]s or assembly-language syntax may exist for a single instruction set, typically instantiated in different assembler programs. In these cases, the most popular one is usually that supplied by the CPU manufacturer and used in its documentation.

Two examples of CPUs that have two different sets of mnemonics are the Intel 8080 family and the Intel 8086/8088.  Because Intel claimed copyright on its assembly language mnemonics (on each page of their documentation published in the 1970s and early 1980s, at least), some companies that independently produced CPUs compatible with Intel instruction sets invented their own mnemonics.  The [[Zilog Z80]] CPU, an enhancement of the [[Intel 8080A]], supports all the 8080A instructions plus many more; Zilog invented an entirely new assembly language, not only for the new instructions but also for all of the 8080A instructions.  For example, where Intel uses the mnemonics ''MOV'', ''MVI'', ''LDA'', ''STA'', ''LXI'', ''LDAX'', ''STAX'', ''LHLD'', and ''SHLD'' for various data transfer instructions, the Z80 assembly language uses the mnemonic ''LD'' for all of them.  A similar case is the [[NEC V20]] and [[NEC V30|V30]] CPUs, enhanced copies of the Intel 8088 and 8086, respectively.  Like Zilog with the Z80, NEC invented new mnemonics for all of the 8086 and 8088 instructions, to avoid accusations of infringement of Intel's copyright.  (It is questionable whether such copyrights can be valid, and later CPU companies such as [[AMD]]<ref group="nb" name="NB1"/> and [[Cyrix]] republished Intel's x86/IA-32 instruction mnemonics exactly, with neither permission nor legal penalty.)  It is doubtful whether in practice many people who programmed the V20 and V30 actually wrote in NEC's assembly language rather than Intel's; since any two assembly languages for the same instruction set architecture are isomorphic (somewhat like English and [[Pig Latin]]), there is no requirement to use a manufacturer's own published assembly language with that manufacturer's products.

=== "Hello, world!" on x86 Linux ===
In 32-bit assembly language for Linux on an [[x86]] processor, "Hello, world!" can be printed like this.
<syntaxhighlight lang="nasm">
section .text
    global _start

_start:
    mov edx, len    ; length of string, third argument to write()
    mov ecx, msg    ; address of string, second argument to write()
    mov ebx, 1      ; file descriptor (standard output), first argument to write()
    mov eax, 4      ; system call number for write()
    int 0x80        ; system call trap

    mov ebx, 0      ; exit code, first argument to exit()
    mov eax, 1      ; system call number for exit()
    int 0x80        ; system call trap

section .data
msg db 'Hello, world!', 0xa   ; the string, followed by a newline character
len equ $ - msg               ; length of the string in bytes
</syntaxhighlight>