===Instruction set===
The original (and subsequent) ARM implementation was hardwired without [[microcode]], like the much simpler [[8-bit computing|8-bit]] [[MOS Technology 6502|6502]] processor used in prior Acorn microcomputers.

The 32-bit ARM architecture (and the 64-bit architecture for the most part) includes the following RISC features:
* [[Load–store architecture]].
* No support for [[data structure alignment|unaligned memory accesses]] in the original version of the architecture. ARMv6 and later, except some microcontroller versions, support unaligned accesses for half-word and single-word load/store instructions with some limitations, such as no guaranteed [[linearizability|atomicity]].<ref>{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka15414.html |title=How does the ARM Compiler support unaligned accesses? |year=2011 |access-date=5 October 2013 |url-status=dead |archive-url=https://web.archive.org/web/20131014084800/http://infocenter.arm.com/help/topic/com.arm.doc.faqs/ka15414.html |archive-date=14 October 2013}}</ref><ref>{{cite web |url=http://www.heyrick.co.uk/armwiki/Unaligned_data_access |title=Unaligned data access |access-date=5 October 2013}}</ref>
* Uniform 16 × 32-bit [[register file]] (including the program counter, stack pointer and the link register).
* Fixed instruction width of 32&nbsp;bits to ease decoding and [[instruction pipelining|pipelining]], at the cost of decreased [[code density]]. Later, the [[#Thumb|Thumb instruction set]] added 16-bit instructions and increased code density.
* Mostly single clock-cycle execution.
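
As a hedged illustration of the alignment point in the list above, portable C code often avoids depending on hardware support for unaligned accesses by copying bytes with <code>memcpy</code>, leaving the compiler to pick a load sequence the target permits. The helper name <code>load_u32_unaligned</code> is hypothetical and is not taken from any ARM library.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from an address that may not be
   4-byte aligned. memcpy avoids dereferencing a misaligned pointer. */
uint32_t load_u32_unaligned(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* byte-wise copy, valid at any alignment */
    return v;
}
</syntaxhighlight>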

To compensate for the simpler design, compared with processors like the Intel 80286 and [[Motorola 68020]], some additional design features were used:
* Conditional execution of most instructions reduces branch overhead and compensates for the lack of a [[branch predictor]] in early chips.
* Arithmetic instructions alter [[Condition Code Register|condition code]]s only when desired.
* 32-bit [[barrel shifter]] can be used without performance penalty with most arithmetic instructions and address calculations.
* Powerful indexed [[addressing mode]]s.
* A [[link register]] supports fast leaf function calls.
* A simple, but fast, 2-priority-level [[interrupt]] subsystem has switched register banks.

====Arithmetic instructions====
ARM includes integer arithmetic operations for add, subtract, and multiply; some versions of the architecture also support divide operations.

ARM supports 32-bit × 32-bit multiplies with either a 32-bit result or 64-bit result, though Cortex-M0 / M0+ / M1 cores do not support 64-bit results.<ref name="M0-TRM">{{cite web |url=http://infocenter.arm.com/help/topic/com.arm.doc.ddi0432c/DDI0432C_cortex_m0_r0p0_trm.pdf |title=Cortex-M0 r0p0 Technical Reference Manual |website=Arm}}</ref> Some ARM cores also support 16-bit × 16-bit and 32-bit × 16-bit multiplies.
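
The difference between the 32-bit and 64-bit result forms can be sketched in C. The function names below are illustrative only; they are named after the <code>MUL</code>, <code>UMULL</code> and <code>SMULL</code> instructions, whose arithmetic they model, but whether a compiler actually emits those instructions depends on the target core and options.

<syntaxhighlight lang="c">
#include <stdint.h>

/* 32 x 32 -> low 32 bits of the product (the result MUL keeps). */
uint32_t mul32(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * b);   /* truncates modulo 2^32 */
}

/* 32 x 32 -> full 64-bit unsigned product (the operation of UMULL). */
uint64_t umull(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* 32 x 32 -> full 64-bit signed product (the operation of SMULL). */
int64_t smull(int32_t a, int32_t b) {
    return (int64_t)a * b;
}
</syntaxhighlight>

For example, multiplying 0x10000 by 0x10000 yields 0 in the 32-bit form but 0x100000000 in the 64-bit form.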

The divide instructions are only included in the following ARM architectures:
* Armv7-M and Armv7E-M architectures always include divide instructions.<ref>{{cite web |url=https://developer.arm.com/documentation/ddi0403/latest/ |title=ARMv7-M Architecture Reference Manual |publisher=Arm |access-date=18 July 2022}}</ref>
* Armv7-R architecture always includes divide instructions in the Thumb instruction set, but optionally in its 32-bit instruction set.<ref name="ARMv7-AR-Ref">{{cite web |url=https://developer.arm.com/documentation/ddi0406/latest |title=ARMv7-A and ARMv7-R Architecture Reference Manual; Arm Holdings |publisher=arm.com |access-date=19 January 2013}}</ref>
* Armv7-A architecture optionally includes the divide instructions. The instructions might not be implemented, or implemented only in the Thumb instruction set, or implemented in both the Thumb and ARM instruction sets, or implemented if the Virtualization Extensions are included.<ref name="ARMv7-AR-Ref"/>
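
On cores without divide instructions, division is normally performed by a software routine. The following is a minimal sketch of one classic approach, restoring (shift-and-subtract) division; it is not the routine used by any particular toolchain, and the name <code>udiv32_soft</code> is hypothetical.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Restoring division: returns n / d and stores n % d in *rem.
   The caller must ensure d != 0. */
uint32_t udiv32_soft(uint32_t n, uint32_t d, uint32_t *rem) {
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next dividend bit */
        if (r >= d) {                     /* divisor fits: subtract, set quotient bit */
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }
    *rem = r;
    return q;
}
</syntaxhighlight>

The loop always runs 32 iterations, which is one reason a hardware divide instruction is attractive where it is available.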

====Registers====
{| class="wikitable" style="float: right; margin-left: 1.5em; margin-right: 0; margin-top: 0;"
|+ Registers across CPU modes
|-
! usr !! sys !! svc !! abt !! und !! [[Interrupt request|irq]] !! [[Fast interrupt request|fiq]]
|-
| colspan="7" style="text-align:center;"| R0
|-
| colspan="7" style="text-align:center;"| R1
|-
| colspan="7" style="text-align:center;"| R2
|-
| colspan="7" style="text-align:center;"| R3
|-
| colspan="7" style="text-align:center;"| R4
|-
| colspan="7" style="text-align:center;"| R5
|-
| colspan="7" style="text-align:center;"| R6
|-
| colspan="7" style="text-align:center;"| R7
|- align=center
| colspan=6 | R8 || R8_fiq
|- align=center
| colspan=6 | R9 || R9_fiq
|- align=center
| colspan=6 | R10 || R10_fiq
|- align=center
| colspan=6 | R11 || R11_fiq
|- align=center
| colspan=6 | R12 || R12_fiq
|- align=center
| colspan=2 | R13 || R13_svc || R13_abt || R13_und || R13_irq || R13_fiq
|- align=center
| colspan=2 | R14 || R14_svc || R14_abt || R14_und || R14_irq || R14_fiq
|-
| colspan="7" style="text-align:center;"| R15
|-
| colspan="7" style="text-align:center;"| CPSR
|- align=center
| colspan=2 | || SPSR_svc || SPSR_abt || SPSR_und || SPSR_irq || SPSR_fiq
|}

Registers R0 through R7 are the same across all CPU modes; they are never banked.

Registers R8 through R12 are the same across all CPU modes except FIQ mode.  FIQ mode has its own distinct R8 through R12 registers.

R13 and R14 are banked across all privileged CPU modes except system mode. That is, each mode that can be entered because of an exception has its own R13 and R14. These registers generally contain the stack pointer and the return address from function calls, respectively.
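
The banking rules described above can be summarised as a small C predicate. This is only a model of the table, assuming the seven modes shown there; the type and function names are hypothetical.

<syntaxhighlight lang="c">
#include <stdbool.h>

/* Modes in the order of the table's columns. */
enum cpu_mode { MODE_USR, MODE_SYS, MODE_SVC, MODE_ABT, MODE_UND, MODE_IRQ, MODE_FIQ };

/* Returns true when register `reg` (0..15) has a private copy in `mode`. */
bool is_banked(enum cpu_mode mode, int reg) {
    if (reg >= 8 && reg <= 12)
        return mode == MODE_FIQ;                       /* R8_fiq .. R12_fiq */
    if (reg == 13 || reg == 14)
        return mode != MODE_USR && mode != MODE_SYS;   /* R13_<mode>, R14_<mode> */
    return false;                                      /* R0-R7 and R15 are shared */
}
</syntaxhighlight>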

Aliases:
* R13 is also referred to as SP, the [[stack pointer]].
* R14 is also referred to as LR, the [[link register]].
* R15 is also referred to as PC, the [[program counter]].

The Current Program Status Register (CPSR) has the following 32&nbsp;bits.<ref>{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/I27695.html |title=ARM Information Center |access-date=10 July 2015}}</ref>
* M (bits 0–4) are the processor mode bits.
* T (bit 5) is the Thumb state bit.
* F (bit 6) is the FIQ disable bit.
* I (bit 7) is the IRQ disable bit.
* A (bit 8) is the imprecise data abort disable bit.
* E (bit 9) is the data endianness bit.
* IT (bits 10–15 and 25–26) are the if-then state bits.
* GE (bits 16–19) are the greater-than-or-equal-to bits.
* DNM (bits 20–23) are the do not modify bits.
* J (bit 24) is the Java state bit.
* Q (bit 27) is the sticky overflow bit.
* V (bit 28) is the overflow bit.
* C (bit 29) is the carry/borrow/extend bit.
* Z (bit 30) is the zero bit.
* N (bit 31) is the negative/less than bit.
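
The bit layout listed above can be decoded with shifts and masks. The sketch below is illustrative: the structure and function names are hypothetical, and the IT and DNM fields are omitted for brevity.

<syntaxhighlight lang="c">
#include <stdint.h>

struct cpsr_fields {
    unsigned n, z, c, v, q;   /* bits 31, 30, 29, 28, 27 */
    unsigned j;               /* bit 24 */
    unsigned ge;              /* bits 16-19 */
    unsigned e, a, i, f, t;   /* bits 9, 8, 7, 6, 5 */
    unsigned mode;            /* bits 0-4 */
};

/* Extract `width` bits (1..31) of `x`, starting at bit `lsb`. */
uint32_t bits(uint32_t x, unsigned lsb, unsigned width) {
    return (x >> lsb) & ((1u << width) - 1u);
}

struct cpsr_fields decode_cpsr(uint32_t cpsr) {
    struct cpsr_fields s;
    s.n = bits(cpsr, 31, 1);  s.z = bits(cpsr, 30, 1);
    s.c = bits(cpsr, 29, 1);  s.v = bits(cpsr, 28, 1);
    s.q = bits(cpsr, 27, 1);  s.j = bits(cpsr, 24, 1);
    s.ge = bits(cpsr, 16, 4);
    s.e = bits(cpsr, 9, 1);   s.a = bits(cpsr, 8, 1);
    s.i = bits(cpsr, 7, 1);   s.f = bits(cpsr, 6, 1);
    s.t = bits(cpsr, 5, 1);
    s.mode = bits(cpsr, 0, 5);
    return s;
}
</syntaxhighlight>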

====Conditional execution====
Almost every ARM instruction has a conditional execution feature called [[predication (computer architecture)|predication]], which is implemented with a 4-bit condition code selector (the predicate). To allow for unconditional execution, one of the four-bit codes causes the instruction to be always executed. Most other CPU architectures only have condition codes on branch instructions.<ref>{{cite web |title=Condition Codes 1: Condition flags and codes |url=https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/condition-codes-1-condition-flags-and-codes |website=ARM Community |date=11 September 2013 |access-date=26 September 2019}}</ref>

Though the predicate takes up four of the 32&nbsp;bits in an instruction code, and thus cuts down significantly on the encoding bits available for displacements in memory access instructions, it avoids branch instructions when generating code for small [[conditional (programming)|<code>if</code> statements]]. Apart from eliminating the branch instructions themselves, this preserves the fetch/decode/execute pipeline at the cost of only one cycle per skipped instruction.
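
How a predicate is checked against the N, Z, C and V flags can be sketched in C. The condition numbering below follows the standard ARM condition-code table as commonly documented (0 is EQ through 14, which is AL); the function name <code>cond_passed</code> is hypothetical, and the value 15 is treated as "never" here purely as a simplifying assumption, since its meaning depends on the architecture version.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

/* Returns true if an instruction with 4-bit condition `cond` would execute,
   given the flags held in bits 31-28 of `cpsr`. */
bool cond_passed(unsigned cond, uint32_t cpsr) {
    bool n = (cpsr >> 31) & 1u, z = (cpsr >> 30) & 1u;
    bool c = (cpsr >> 29) & 1u, v = (cpsr >> 28) & 1u;
    switch (cond & 0xFu) {
    case 0x0: return z;               /* EQ: equal */
    case 0x1: return !z;              /* NE: not equal */
    case 0x2: return c;               /* CS: carry set */
    case 0x3: return !c;              /* CC: carry clear */
    case 0x4: return n;               /* MI: negative */
    case 0x5: return !n;              /* PL: positive or zero */
    case 0x6: return v;               /* VS: overflow */
    case 0x7: return !v;              /* VC: no overflow */
    case 0x8: return c && !z;         /* HI: unsigned higher */
    case 0x9: return !c || z;         /* LS: unsigned lower or same */
    case 0xA: return n == v;          /* GE: signed >= */
    case 0xB: return n != v;          /* LT: signed < */
    case 0xC: return !z && n == v;    /* GT: signed > */
    case 0xD: return z || n != v;     /* LE: signed <= */
    case 0xE: return true;            /* AL: always */
    default:  return false;           /* 0xF: not modelled (assumption) */
    }
}
</syntaxhighlight>

A skipped instruction still passes through the pipeline; it simply has no effect when this check fails.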

An algorithm that provides a good example of conditional execution is the subtraction-based [[Euclidean algorithm]] for computing the [[greatest common divisor]].  In the [[C (programming language)|C programming language]], the algorithm can be written as:

<syntaxhighlight lang="c">
int gcd(int a, int b) {
  while (a != b)  // We enter the loop when a < b or a > b, but not when a == b
    if (a > b)   // When a > b we do this
      a -= b;
    else         // When a < b we do that (no "if (a < b)" needed since a != b is checked in while condition)
      b -= a;
  return a;
}
</syntaxhighlight>

The same algorithm can be rewritten in a way closer to target ARM [[instruction set architecture|instructions]] as:

<syntaxhighlight lang="c">
loop:
    // Compare a and b
    GT = a > b;
    LT = a < b;
    NE = a != b;

    // Perform operations based on flag results
    if (GT) a -= b;    // Subtract *only* if greater-than
    if (LT) b -= a;    // Subtract *only* if less-than
    if (NE) goto loop; // Loop *only* if compared values were not equal
    return a;
</syntaxhighlight>
and coded in [[assembly language]] as:<!-- using nasm because "gas", although correct, does not recognize all the insns. the nasm lexer just looks for all uppercase on the other hand. -->
<syntaxhighlight lang="nasm">
; assign a to register r0, b to r1
loop:   CMP    r0, r1       ; set condition "NE" if (a ≠ b),
                            ;               "GT" if (a > b),
                            ;            or "LT" if (a < b)
        SUBGT  r0, r0, r1   ; if "GT" (Greater Than), then a = a − b
        SUBLT  r1, r1, r0   ; if "LT" (Less    Than), then b = b − a
        BNE  loop           ; if "NE" (Not Equal), then loop
        BX   lr             ; return
</syntaxhighlight>
which avoids the branches around the <code>then</code> and <code>else</code> clauses. If <code>r0</code> and <code>r1</code> are equal then neither of the <code>SUB</code> instructions will be executed, so no conditional branch is needed to implement the <code>while</code> check at the top of the loop; such a branch would have been required had <code>SUBLE</code> (less than or equal) been used, for example.

One of the ways that Thumb code provides a more dense encoding is to remove the four-bit selector from non-branch instructions.

====Other features====
Another feature of the [[instruction set]] is the ability to fold shifts and rotates into the ''data processing'' (arithmetic, logical, and register-register move) instructions, so that, for example, the statement in [[C (programming language)|C]] language:

<syntaxhighlight lang="c">a += (j << 2);</syntaxhighlight>

could be rendered as a one-word, one-cycle instruction:<ref>{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0214b/ch09s01s02.html |title=9.1.2. Instruction cycle counts}}</ref>

<syntaxhighlight lang="nasm">ADD  Ra, Ra, Rj, LSL #2</syntaxhighlight>

This results in the typical ARM program being denser than expected with fewer memory accesses; thus the pipeline is used more efficiently.
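
The effect of the shifted second operand can be modelled in C. This is a sketch under stated assumptions: it covers only immediate shift amounts of 1 to 31 (a zero amount is treated as "no shift", whereas the real encodings give zero special meanings for some shift types), and the names <code>shift_operand</code> and <code>add_shifted</code> are hypothetical.

<syntaxhighlight lang="c">
#include <stdint.h>

enum shift_op { SH_LSL, SH_LSR, SH_ASR, SH_ROR };

/* Apply a shift of `amount` bits (1..31) to the second operand `rm`. */
uint32_t shift_operand(uint32_t rm, enum shift_op op, unsigned amount) {
    amount &= 31u;
    if (amount == 0)
        return rm;                                   /* simplification, see text */
    switch (op) {
    case SH_LSL: return rm << amount;                /* logical shift left */
    case SH_LSR: return rm >> amount;                /* logical shift right */
    case SH_ASR:                                     /* arithmetic: copy the sign bit */
        return (rm >> amount) |
               ((rm & 0x80000000u) ? ~(0xFFFFFFFFu >> amount) : 0u);
    case SH_ROR: return (rm >> amount) | (rm << (32u - amount));  /* rotate right */
    }
    return rm;
}

/* Models ADD Ra, Ra, Rj, LSL #2, i.e. a += (j << 2). */
uint32_t add_shifted(uint32_t ra, uint32_t rj) {
    return ra + shift_operand(rj, SH_LSL, 2);
}
</syntaxhighlight>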

The ARM processor also has features rarely seen in other RISC architectures, such as [[program counter|PC]]-relative addressing (indeed, on the 32-bit<ref name="v8arch">{{cite web |url=https://www.arm.com/files/downloads/ARMv8_Architecture.pdf |title=ARMv8-A Technology Preview |year=2011 |access-date=31 October 2011 |first=Richard |last=Grisenthwaite |archive-url=https://web.archive.org/web/20111111161327/https://www.arm.com/files/downloads/ARMv8_Architecture.pdf |archive-date=11 November 2011}}</ref> ARM the [[program counter|PC]] is one of its 16&nbsp;registers) and pre- and post-increment addressing modes.
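
The pre- and post-increment addressing modes correspond closely to C pointer idioms. In the sketch below, the assembly forms in the comments use the usual ARM syntax for post-indexed and pre-indexed-with-writeback loads; the C function names are hypothetical.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Post-indexed: load from the current address, then advance the base
   (compare LDR Rd, [Rn], #4, or the C expression *p++). */
uint32_t load_post_inc(const uint32_t **base) {
    uint32_t v = **base;
    *base += 1;            /* one 32-bit word, i.e. 4 bytes */
    return v;
}

/* Pre-indexed with writeback: advance the base first, then load
   (compare LDR Rd, [Rn, #4]!, or the C expression *++p). */
uint32_t load_pre_inc(const uint32_t **base) {
    *base += 1;
    return **base;
}
</syntaxhighlight>

In both cases the address update is part of the load itself, so walking an array needs no separate add instruction.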

The ARM instruction set has increased over time. Some early ARM processors (before ARM7TDMI), for example, have no instruction to store a two-byte quantity.

====Pipelines and other implementation issues====
The ARM7 and earlier implementations have a three-stage [[instruction pipelining|pipeline]], the stages being fetch, decode, and execute. Higher-performance designs, such as the ARM9, have deeper pipelines; the Cortex-A8, for example, has thirteen stages. Additional implementation changes for higher performance include a faster [[adder (electronics)|adder]] and more extensive [[branch prediction]] logic. The difference between the ARM7DI and ARM7DMI cores, for example, was an improved multiplier; hence the added "M".
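
As a rough illustration of what pipeline depth means for throughput, the idealised cycle count for a run of instructions can be computed as below. This assumes a textbook pipeline with no stalls, branches or memory waits, which real cores do not achieve; the function name is hypothetical.

<syntaxhighlight lang="c">
/* Cycles to complete `n` instructions on an ideal pipeline of depth `stages`:
   `stages` cycles for the first instruction, then one more per instruction. */
unsigned long ideal_cycles(unsigned long n, unsigned stages) {
    return n == 0 ? 0 : stages + n - 1;
}
</syntaxhighlight>

Under this model ten instructions take 12 cycles on a three-stage pipeline and 22 cycles on a thirteen-stage one; the deeper pipeline pays off only through the higher clock rate it permits, and it costs more to refill after a mispredicted branch.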

====Coprocessors====
The ARM architecture (pre-Armv8) provides a non-intrusive way of extending the instruction set using "coprocessors" that can be addressed using MCR, MRC, MRRC, MCRR, and similar instructions. The coprocessor space is divided logically into 16&nbsp;coprocessors with numbers from 0 to 15, coprocessor&nbsp;15 (cp15) being reserved for some typical control functions like managing the caches and [[memory management unit|MMU]] operation on processors that have one.
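
To show how the coprocessor number and register fields are carried in an instruction word, the sketch below assembles the 32-bit ARM-state encoding of an always-executed <code>MRC</code> or <code>MCR</code>. The field positions follow the encoding as commonly documented for the 32-bit ARM instruction set and should be checked against the architecture manual before being relied upon; the function name <code>encode_mrc_mcr</code> is hypothetical.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Encode MRC (to_arm != 0: coprocessor -> ARM register) or
   MCR (to_arm == 0: ARM register -> coprocessor) with condition AL. */
uint32_t encode_mrc_mcr(int to_arm, unsigned coproc, unsigned opc1,
                        unsigned rt, unsigned crn, unsigned crm,
                        unsigned opc2) {
    return (0xEu << 28)                  /* bits 31-28: condition AL */
         | (0xEu << 24)                  /* bits 27-24: coprocessor register transfer */
         | ((opc1 & 7u) << 21)           /* bits 23-21: opcode 1 */
         | ((to_arm ? 1u : 0u) << 20)    /* bit 20: direction */
         | ((crn & 15u) << 16)           /* bits 19-16: CRn */
         | ((rt & 15u) << 12)            /* bits 15-12: ARM register */
         | ((coproc & 15u) << 8)         /* bits 11-8: coprocessor number */
         | ((opc2 & 7u) << 5)            /* bits 7-5: opcode 2 */
         | (1u << 4)                     /* bit 4: fixed at 1 */
         | (crm & 15u);                  /* bits 3-0: CRm */
}
</syntaxhighlight>

For example, <code>encode_mrc_mcr(1, 15, 0, 0, 1, 0, 0)</code> corresponds to <code>MRC p15, 0, r0, c1, c0, 0</code>, a read of a cp15 control register into <code>r0</code>.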

In ARM-based machines, peripheral devices are usually attached to the processor by mapping their physical registers into ARM memory space, into the coprocessor space, or by connecting to another device (a bus) that in turn attaches to the processor. Coprocessor accesses have lower latency, so some peripherals—for example, an XScale interrupt controller—are accessible in both ways: through memory and through coprocessors.

In other cases, chip designers only integrate hardware using the coprocessor mechanism. For example, an image processing engine might be a small ARM7TDMI core combined with a coprocessor that has specialised operations to support a specific set of HDTV transcoding primitives.