
=={{anchor|32-bit|ARMv7-A|AArch32}}32-bit architecture==
[[File:Raspberry-Pi-2-Bare-BR.jpg|alt=|thumb|An ARMv7 was used to power older versions of the popular [[Raspberry Pi]] single-board computers like this Raspberry Pi 2 from 2015.]]
[[File:Cubox.png|thumb|right|upright=1.2|An ARMv7 is also used to power the [[CuBox]] family of single-board computers.]]

{{See also|Comparison of ARMv7-A processors}}
The 32-bit ARM architecture ('''ARM32'''), such as '''ARMv7-A''' (implementing AArch32; see [[#Armv8-A|section on Armv8-A]] for more on it), was the most widely used architecture in mobile devices {{as of|2011|lc=y}}.<ref name="popular">{{cite journal |last1=Fitzpatrick |first1=J. |title=An Interview with Steve Furber |doi=10.1145/1941487.1941501 |journal=[[Communications of the ACM]] |volume=54 |issue=5 |pages=34–39 |year=2011 |doi-access=free}}</ref>

Since 1995, various versions of the ''ARM Architecture Reference Manual'' (see {{section link||External links}}) have been the primary source of documentation on the ARM processor architecture and instruction set, distinguishing interfaces that all ARM processors are required to support (such as instruction semantics) from implementation details that may vary. The architecture has evolved over time, and version seven of the architecture, ARMv7, defines three architecture "profiles":
* A-profile, the "Application" profile, implemented by 32-bit cores in the [[ARM Cortex-A|Cortex-A]] series and by some non-ARM cores
* R-profile, the "Real-time" profile, implemented by cores in the [[ARM Cortex-R|Cortex-R]] series
* M-profile, the "Microcontroller" profile, implemented by most cores in the [[ARM Cortex-M|Cortex-M]] series

Although the architecture profiles were first defined for ARMv7, ARM subsequently defined the ARMv6-M architecture (used by the Cortex [[ARM Cortex-M0|M0]]/[[ARM Cortex-M0+|M0+]]/[[ARM Cortex-M1|M1]]) as a subset of the ARMv7-M profile with fewer instructions.

===CPU modes===
Except in the M-profile, the 32-bit ARM architecture specifies several CPU modes, depending on the implemented architecture features. At any moment in time, the CPU can be in only one mode, but it can switch modes due to external events (interrupts) or programmatically.<ref name="Chdddhea">{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/Chdddhea.html |title=Processor mode |publisher=[[Arm Holdings]] |access-date=26 March 2013}}</ref>
* ''User mode:'' The only non-privileged mode.
* ''FIQ mode:'' A privileged mode that is entered whenever the processor accepts a [[fast interrupt request]].
* ''IRQ mode:'' A privileged mode that is entered whenever the processor accepts an interrupt.
* ''Supervisor (svc) mode:'' A privileged mode entered whenever the CPU is reset or when an SVC instruction is executed.
* ''Abort mode:'' A privileged mode that is entered whenever a prefetch abort or data abort exception occurs.
* ''Undefined mode:'' A privileged mode that is entered whenever an undefined instruction exception occurs.
* ''System mode (ARMv4 and above):'' The only privileged mode that is not entered by an exception. It can only be entered by executing an instruction that explicitly writes to the mode bits of the Current Program Status Register (CPSR) from another privileged mode (not from user mode).
* ''Monitor mode (ARMv6 and ARMv7 Security Extensions, ARMv8 EL3):'' A mode introduced to support the TrustZone extension in ARM cores.
* ''Hyp mode (ARMv7 Virtualization Extensions, ARMv8 EL2):'' A hypervisor mode that supports [[Popek and Goldberg virtualization requirements]] for the non-secure operation of the CPU.<ref name="2012-lpc-arm-zyngier">{{cite web |url=https://blog.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-arm-zyngier.pdf |title=KVM/ARM |access-date=14 February 2023}}</ref><ref>{{cite conference |title=Extensions to the ARMv7-A Architecture |first=David |last=Brash |conference=2010 IEEE Hot Chips 22 Symposium (HCS) |date=August 2010 |pages=1–21 |doi=10.1109/HOTCHIPS.2010.7480070 |isbn=978-1-4673-8875-7 |s2cid=46339775 }}</ref>
* ''Thread mode (ARMv6-M, ARMv7-M, ARMv8-M):'' A mode that can be configured as either privileged or unprivileged. Whether the Main Stack Pointer (MSP) or the Process Stack Pointer (PSP) is used can also be selected in the CONTROL register, which requires privileged access. This mode is designed for user tasks in an RTOS environment, but in bare-metal systems it is typically used for the super-loop.
* ''Handler mode (ARMv6-M, ARMv7-M, ARMv8-M):'' A mode dedicated to exception handling (except for reset, which is handled in Thread mode). Handler mode always uses the MSP and always runs at the privileged level.
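
In the A and R profiles, the current mode is recorded in the five low-order bits of the CPSR. As an illustration, the following C sketch maps those bits to the mode names listed above. The encodings are those given in the ''ARM Architecture Reference Manual''; the function names are hypothetical and chosen for this example only, and the M-profile Thread and Handler modes are not covered.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical helper: name of the mode selected by CPSR bits 0-4. */
const char *arm_mode_name(uint32_t cpsr)
{
    switch (cpsr & 0x1Fu) {
    case 0x10: return "usr";
    case 0x11: return "fiq";
    case 0x12: return "irq";
    case 0x13: return "svc";
    case 0x16: return "mon";
    case 0x17: return "abt";
    case 0x1A: return "hyp";
    case 0x1B: return "und";
    case 0x1F: return "sys";
    default:   return "reserved";
    }
}

/* User mode is the only non-privileged mode in the list above.
   The result is meaningful only for valid mode encodings. */
int arm_mode_is_privileged(uint32_t cpsr)
{
    return (cpsr & 0x1Fu) != 0x10u;
}
</syntaxhighlight>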

===Instruction set===
The original (and subsequent) ARM implementation was hardwired without [[microcode]], like the much simpler [[8-bit computing|8-bit]] [[MOS Technology 6502|6502]] processor used in prior Acorn microcomputers.

The 32-bit ARM architecture (and the 64-bit architecture for the most part) includes the following RISC features:
* [[Load–store architecture]].
* No support for [[data structure alignment|unaligned memory accesses]] in the original version of the architecture. ARMv6 and later, except some microcontroller versions, support unaligned accesses for half-word and single-word load/store instructions with some limitations, such as no guaranteed [[linearizability|atomicity]].<ref>{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka15414.html |title=How does the ARM Compiler support unaligned accesses? |year=2011 |access-date=5 October 2013 |url-status=dead |archive-url=https://web.archive.org/web/20131014084800/http://infocenter.arm.com/help/topic/com.arm.doc.faqs/ka15414.html |archive-date=14 October 2013}}</ref><ref>{{cite web |url=http://www.heyrick.co.uk/armwiki/Unaligned_data_access |title=Unaligned data access |access-date=5 October 2013}}</ref>
* Uniform 16 × 32-bit [[register file]] (including the program counter, stack pointer and the link register).
* Fixed instruction width of 32&nbsp;bits to ease decoding and [[instruction pipelining|pipelining]], at the cost of decreased [[code density]]. Later, the [[#Thumb|Thumb instruction set]] added 16-bit instructions and increased code density.
* Mostly single clock-cycle execution.

To compensate for the simpler design, compared with processors like the Intel 80286 and [[Motorola 68020]], some additional design features were used:
* Conditional execution of most instructions reduces branch overhead and compensates for the lack of a [[branch predictor]] in early chips.
* Arithmetic instructions alter [[Condition Code Register|condition code]]s only when desired.
* 32-bit [[barrel shifter]] can be used without performance penalty with most arithmetic instructions and address calculations.
* Powerful indexed [[addressing mode]]s.
* A [[link register]] supports fast leaf function calls.
* A simple, but fast, 2-priority-level [[interrupt]] subsystem has switched register banks.

====Arithmetic instructions====
ARM includes integer arithmetic operations for add, subtract, and multiply; some versions of the architecture also support divide operations.

ARM supports 32-bit × 32-bit multiplies with either a 32-bit result or 64-bit result, though Cortex-M0 / M0+ / M1 cores do not support 64-bit results.<ref name="M0-TRM">{{cite web |url=http://infocenter.arm.com/help/topic/com.arm.doc.ddi0432c/DDI0432C_cortex_m0_r0p0_trm.pdf |title=Cortex-M0 r0p0 Technical Reference Manual |website=Arm}}</ref> Some ARM cores also support 16-bit × 16-bit and 32-bit × 16-bit multiplies.

The divide instructions are only included in the following ARM architectures:
* Armv7-M and Armv7E-M architectures always include divide instructions.<ref>{{cite web |url=https://developer.arm.com/documentation/ddi0403/latest/ |title=ARMv7-M Architecture Reference Manual |publisher=Arm |access-date=18 July 2022}}</ref>
* Armv7-R architecture always includes divide instructions in the Thumb instruction set, but optionally in its 32-bit instruction set.<ref name="ARMv7-AR-Ref">{{cite web |url=https://developer.arm.com/documentation/ddi0406/latest |title=ARMv7-A and ARMv7-R Architecture Reference Manual; Arm Holdings |publisher=arm.com |access-date=19 January 2013}}</ref>
* Armv7-A architecture optionally includes the divide instructions. The instructions might not be implemented, or implemented only in the Thumb instruction set, or implemented in both the Thumb and ARM instruction sets, or implemented if the Virtualization Extensions are included.<ref name="ARMv7-AR-Ref"/>
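
On cores without the divide instructions, division has to be carried out in software. The following C sketch shows one simple way to do this, using shift-and-subtract (restoring) division on unsigned 32-bit values. It is an illustration only, not the routine used by any particular toolchain; the name <code>udiv32</code> is hypothetical, and division by zero is arbitrarily defined here to return a quotient of zero.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical software divide: returns n / d and stores n % d in *rem
   (if rem is not NULL). One quotient bit is produced per iteration. */
uint32_t udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0, r = 0;

    if (d == 0) {                 /* arbitrary choice for this sketch */
        if (rem) *rem = n;
        return 0;
    }
    for (int i = 31; i >= 0; i--) {
        uint32_t carry = r >> 31;             /* bit lost by the shift below */
        r = (r << 1) | ((n >> i) & 1u);       /* bring down the next bit of n */
        if (carry || r >= d) {                /* partial remainder >= divisor */
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }
    if (rem) *rem = r;
    return q;
}
</syntaxhighlight>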

====Registers====
{| class="wikitable" style="float: right; margin-left: 1.5em; margin-right: 0; margin-top: 0;"
|+ Registers across CPU modes
|-
! usr !! sys !! svc !! abt !! und !! [[Interrupt request|irq]] !! [[Fast interrupt request|fiq]]
|-
| colspan="7" style="text-align:center;"| R0
|-
| colspan="7" style="text-align:center;"| R1
|-
| colspan="7" style="text-align:center;"| R2
|-
| colspan="7" style="text-align:center;"| R3
|-
| colspan="7" style="text-align:center;"| R4
|-
| colspan="7" style="text-align:center;"| R5
|-
| colspan="7" style="text-align:center;"| R6
|-
| colspan="7" style="text-align:center;"| R7
|- align=center
| colspan=6 | R8 || R8_fiq
|- align=center
| colspan=6 | R9 || R9_fiq
|- align=center
| colspan=6 | R10 || R10_fiq
|- align=center
| colspan=6 | R11 || R11_fiq
|- align=center
| colspan=6 | R12 || R12_fiq
|- align=center
| colspan=2 | R13 || R13_svc || R13_abt || R13_und || R13_irq || R13_fiq
|- align=center
| colspan=2 | R14 || R14_svc || R14_abt || R14_und || R14_irq || R14_fiq
|-
| colspan="7" style="text-align:center;"| R15
|-
| colspan="7" style="text-align:center;"| CPSR
|- align=center
| colspan=2 | || SPSR_svc || SPSR_abt || SPSR_und || SPSR_irq || SPSR_fiq
|}

Registers R0 through R7 are the same across all CPU modes; they are never banked.

Registers R8 through R12 are the same across all CPU modes except FIQ mode.  FIQ mode has its own distinct R8 through R12 registers.

R13 and R14 are banked across all privileged CPU modes except system mode. That is, each mode that can be entered because of an exception has its own R13 and R14. These registers generally contain the stack pointer and the return address from function calls, respectively.
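
The banking rules summarised in the table can be expressed as a small lookup. The C sketch below returns the name of the physical register that a given register number refers to in a given mode. The enum and function names are hypothetical, and only the seven modes shown in the table are modelled.

<syntaxhighlight lang="c">
#include <stddef.h>
#include <stdio.h>

/* Hypothetical mode numbering, in the column order of the table. */
enum arm_mode { MODE_USR, MODE_SYS, MODE_SVC, MODE_ABT, MODE_UND, MODE_IRQ, MODE_FIQ };

static const char *const bank_suffix[] = { "", "", "_svc", "_abt", "_und", "_irq", "_fiq" };

/* Writes a name such as "R13_irq" into buf and returns buf. */
const char *physical_register(enum arm_mode mode, int reg, char *buf, size_t len)
{
    int banked = 0;

    if (reg >= 8 && reg <= 12)
        banked = (mode == MODE_FIQ);                       /* R8-R12: FIQ only */
    else if (reg == 13 || reg == 14)
        banked = (mode != MODE_USR && mode != MODE_SYS);   /* SP and LR */
    /* R0-R7 and R15 are never banked. */

    snprintf(buf, len, "R%d%s", reg, banked ? bank_suffix[mode] : "");
    return buf;
}
</syntaxhighlight>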

Aliases:
* R13 is also referred to as SP, the [[stack pointer]].
* R14 is also referred to as LR, the [[link register]].
* R15 is also referred to as PC, the [[program counter]].

The Current Program Status Register (CPSR) has the following 32&nbsp;bits.<ref>{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/I27695.html |title=ARM Information Center |access-date=10 July 2015}}</ref>
* M (bits 0–4) are the processor mode bits.
* T (bit 5) is the Thumb state bit.
* F (bit 6) is the FIQ disable bit.
* I (bit 7) is the IRQ disable bit.
* A (bit 8) is the imprecise data abort disable bit.
* E (bit 9) is the data endianness bit.
* IT (bits 10–15 and 25–26) are the if-then state bits.
* GE (bits 16–19) are the greater-than-or-equal-to bits.
* DNM (bits 20–23) are the do-not-modify bits.
* J (bit 24) is the Java state bit.
* Q (bit 27) is the sticky overflow bit.
* V (bit 28) is the overflow bit.
* C (bit 29) is the carry/borrow/extend bit.
* Z (bit 30) is the zero bit.
* N (bit 31) is the negative/less than bit.
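
The field layout above translates directly into shifts and masks. The following C sketch unpacks the fields from a 32-bit CPSR value; the struct and function names are hypothetical. The IT field is reassembled from its two parts on the assumption, taken from the ''ARM Architecture Reference Manual'', that CPSR bits 10–15 hold IT[7:2] and bits 25–26 hold IT[1:0].

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical container for the decoded CPSR fields. */
struct cpsr_fields {
    unsigned n, z, c, v;     /* condition flags, bits 31 down to 28 */
    unsigned q;              /* sticky overflow, bit 27 */
    unsigned j;              /* Java state, bit 24 */
    unsigned ge;             /* GE[3:0], bits 16-19 */
    unsigned it;             /* IT[7:0], reassembled from two fields */
    unsigned e, a, i, f, t;  /* bits 9, 8, 7, 6, 5 */
    unsigned mode;           /* M[4:0], bits 0-4 */
};

struct cpsr_fields cpsr_decode(uint32_t cpsr)
{
    struct cpsr_fields s;

    s.n = (cpsr >> 31) & 1u;
    s.z = (cpsr >> 30) & 1u;
    s.c = (cpsr >> 29) & 1u;
    s.v = (cpsr >> 28) & 1u;
    s.q = (cpsr >> 27) & 1u;
    s.j = (cpsr >> 24) & 1u;
    s.ge = (cpsr >> 16) & 0xFu;
    s.it = (((cpsr >> 10) & 0x3Fu) << 2) | ((cpsr >> 25) & 0x3u);
    s.e = (cpsr >> 9) & 1u;
    s.a = (cpsr >> 8) & 1u;
    s.i = (cpsr >> 7) & 1u;
    s.f = (cpsr >> 6) & 1u;
    s.t = (cpsr >> 5) & 1u;
    s.mode = cpsr & 0x1Fu;
    return s;
}
</syntaxhighlight>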

====Conditional execution====
Almost every ARM instruction has a conditional execution feature called [[predication (computer architecture)|predication]], which is implemented with a 4-bit condition code selector (the predicate). To allow for unconditional execution, one of the four-bit codes causes the instruction to be always executed. Most other CPU architectures only have condition codes on branch instructions.<ref>{{cite web |title=Condition Codes 1: Condition flags and codes |url=https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/condition-codes-1-condition-flags-and-codes |website=ARM Community |date=11 September 2013 |access-date=26 September 2019}}</ref>

Though the predicate takes up four of the 32&nbsp;bits in an instruction code, and thus cuts down significantly on the encoding bits available for displacements in memory access instructions, it avoids branch instructions when generating code for small [[conditional (programming)|<code>if</code> statements]]. Apart from eliminating the branch instructions themselves, this preserves the fetch/decode/execute pipeline at the cost of only one cycle per skipped instruction.
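
The effect of the predicate can be modelled in a few lines of C. The sketch below decides, from the four condition flags and the 4-bit selector in bits 28–31 of an ARM instruction, whether the instruction would execute. The selector values follow the standard ARM condition-code table; the function name is hypothetical, and the selector 1111 is simply treated as "always" here, although its meaning varies between architecture versions.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: does an ARM instruction with this encoding execute,
   given the current N, Z, C and V flags? */
bool condition_passed(uint32_t instr, bool n, bool z, bool c, bool v)
{
    switch (instr >> 28) {
    case 0x0: return z;               /* EQ: equal */
    case 0x1: return !z;              /* NE: not equal */
    case 0x2: return c;               /* CS: carry set */
    case 0x3: return !c;              /* CC: carry clear */
    case 0x4: return n;               /* MI: negative */
    case 0x5: return !n;              /* PL: positive or zero */
    case 0x6: return v;               /* VS: overflow */
    case 0x7: return !v;              /* VC: no overflow */
    case 0x8: return c && !z;         /* HI: unsigned higher */
    case 0x9: return !c || z;         /* LS: unsigned lower or same */
    case 0xA: return n == v;          /* GE: signed greater or equal */
    case 0xB: return n != v;          /* LT: signed less than */
    case 0xC: return !z && n == v;    /* GT: signed greater than */
    case 0xD: return z || n != v;     /* LE: signed less or equal */
    default:  return true;            /* AL, and 1111 in this sketch */
    }
}
</syntaxhighlight>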

An algorithm that provides a good example of conditional execution is the subtraction-based [[Euclidean algorithm]] for computing the [[greatest common divisor]].  In the [[C (programming language)|C programming language]], the algorithm can be written as:

<syntaxhighlight lang="c">
int gcd(int a, int b) {
  while (a != b)  // We enter the loop when a < b or a > b, but not when a == b
    if (a > b)   // When a > b we do this
      a -= b;
    else         // When a < b we do that (no "if (a < b)" needed since a != b is checked in while condition)
      b -= a;
  return a;
}
</syntaxhighlight>

The same algorithm can be rewritten in a way closer to target ARM [[instruction set architecture|instructions]] as:

<syntaxhighlight lang="c">
loop:
    // Compare a and b
    GT = a > b;
    LT = a < b;
    NE = a != b;

    // Perform operations based on flag results
    if (GT) a -= b;    // Subtract *only* if greater-than
    if (LT) b -= a;    // Subtract *only* if less-than
    if (NE) goto loop; // Loop *only* if compared values were not equal
    return a;
</syntaxhighlight>
and coded in [[assembly language]] as:<!-- using nasm because "gas", although correct, does not recognize all the insns. the nasm lexer just looks for all uppercase on the other hand. -->
<syntaxhighlight lang="nasm">
; assign a to register r0, b to r1
loop:   CMP    r0, r1       ; set condition "NE" if (a ≠ b),
                            ;               "GT" if (a > b),
                            ;            or "LT" if (a < b)
        SUBGT  r0, r0, r1   ; if "GT" (Greater Than), then a = a − b
        SUBLT  r1, r1, r0   ; if "LT" (Less    Than), then b = b − a
        BNE  loop           ; if "NE" (Not Equal), then loop
        BX   lr             ; return
</syntaxhighlight>
which avoids the branches around the <code>then</code> and <code>else</code> clauses. If <code>r0</code> and <code>r1</code> are equal, neither of the <code>SUB</code> instructions is executed, so no conditional branch is needed to implement the <code>while</code> check at the top of the loop; such a branch would have been needed had, for example, <code>SUBLE</code> (less than or equal) been used.

One of the ways that Thumb code provides a more dense encoding is to remove the four-bit selector from non-branch instructions.

====Other features====
Another feature of the [[instruction set]] is the ability to fold shifts and rotates into the ''data processing'' (arithmetic, logical, and register-register move) instructions, so that, for example, the statement in [[C (programming language)|C]] language:

<syntaxhighlight lang="c">a += (j << 2);</syntaxhighlight>

could be rendered as a one-word, one-cycle instruction:<ref>{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0214b/ch09s01s02.html |title=9.1.2. Instruction cycle counts}}</ref>

<syntaxhighlight lang="nasm">ADD  Ra, Ra, Rj, LSL #2</syntaxhighlight>

This results in the typical ARM program being denser than expected with fewer memory accesses; thus the pipeline is used more efficiently.
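
The folded shift can be modelled in C as a separate step applied to the second operand before the arithmetic. The sketch below implements the four shift types for shift amounts of 1 to 31; the special encodings that the architecture assigns to other amounts, such as RRX, are deliberately left out, and any amount outside the modelled range returns the value unchanged. All names are hypothetical.

<syntaxhighlight lang="c">
#include <stdint.h>

enum shift_type { LSL, LSR, ASR, ROR };

/* Hypothetical model of the shifter operand for amounts 1..31. */
uint32_t shift_operand(uint32_t value, enum shift_type type, unsigned amount)
{
    if (amount == 0 || amount > 31)
        return value;                      /* outside the modelled range */

    switch (type) {
    case LSL: return value << amount;
    case LSR: return value >> amount;
    case ASR: {
        /* Replicate the sign bit into the vacated high bits. */
        uint32_t fill = (value & 0x80000000u)
                      ? (uint32_t)~(0xFFFFFFFFu >> amount) : 0u;
        return (value >> amount) | fill;
    }
    case ROR: return (value >> amount) | (value << (32u - amount));
    }
    return value;
}

/* Model of: ADD Ra, Ra, Rj, LSL #2 */
uint32_t add_shifted(uint32_t ra, uint32_t rj)
{
    return ra + shift_operand(rj, LSL, 2);
}
</syntaxhighlight>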

The ARM processor also has features rarely seen in other RISC architectures, such as [[program counter|PC]]-relative addressing (indeed, on the 32-bit<ref name="v8arch">{{cite web |url=https://www.arm.com/files/downloads/ARMv8_Architecture.pdf |title=ARMv8-A Technology Preview |year=2011 |access-date=31 October 2011 |first=Richard |last=Grisenthwaite |archive-url=https://web.archive.org/web/20111111161327/https://www.arm.com/files/downloads/ARMv8_Architecture.pdf |archive-date=11 November 2011}}</ref> ARM the [[program counter|PC]] is one of its 16&nbsp;registers) and pre- and post-increment addressing modes.

The ARM instruction set has increased over time. Some early ARM processors (before ARM7TDMI), for example, have no instruction to store a two-byte quantity.

====Pipelines and other implementation issues====
The ARM7 and earlier implementations have a three-stage [[instruction pipelining|pipeline]], the stages being fetch, decode, and execute. Higher-performance designs, such as the ARM9, have deeper pipelines; the Cortex-A8, for example, has thirteen stages. Additional implementation changes for higher performance include a faster [[adder (electronics)|adder]] and more extensive [[branch prediction]] logic. The difference between the ARM7DI and ARM7DMI cores, for example, was an improved multiplier; hence the added "M".

====Coprocessors====
The ARM architecture (pre-Armv8) provides a non-intrusive way of extending the instruction set using "coprocessors" that can be addressed using MCR, MRC, MRRC, MCRR, and similar instructions. The coprocessor space is divided logically into 16&nbsp;coprocessors with numbers from 0 to 15, coprocessor&nbsp;15 (cp15) being reserved for some typical control functions like managing the caches and [[memory management unit|MMU]] operation on processors that have one.
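
As an illustration of how a coprocessor register is addressed, the C sketch below assembles the 32-bit ARM-state encoding of an MRC or MCR instruction from its fields, following the layout given in the ''ARM Architecture Reference Manual''. The function and parameter names are hypothetical.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical encoder. to_arm is 1 for MRC (coprocessor to ARM register)
   and 0 for MCR (ARM register to coprocessor). */
uint32_t encode_mrc_mcr(int to_arm, unsigned cond, unsigned coproc,
                        unsigned opc1, unsigned rt,
                        unsigned crn, unsigned crm, unsigned opc2)
{
    return ((uint32_t)(cond   & 0xFu) << 28)   /* condition selector */
         | ((uint32_t)0xEu            << 24)   /* fixed bits 1110 */
         | ((uint32_t)(opc1   & 0x7u) << 21)
         | ((uint32_t)(to_arm ? 1u : 0u) << 20)
         | ((uint32_t)(crn    & 0xFu) << 16)
         | ((uint32_t)(rt     & 0xFu) << 12)
         | ((uint32_t)(coproc & 0xFu) << 8)    /* coprocessor number 0-15 */
         | ((uint32_t)(opc2   & 0x7u) << 5)
         | ((uint32_t)1u              << 4)    /* fixed bit */
         |  (uint32_t)(crm    & 0xFu);
}
</syntaxhighlight>

With the "always" condition selector (1110), <code>encode_mrc_mcr(1, 0xE, 15, 0, 0, 0, 0, 0)</code> yields <code>0xEE100F10</code>, the encoding of <code>MRC p15, 0, r0, c0, c0, 0</code>.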

In ARM-based machines, peripheral devices are usually attached to the processor by mapping their physical registers into ARM memory space, into the coprocessor space, or by connecting to another device (a bus) that in turn attaches to the processor. Coprocessor accesses have lower latency, so some peripherals—for example, an XScale interrupt controller—are accessible in both ways: through memory and through coprocessors.

In other cases, chip designers only integrate hardware using the coprocessor mechanism. For example, an image processing engine might be a small ARM7TDMI core combined with a coprocessor that has specialised operations to support a specific set of HDTV transcoding primitives.

==={{Anchor|CoreSight}}Debugging===
{{More citations needed section|date=March 2011}}
All modern ARM processors include hardware debugging facilities, allowing software debuggers to perform operations such as halting, stepping, and breakpointing of code starting from reset. These facilities are built using [[JTAG]] support, though some newer cores optionally support ARM's own two-wire "SWD" protocol. In ARM7TDMI cores, the "D" represented JTAG debug support, and the "I" represented presence of an "EmbeddedICE" debug module. For ARM7 and ARM9 core generations, EmbeddedICE over JTAG was a de facto debug standard, though not architecturally guaranteed.

The ARMv7 architecture defines basic debug facilities at an architectural level. These include breakpoints, watchpoints and instruction execution in a "Debug Mode"; similar facilities were also available with EmbeddedICE. Both "halt mode" and "monitor" mode debugging are supported. The actual transport mechanism used to access the debug facilities is not architecturally specified, but implementations generally include JTAG support.

There is a separate ARM "CoreSight" debug architecture, which is not architecturally required by ARMv7 processors.

====Debug Access Port====
The Debug Access Port (DAP) is an implementation of an ARM Debug Interface.<ref>
{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0314h/Babdadfc.html |title=CoreSight Components: About the Debug Access Port}}
</ref>
There are two different supported implementations, the Serial Wire [[JTAG]] Debug Port (SWJ-DP) and the Serial Wire Debug Port (SW-DP).<ref>
{{cite web|url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0179b/ar01s01s03.html |title=The Cortex-M3: Debug Access Port (DAP)}}
</ref>
CMSIS-DAP is a standard interface that describes how various debugging software on a host PC can communicate over USB to firmware running on a hardware debugger, which in turn talks over SWD or JTAG to a CoreSight-enabled ARM Cortex CPU.<ref>
{{cite web |first=Mike |last=Anderson
|url=https://elinux.org/images/7/7f/Manderson5.pdf |title=Understanding ARM HW Debug Options}}
</ref><ref>
{{cite web |url=http://www.keil.com/support/man/docs/dapdebug/dapdebug_introduction.htm |title=CMSIS-DAP Debugger User's Guide}}
</ref><ref>
{{cite web |url=https://www.nimblemachines.com/cmsis-dap/ |title=CMSIS-DAP}}
</ref>

===DSP enhancement instructions===
To improve the ARM architecture for [[digital signal processing]] and multimedia applications, DSP instructions were added to the instruction set.<ref>{{cite web |url=https://www.arm.com/products/CPUs/cpu-arch-DSP.html |title=ARM DSP Instruction Set Extensions |website=arm.com |access-date=18 April 2009 |archive-url=https://web.archive.org/web/20090414011837/https://www.arm.com/products/CPUs/cpu-arch-DSP.html |archive-date=14 April 2009 |url-status=live}}</ref> These are signified by an "E" in the name of the ARMv5TE and ARMv5TEJ architectures. E-variants also imply T, D, M, and I.

The new instructions are common in [[digital signal processor]] (DSP) architectures. They include variations on signed [[multiply–accumulate operation|multiply–accumulate]], [[saturation arithmetic|saturated add and subtract]], and [[count leading zeros]].
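
The behaviour of two of these operations can be sketched in portable C. The functions below model a signed saturating 32-bit add and a count of leading zeros; they illustrate the arithmetic only, use hypothetical names, and say nothing about how the instructions are implemented in hardware. The sticky Q flag that a saturating instruction sets in the CPSR is modelled as an optional output parameter.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical model of a saturating add: the result is clamped to the
   int32_t range and *q_flag (if not NULL) is set when clamping occurs. */
int32_t saturating_add(int32_t a, int32_t b, int *q_flag)
{
    int64_t sum = (int64_t)a + (int64_t)b;

    if (sum > INT32_MAX) {
        if (q_flag) *q_flag = 1;
        return INT32_MAX;
    }
    if (sum < INT32_MIN) {
        if (q_flag) *q_flag = 1;
        return INT32_MIN;
    }
    return (int32_t)sum;
}

/* Hypothetical model of count leading zeros; returns 32 for an input of 0. */
unsigned count_leading_zeros(uint32_t x)
{
    unsigned n = 0;

    if (x == 0)
        return 32;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}
</syntaxhighlight>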

First introduced in 1999, this extension of the core instruction set contrasted with ARM's earlier DSP coprocessor known as Piccolo, which employed a distinct, incompatible instruction set whose execution involved a separate program counter.<ref name="eetimes19990503">{{ cite magazine | url=https://www.eetimes.com/epf-arc-arm-add-dsp-extensions-to-their-risc-cores/ | title=EPF: ARC, ARM add DSP extensions to their RISC cores | website=EE Times | last1=Clarke | first1=Peter | date=3 May 1999 | access-date=15 March 2024 }}</ref> Piccolo instructions employed a distinct register file of sixteen 32-bit registers, with some instructions combining registers for use as 48-bit accumulators and other instructions addressing 16-bit half-registers. Some instructions were able to operate on two such 16-bit values in parallel. Communication with the Piccolo register file involved ''load to Piccolo'' and ''store from Piccolo'' coprocessor instructions via two buffers of eight 32-bit entries. Described as reminiscent of other approaches, notably Hitachi's SH-DSP and Motorola's 68356, Piccolo did not employ dedicated local memory and relied on the bandwidth of the ARM core for DSP operand retrieval, impacting concurrent performance.<ref name="microprocessorreport19961118_piccolo">{{ cite magazine | url=https://www.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/101504.PDF | title=ARM Tunes Piccolo for DSP Performance | magazine=Microprocessor Report | last1=Turley | first1=Jim | date=18 November 1996 | access-date=15 March 2024 }}</ref> Piccolo's distinct instruction set also proved not to be a "good compiler target".<ref name="eetimes19990503"/>

===SIMD extensions for multimedia===
Introduced in the ARMv6 architecture, this was a precursor to Advanced SIMD, also named [[#Advanced SIMD (Neon)|Neon]].<ref>{{cite web |url=https://www.arm.com/products/processors/technologies/dsp-simd.php |title=DSP & SIMD |access-date=10 July 2015}}</ref>

===Jazelle===
<!-- Section header used in redirects -->
{{Main|Jazelle}}
Jazelle DBX (Direct Bytecode eXecution) is a technique that allows [[Java bytecode]] to be executed directly in the ARM architecture as a third execution state (and instruction set) alongside the existing ARM and Thumb-mode. Support for this state is signified by the "J" in the ARMv5TEJ architecture, and in ARM9EJ-S and ARM7EJ-S core names. Support for this state is required starting in ARMv6 (except for the ARMv7-M profile), though newer cores only include a trivial implementation that provides no hardware acceleration.

==={{anchor|THUMB}}Thumb===
To improve compiled code density, processors since the ARM7TDMI (released in 1994<ref>{{cite web |url=http://www.atmel.com/Images/DDI0029G_7TDMI_R3_trm.pdf |title=ARM7TDMI Technical Reference Manual |page=ii}}</ref>) have featured the ''Thumb'' [[compressed instruction set]], which has its own state. (The "T" in "TDMI" indicates the Thumb feature.) When in this state, the processor executes the Thumb instruction set, a compact 16-bit encoding for a subset of the ARM instruction set.<ref>{{cite book |last=Jaggar |first=Dave |title=ARM Architecture Reference Manual |year=1996 |publisher=Prentice Hall |isbn=978-0-13-736299-8 |pages=6–1}}</ref> Most of the Thumb instructions are directly mapped to normal ARM instructions. The space saving comes from making some of the instruction operands implicit and limiting the number of possibilities compared to the ARM instructions executed in the ARM instruction set state.

In Thumb, the 16-bit opcodes have less functionality. For example, only branches can be conditional, and many opcodes are restricted to accessing only half of all of the CPU's general-purpose registers. The shorter opcodes give improved code density overall, even though some operations require extra instructions. In situations where the memory port or bus width is constrained to less than 32&nbsp;bits, the shorter Thumb opcodes allow increased performance compared with 32-bit ARM code, as less program code may need to be loaded into the processor over the constrained memory bandwidth.
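
The register restriction follows directly from the encoding. As a sketch, the C function below assembles the 16-bit Thumb encoding of <code>ADDS Rd, Rn, Rm</code>, in which each register field is only three bits wide and can therefore name only r0–r7. The bit layout is the one given for this instruction in the ''ARM Architecture Reference Manual''; the function name is hypothetical, and the return value 0 is used here as an arbitrary error indicator.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical encoder for the 16-bit Thumb ADDS (register) instruction.
   Layout: 0001100 Rm[2:0] Rn[2:0] Rd[2:0]. Returns 0 if a register
   outside r0-r7 is requested, since it cannot be encoded. */
uint16_t thumb_adds_reg(unsigned rd, unsigned rn, unsigned rm)
{
    if (rd > 7 || rn > 7 || rm > 7)
        return 0;
    return (uint16_t)(0x1800u | (rm << 6) | (rn << 3) | rd);
}
</syntaxhighlight>

For example, <code>thumb_adds_reg(0, 1, 2)</code> yields <code>0x1888</code>, the encoding of <code>ADDS r0, r1, r2</code>.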

Unlike processor architectures with variable-length (16- or 32-bit) instructions, such as the Cray-1 and [[Hitachi]] [[SuperH]], the ARM and Thumb instruction sets exist independently of each other. Embedded hardware, such as the [[Game Boy Advance]], typically has a small amount of RAM accessible with a full 32-bit datapath; the majority is accessed via a 16-bit or narrower secondary datapath. In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using full 32-bit ARM instructions, placing these wider instructions into the memory accessible over the 32-bit bus.

The first processor with a Thumb [[instruction decoder]] was the ARM7TDMI. All processors supporting 32-bit instruction sets, starting with ARM9, and including XScale, have included a Thumb instruction decoder. The Thumb instruction set includes instructions adopted from the Hitachi [[SuperH]] (1992), which was licensed by ARM.<ref name="lwn">{{cite web |url=http://lwn.net/Articles/647636 |title=Resurrecting the SuperH architecture |author=Nathan Willis |date=10 June 2015 |publisher=[[LWN.net]]}}</ref> ARM's smallest processor families (Cortex M0 and M1) implement only the 16-bit Thumb instruction set for maximum performance in the lowest-cost applications. ARM processors that do not support 32-bit addressing also omit Thumb.

===Thumb-2===
<!-- Section header used in redirects -->
''Thumb-2'' technology was introduced in the ''ARM1156&nbsp;core'', announced in 2003. Thumb-2 extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to give the instruction set more breadth, thus producing a variable-length instruction set. A stated aim for Thumb-2 was to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.

Thumb-2 extends the Thumb instruction set with bit-field manipulation, table branches and conditional execution. At the same time, the ARM instruction set was extended to maintain equivalent functionality in both instruction sets. A new "Unified Assembly Language" (UAL) supports generation of either Thumb or ARM instructions from the same source code; versions of Thumb seen on ARMv7 processors are essentially as capable as ARM code (including the ability to write interrupt handlers). This requires a bit of care, and use of a new "IT" (if-then) instruction, which permits up to four successive instructions to execute based on a tested condition, or on its inverse. When compiling into ARM code, this is ignored, but when compiling into Thumb it generates an actual instruction. For example:

<syntaxhighlight lang="nasm">
; if (r0 == r1)
CMP r0, r1
ITE EQ        ; ARM: no code ... Thumb: IT instruction
; then r0 = r2;
MOVEQ r0, r2  ; ARM: conditional; Thumb: condition via ITE 'T' (then)
; else r0 = r3;
MOVNE r0, r3  ; ARM: conditional; Thumb: condition via ITE 'E' (else)
; recall that the Thumb MOV instruction has no bits to encode "EQ" or "NE".
</syntaxhighlight>
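
How the "then" and "else" pattern is packed into the IT instruction can be sketched in C. The function below builds the 16-bit Thumb encoding from the first condition and the letters that follow "IT" in the mnemonic, using the rule from the ''ARM Architecture Reference Manual'': each following instruction contributes one mask bit, equal to the low bit of the condition for "T" and its inverse for "E", and a single 1 bit terminates the mask. The function name and its string-based interface are hypothetical.

<syntaxhighlight lang="c">
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encoder. firstcond is the 4-bit condition (0x0 = EQ,
   0x1 = NE, ...); suffix holds the letters after "IT", e.g. "E" for ITE
   or "TT" for ITTT. Returns 0 for input that cannot be encoded. */
uint16_t encode_it(unsigned firstcond, const char *suffix)
{
    size_t n = strlen(suffix);
    unsigned c0 = firstcond & 1u;     /* low bit of the condition */
    unsigned mask = 0;

    if (firstcond > 0xEu || n > 3)
        return 0;
    for (size_t i = 0; i < n; i++) {
        unsigned bit;
        if (suffix[i] == 'T')
            bit = c0;
        else if (suffix[i] == 'E')
            bit = c0 ^ 1u;
        else
            return 0;
        mask |= bit << (3 - i);       /* second instruction uses mask bit 3 */
    }
    mask |= 1u << (3 - n);            /* terminating 1 bit */
    return (uint16_t)(0xBF00u | (firstcond << 4) | mask);
}
</syntaxhighlight>

For the example above, <code>encode_it(0x0, "E")</code> yields <code>0xBF0C</code>, the encoding of <code>ITE EQ</code>.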

All ARMv7 chips support the Thumb instruction set. All chips in the Cortex-A series that support ARMv7, all Cortex-R series, and all ARM11 series support both "ARM instruction set state" and "Thumb instruction set state", while chips in the [[ARM Cortex-M|Cortex-M]] series support only the Thumb instruction set.<ref>{{cite web |url=https://www.arm.com/products/CPUs/architecture.html |title=ARM Processor Instruction Set Architecture |publisher=ARM.com |access-date=18 April 2009 |archive-url=https://web.archive.org/web/20090415171228/http://arm.com/products/CPUs/architecture.html |archive-date=15 April 2009 |url-status=live}}</ref><ref>{{cite web |url=http://www.linuxdevices.com/news/NS7814673959.html |title=ARM aims son of Thumb at uCs, ASSPs, SoCs |publisher=Linuxdevices.com |access-date=18 April 2009 |archive-url=https://archive.today/20121209133741/http://www.linuxfordevices.com/c/a/News/ARM-aims-son-of-Thumb-at-uCs-ASSPs-SoCs/ |archive-date=9 December 2012 |url-status=dead}}</ref><ref>{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/I1005458.html |title=ARM Information Center |publisher=Infocenter.arm.com |access-date=18 April 2009}}</ref>

==={{anchor|ThumbEE}}Thumb Execution Environment (ThumbEE)===
''ThumbEE'' (erroneously called ''Thumb-2EE'' in some ARM documentation), which was marketed as Jazelle RCT<ref>{{cite web |url=https://www.arm.com/products/processors/technologies/jazelle.php |archive-url=https://web.archive.org/web/20170602084751/https://www.arm.com/products/processors/technologies/jazelle.php |title=Jazelle |publisher=ARM Ltd. |url-status=dead |archive-date=2 June 2017}}</ref> (Runtime Compilation Target), was announced in 2005 and deprecated in 2011. It first appeared in the ''Cortex-A8'' processor. ThumbEE is a fourth instruction set state, making small changes to the Thumb-2 extended instruction set. These changes make the instruction set particularly suited to code generated at runtime (e.g. by [[just-in-time compilation|JIT compilation]]) in managed ''Execution Environments''. ThumbEE is a target for languages such as [[Java (programming language)|Java]], [[C Sharp (programming language)|C#]], [[Perl]], and [[Python (programming language)|Python]], and allows [[Just-in-time compilation|JIT compilers]] to output smaller compiled code without reducing performance.{{citation needed|date=June 2020}}

New features provided by ThumbEE include automatic null pointer checks on every load and store instruction, an instruction to perform an array bounds check, and special instructions that call a handler. In addition, because it utilises Thumb-2 technology, ThumbEE provides access to registers r8–r15 (where the Jazelle/DBX Java VM state is held).<ref>{{cite web |url=https://www.arm.com/miscPDFs/10069.pdf |title=ARM strengthens Java compilers: New 16-Bit Thumb-2EE Instructions Conserve System Memory |author=Tom R. Halfhill |year=2005 |archive-url=https://web.archive.org/web/20071005161753/https://www.arm.com/miscPDFs/10069.pdf |archive-date=5 October 2007}}</ref> Handlers are small sections of frequently called code, commonly used to implement high level languages, such as allocating memory for a new object. These changes come from repurposing a handful of opcodes, and knowing the core is in the new ThumbEE state.

On 23 November 2011, Arm deprecated any use of the ThumbEE instruction set,<ref>ARM Architecture Reference Manual, Armv7-A and Armv7-R edition, issue C.b, Section A2.10, 25 July 2012.</ref> and Armv8 removes support for ThumbEE.

==={{anchor|VFP}}Floating-point (VFP)===
''VFP'' (Vector Floating Point) technology is a [[floating-point unit]] (FPU) coprocessor extension to the ARM architecture<ref>{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473e/CHDHAGGE.html |title=ARM Compiler toolchain Using the Assembler – VFP coprocessor |publisher=ARM.com |access-date=20 August 2014}}</ref> (Armv8 implements floating point differently, because it does not define coprocessors). It provides low-cost [[single-precision floating-point format|single-precision]] and [[double-precision floating-point format|double-precision floating-point]] computation fully compliant with the ''[[IEEE 754|ANSI/IEEE Std 754-1985 Standard for Binary Floating-Point Arithmetic]]''. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture was intended to support execution of short "vector mode" instructions, but these operated on each vector element sequentially and thus did not offer the performance of true [[single instruction, multiple data]] (SIMD) vector parallelism. This vector mode was therefore removed shortly after its introduction,<ref>{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/Chdehgeh.html |title=VFP directives and vector notation |publisher=ARM.com |access-date=21 November 2011}}</ref> to be replaced with the much more powerful Advanced SIMD, also named [[#Advanced SIMD (Neon)|Neon]].

Some devices such as the ARM Cortex-A8 have a cut-down ''VFPLite'' module instead of a full VFP module, and require roughly ten times more clock cycles per floating-point operation.<ref name="cortex_a9">{{cite web |url=https://www.shervinemami.info/armAssembly.html#cortex-a9 |title=Differences between ARM Cortex-A8 and Cortex-A9 |publisher=Shervin Emami |access-date=21 November 2011}}</ref> Architectures before Armv8 implemented floating-point/SIMD through the coprocessor interface. Other floating-point and/or SIMD units found in ARM-based processors that use the coprocessor interface include [[Floating Point Accelerator|FPA]], FPE, and [[MMX (instruction set)|iwMMXt]]; some of these were implemented in software by trapping but could have been implemented in hardware. They provide some of the same functionality as VFP but are not [[opcode]]-compatible with it. FPA10 also provides [[extended precision]], but implements correct rounding (required by IEEE&nbsp;754) only in single precision.<ref>{{cite web |url=http://chrisacorns.computinghistory.org.uk/docs/GECPlessey/GECPlessey_FPA10DataSheet.pdf |title=FPA10 Data Sheet |author=<!--Not stated--> |date=11 June 1993 |website=chrisacorns.computinghistory.org.uk |publisher=GEC Plessey Semiconductors |access-date=26 November 2020 |quote=In relation to IEEE 754-1985, the FPA achieves conformance in single-precision arithmetic [...] Occasionally, double- and extended-precision multiplications may be produced with an error of 1 or 2 units in the least significant place of the mantissa.}}</ref>

; VFPv1: Obsolete
; VFPv2: An optional extension to the ARM instruction set in the ARMv5TE, ARMv5TEJ and ARMv6 architectures. VFPv2 has 16 64-bit FPU registers.
; VFPv3 or VFPv3-D32: Implemented on most Cortex-A8 and A9 ARMv7 processors. It is backward-compatible with VFPv2, except that it cannot trap floating-point exceptions. VFPv3 has 32 64-bit FPU registers as standard, adds VCVT instructions to convert between scalar, float and double, and adds an immediate mode to VMOV so that constants can be loaded into FPU registers.
; VFPv3-D16: As above, but with only 16 64-bit FPU registers. Implemented on Cortex-R4 and R5 processors and the [[Tegra|Tegra 2]] (Cortex-A9).
; VFPv3-F16: Uncommon; it supports [[half-precision floating-point format|IEEE754-2008 half-precision (16-bit) floating point]] as a storage format.
; VFPv4 or VFPv4-D32: Implemented on Cortex-A12 and A15 ARMv7 processors; Cortex-A7 optionally has VFPv4-D32 in the case of an FPU with Neon.<ref name="VFPv4.A7">{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0464f/BABDAHCE.html |title=Cortex-A7 MPCore Technical Reference Manual – 1.3 Features |publisher=ARM |access-date=11 July 2014}}</ref> VFPv4 has 32 64-bit FPU registers as standard, and adds both half-precision support as a storage format and [[fused multiply–add|fused multiply-accumulate]] instructions to the features of VFPv3.
; VFPv4-D16: As above, but it has only 16 64-bit FPU registers. Implemented on Cortex-A5 and A7 processors in the case of an FPU without Neon.<ref name="VFPv4.A7"/>
; VFPv5-D16-M: Implemented on Cortex-M7 when the single- and double-precision floating-point core option is present.
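
Half precision "as a storage format" can be read as follows: values occupy 16 bits in memory and are widened to a larger format before arithmetic. A minimal Python sketch of that round trip, assuming the standard library's <code>struct</code> format code <code>e</code> (IEEE&nbsp;754 binary16); the helper names <code>to_half_bits</code> and <code>from_half_bits</code> are hypothetical and do not correspond to any ARM instruction:

```python
import struct


def to_half_bits(value):
    """Round a Python float to half precision and return its 16-bit pattern."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    return bits


def from_half_bits(bits):
    """Widen a 16-bit half-precision pattern back to a Python float."""
    (value,) = struct.unpack("<e", struct.pack("<H", bits))
    return value


stored = to_half_bits(1.5)         # two bytes in memory: 0x3E00
widened = from_half_bits(stored)   # arithmetic would be done on the widened value
```

Values such as 1.5 survive the round trip exactly, whereas a value such as 0.1 is rounded when stored, which is the trade-off of the smaller format.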

In [[Debian]] [[Linux]] and derivatives such as [[Ubuntu]] and [[Linux Mint]], '''armhf''' ('''ARM hard float''') refers to the ARMv7 architecture including the additional VFPv3-D16 floating-point hardware extension (and Thumb-2) described above. Software packages and cross-compiler tools use the armhf vs. arm/armel suffixes to differentiate.<ref>{{cite web |url=https://wiki.debian.org/ArmHardFloatPort |title=ArmHardFloatPort – Debian Wiki |publisher=Wiki.debian.org |date=20 August 2012 |access-date=8 January 2014}}</ref>

==={{anchor|NEON}}{{anchor|Advanced SIMD (NEON)}}Advanced SIMD (Neon)===
The ''Advanced SIMD'' extension (also known as ''Neon'' or ''MPE'', for Media Processing Engine) is a combined 64- and [[128-bit computing|128-bit]] SIMD instruction set that provides standardised acceleration for media and signal processing applications. Neon is included in all Cortex-A8 devices, but is optional in Cortex-A9 devices.<ref>{{cite web |url=https://www.arm.com/products/processors/cortex-a/cortex-a9.php |title=Cortex-A9 Processor |website=arm.com |access-date=21 November 2011}}</ref> Neon can execute MP3 audio decoding on CPUs running at 10&nbsp;MHz, and can run the [[GSM]] [[adaptive Multi-Rate audio codec|adaptive multi-rate]] (AMR) speech codec at 13&nbsp;MHz. It features a comprehensive instruction set, separate register files, and independent execution hardware.<ref>{{cite web |url=http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409f/Chdceejc.html |title=About the Cortex-A9 NEON MPE |website=arm.com |access-date=21 November 2011}}</ref> Neon supports 8-, 16-, 32-, and 64-bit integer and single-precision (32-bit) floating-point data and SIMD operations for handling audio and video processing as well as graphics and gaming processing. In Neon, a single SIMD instruction can perform up to 16&nbsp;operations at the same time. The Neon hardware shares the same floating-point registers as used in VFP. Devices such as the ARM Cortex-A8 and Cortex-A9 support 128-bit vectors, but execute 64&nbsp;bits at a time,<ref name="cortex_a9"/> whereas newer Cortex-A15 devices can execute 128&nbsp;bits at a time.<ref>{{cite web |url=https://patents.google.com/patent/US20050125476A1/en |title=US20050125476A1}}</ref><ref>{{cite web |url=https://patents.google.com/patent/US20080141004A1/en |title=US20080141004A1}}</ref>
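
The figure of 16 simultaneous operations follows from dividing the widest vector (128&nbsp;bits) by the narrowest element (8&nbsp;bits). The Python sketch below works through that arithmetic and models one lane-wise wrapping integer add; the function names <code>lanes</code> and <code>simd_add</code> are hypothetical, and the model says nothing about how the hardware schedules the work.

```python
VECTOR_BITS = (64, 128)          # vector widths named in the text
ELEMENT_BITS = (8, 16, 32, 64)   # element sizes named in the text


def lanes(vector_bits, element_bits):
    """Number of elements that fit in one vector."""
    return vector_bits // element_bits


def simd_add(a, b, element_bits):
    """Lane-wise add that wraps around at the element width."""
    mask = (1 << element_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


# Lane counts for every width/element combination, e.g. (128, 8) -> 16.
lane_table = {(v, e): lanes(v, e) for v in VECTOR_BITS for e in ELEMENT_BITS}
```

For example, adding sixteen 8-bit lanes holding 250 to sixteen lanes holding 10 yields 4 in every lane under this wrapping model, since 260 does not fit in 8 bits.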

A quirk of Neon in Armv7 devices is that it flushes all [[subnormal number]]s to zero, and as a result the [[GNU Compiler Collection|GCC]] compiler will not use it unless {{code|-funsafe-math-optimizations}}, which allows losing denormals, is turned on. "Enhanced" Neon defined since Armv8 does not have this quirk, but as of {{nowrap|GCC 8.2}} the same flag is still required to enable Neon instructions.<ref>{{cite web |title=ARM Options |url=https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html |website=GNU Compiler Collection Manual |access-date=20 September 2019}}</ref> On the other hand, GCC does consider Neon safe on AArch64 for Armv8.
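
Flushing to zero means that any nonzero single-precision value smaller in magnitude than the smallest normal number (2<sup>−126</sup>) is replaced by zero. A minimal value-level sketch in Python, assuming round-tripping through <code>struct</code> format <code>f</code> as a stand-in for single precision; <code>flush_to_zero</code> is a hypothetical helper and does not describe how the hardware or GCC implements the behaviour:

```python
import math
import struct

MIN_NORMAL_F32 = 2.0 ** -126   # smallest positive normal single-precision value


def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush_to_zero(x):
    """Replace a subnormal single-precision value with a zero of the same sign."""
    x = to_f32(x)
    if x != 0.0 and abs(x) < MIN_NORMAL_F32:
        return math.copysign(0.0, x)
    return x
```

Under IEEE&nbsp;754 arithmetic a value such as 1e-40 is representable as a subnormal single-precision number; under this model it becomes zero, which is why a compiler that must preserve subnormals cannot substitute such a unit without being told that losing them is acceptable.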

ProjectNe10 is ARM's first open-source project from its inception (the company had earlier acquired an older project, now named [[Mbed TLS]]<!--the sub-project https://github.com/ARMmbed/mbed-crypto seems to have only x86 assembly-->). The Ne10 library is a set of common, useful functions written in both Neon and C (for compatibility). The library was created to allow developers to use Neon optimisations without learning Neon, but it also serves as a set of highly optimised Neon intrinsic and assembly code examples for common DSP, arithmetic, and image processing routines. The source code is available on GitHub.<ref>{{GitHub|projectNe10/Ne10|Ne10: An open optimized software library project for the ARM Architecture}}</ref>

==={{Anchor|ARM Helium technology}}ARM Helium technology===
Helium is the M-Profile Vector Extension (MVE). It adds more than 150 scalar and vector instructions.<ref>{{cite web |url=https://www.arm.com/-/media/Files/pdf/white-paper/armv8.1-m-architecture.pdf |title=Introduction to ARMv8.1-M architecture |author=Joseph Yiu |access-date=18 July 2022}}</ref>

===Security extensions===

===={{anchor|TrustZone}}TrustZone (for Cortex-A profile)====
The Security Extensions, marketed as TrustZone Technology, are in ARMv6KZ and later application profile architectures. They provide a low-cost alternative to adding another dedicated security core to an SoC, by providing two virtual processors backed by hardware-based access control. This lets the application core switch between two states, referred to as ''worlds'' (to reduce confusion with other names for capability domains), to prevent information leaking from the more trusted world to the less trusted world.<ref>{{cite web |url=https://developer.arm.com/documentation/100935/0100/The-TrustZone-hardware-architecture- |title=The TrustZone hardware architecture |publisher=[[Arm Holdings|ARM Developer]]}}</ref> This world switch is generally orthogonal to all other capabilities of the processor, thus each world can operate independently of the other while using the same core. Memory and peripherals are then made aware of the operating world of the core and may use this to provide access control to secrets and code on the device.<ref>{{cite web |url=https://genode.org/documentation/articles/trustzone |title=Genode – An Exploration of ARM TrustZone Technology |access-date=10 July 2015}}</ref>

Typically, a rich operating system is run in the less trusted world, with smaller security-specialised code in the more trusted world, aiming to reduce the [[attack surface]]. Typical applications include [[digital rights management|DRM]] functionality for controlling the use of media on ARM-based devices,<ref>{{cite press release |url=https://news.thomasnet.com/companystory/476887 |title=ARM Announces Availability of Mobile Consumer DRM Software Solutions Based on ARM TrustZone Technology |publisher=News.thomasnet.com |access-date=18 April 2009}}</ref> and preventing any unapproved use of the device.

In practice, since the specific implementation details of proprietary TrustZone implementations have not been publicly disclosed for review, it is unclear what level of assurance is provided for a given [[threat model]], but such implementations are not immune from attack.<ref>{{cite web |url=https://bits-please.blogspot.com/2015/08/full-trustzone-exploit-for-msm8974.html |title=Bits, Please!: Full TrustZone exploit for MSM8974 |last=Laginimaineb |date=8 October 2015 |website=Bits, Please! |access-date=3 May 2016}}</ref><ref>{{cite web |url=https://www.blackhat.com/docs/us-15/materials/us-15-Shen-Attacking-Your-Trusted-Core-Exploiting-Trustzone-On-Android.pdf |title=Attacking your 'Trusted Core' Exploiting TrustZone on Android |author=Di Shen |publisher=[[Black Hat Briefings]] |access-date=3 May 2016}}</ref>

Open Virtualization<ref>{{cite web |url=http://www.openvirtualization.org |title=ARM TrustZone and ARM Hypervisor Open Source Software |publisher=Open Virtualization |access-date=14 June 2013 |archive-url=https://web.archive.org/web/20130614081110/http://openvirtualization.org/ |archive-date=14 June 2013 |url-status=dead}}</ref> is an open source implementation of the trusted world architecture for TrustZone.

[[AMD]] has licensed and incorporated TrustZone technology into its [[AMD Platform Security Processor|Secure Processor Technology]].<ref>{{cite web |title=AMD Secure Technology |url=https://www.amd.com/en-us/innovations/software-technologies/security |website=AMD |access-date=6 July 2016 |archive-url=https://web.archive.org/web/20160723094537/https://www.amd.com/en-us/innovations/software-technologies/security |archive-date=23 July 2016}}</ref> AMD's [[AMD Accelerated Processing Unit|APU]]s include a Cortex-A5 processor for handling secure processing, which is enabled in some, but not all products.<ref>{{cite news |last1=Smith |first1=Ryan |title=AMD 2013 APUs to include ARM Cortex A5 Processor for Trustzone Capabilities |url=https://www.anandtech.com/show/6007/amd-2013-apus-to-include-arm-cortexa5-processor-for-trustzone-capabilities |access-date=6 July 2016 |website=[[AnandTech]] |date=13 June 2012}}</ref><ref name="beema">{{cite news |last1=Shimpi |first1=Anand Lal |title=AMD Beema Mullins Architecture A10 micro 6700T Performance Preview |url=https://www.anandtech.com/show/7974/amd-beema-mullins-architecture-a10-micro-6700t-performance-preview |access-date=6 July 2016 |website=[[AnandTech]] |date=29 April 2014}}</ref><ref>{{cite news |last1=Walton |first1=Jarred |title=AMD Launches Mobile Kaveri APUs |url=https://www.anandtech.com/show/8119/amd-launches-mobile-kaveri-apus |access-date=6 July 2016 |website=[[AnandTech]] |date=4 June 2014}}</ref> In fact, the Cortex-A5 TrustZone core had been included in earlier AMD products, but was not enabled due to time constraints.<ref name="beema"/>

[[Samsung Knox]] uses TrustZone for purposes such as detecting modifications to the kernel, storing certificates and attesting keys.<ref>{{cite web |url=https://docs.samsungknox.com/admin/whitepaper/kpe/hardware-backed-root-of-trust.htm |title=Root of Trust |type=white paper |date=April 2016 |publisher=[[Samsung Electronics]]}}</ref>

===={{anchor|TrustZone for ARMv8-M}}TrustZone for Armv8-M (for Cortex-M profile)====
The Security Extension, marketed as TrustZone for Armv8-M Technology, was introduced in the Armv8-M architecture. While containing similar concepts to TrustZone for Armv8-A, it has a different architectural design, as world switching is performed using branch instructions instead of using exceptions.<ref>{{cite web |url=https://developer.arm.com/documentation/100690/0100/Introduction/Secure-and-Non-secure-worlds/Relationship-between-ARM-TrustZone-technology-for-ARMv8-M-and-ARM-Cortex-A-processors?lang=en |title=Relationship between ARM TrustZone technology for ARMv8-M and ARM Cortex-A processors |publisher=[[Arm Holdings|ARM Developer]]}}</ref> It also supports safe interleaved interrupt handling from either world regardless of the current security state. Together these features provide low-latency calls to the secure world and responsive interrupt handling. ARM provides a reference stack of secure world code in the form of Trusted Firmware for M and [[PSA Certified]].

===No-execute page protection===
As of ARMv6, the ARM architecture supports [[NX bit|no-execute page protection]], which is referred to as ''XN'', for ''eXecute Never''.<ref>{{cite web |quote=APX and XN (execute never) bits have been added in VMSAv6 [Virtual Memory System Architecture] |url=https://www.arm.com/miscPDFs/14128.pdf |title=ARM Architecture Reference Manual |page=B4-8 |archive-url=https://web.archive.org/web/20090206061248/http://arm.com/miscPDFs/14128.pdf |archive-date=6 February 2009}}</ref>

==={{anchor|LPAE}}Large Physical Address Extension (LPAE)===
The Large Physical Address Extension (LPAE), which extends the physical address size from 32 bits to 40 bits, was added to the Armv7-A architecture in 2011.<ref>{{cite book |title=ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition |publisher=ARM Limited}}</ref>

The physical address size may be even larger in processors based on the 64-bit (Armv8-A) architecture. For example, it is 44 bits in Cortex-A75 and Cortex-A65AE.<ref>{{cite web |url=https://developer.arm.com/ip-products/processors/cortex-a/cortex-a65ae |title=Cortex-A65AE |website=ARM Developer |access-date=26 April 2019}}</ref>
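
The practical effect of these address widths is easiest to see as addressable memory sizes. The short Python calculation below assumes byte-addressable memory; the function name <code>addressable_bytes</code> is hypothetical.

```python
GIB = 1 << 30   # bytes in one gibibyte
TIB = 1 << 40   # bytes in one tebibyte


def addressable_bytes(address_bits):
    """Bytes reachable with a physical address of the given width."""
    return 1 << address_bits


# Physical address widths mentioned in the text: 32, 40 (LPAE) and 44 bits.
sizes = {bits: addressable_bytes(bits) for bits in (32, 40, 44)}
```

On this arithmetic, 32 bits reach 4&nbsp;GiB, the 40 bits provided by LPAE reach 1&nbsp;TiB, and 44 bits reach 16&nbsp;TiB.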

==={{anchor|ARM8-R}}Armv8-R and Armv8-M===
The '''Armv8-R''' and '''Armv8-M''' architectures, announced after the Armv8-A architecture, share some features with Armv8-A. However, Armv8-M does not include any 64-bit AArch64 instructions, and Armv8-R originally did not include any AArch64 instructions; those instructions were added to [[#Armv8-R|Armv8-R]] later.

===={{Anchor|ARMv8.1-M}}Armv8.1-M====
The Armv8.1-M architecture, announced in February 2019, is an enhancement of the Armv8-M architecture. It brings new features including:
* A new vector instruction set extension. The M-Profile Vector Extension (MVE), or Helium, is for signal processing and machine learning applications.
* Additional instruction set enhancements for loops and branches (Low Overhead Branch Extension).
* Instructions for [[half-precision floating-point format|half-precision floating-point]] support.
* Instruction set enhancement for TrustZone management for Floating Point Unit (FPU).
* New memory attribute in the Memory Protection Unit (MPU).
* Enhancements in debug, including a Performance Monitoring Unit (PMU), the Unprivileged Debug Extension, and additional debug support focused on signal processing application development.
* Reliability, Availability and Serviceability (RAS) extension.