=={{Anchor|Cache hierarchy}}Cache hierarchy in a modern processor== [[File:Hwloc.png|thumb|right|upright=3|Memory hierarchy of an AMD Bulldozer server]] Modern processors have multiple interacting on-chip caches. The operation of a particular cache can be completely specified by the cache size, the cache block size, the number of blocks in a set, the cache set replacement policy, and the cache write policy (write-through or write-back).<ref name="ccs.neu.edu" /> While all of the cache blocks in a particular cache are the same size and have the same associativity, typically the "lower-level" caches (called Level 1 cache) have a smaller number of blocks, smaller block size, and fewer blocks in a set, but have very short access times. "Higher-level" caches (i.e. Level 2 and above) have progressively larger numbers of blocks, larger block size, more blocks in a set, and relatively longer access times, but are still much faster than main memory. Cache entry replacement policy is determined by a [[cache algorithm]] selected to be implemented by the processor designers. In some cases, multiple algorithms are provided for different kinds of work loads. ===Specialized caches=== Pipelined CPUs access memory from multiple points in the [[Instruction pipeline|pipeline]]: instruction fetch, [[virtual memory|virtual-to-physical]] address translation, and data fetch (see [[classic RISC pipeline]]). The natural design is to use different physical caches for each of these points, so that no one physical resource has to be scheduled to service two points in the pipeline. Thus the pipeline naturally ends up with at least three separate caches (instruction, [[translation lookaside buffer|TLB]], and data), each specialized to its particular role. ====Victim cache==== {{Main article|Victim cache}} A '''victim cache''' is a cache used to hold blocks evicted from a CPU cache upon replacement. The victim cache lies between the main cache and its refill path, and holds only those blocks of data that were evicted from the main cache. The victim cache is usually fully associative, and is intended to reduce the number of conflict misses. Many commonly used programs do not require an associative mapping for all the accesses. In fact, only a small fraction of the memory accesses of the program require high associativity. The victim cache exploits this property by providing high associativity to only these accesses. It was introduced by [[Norman Jouppi]] from DEC in 1990.<ref name=Jouppi1990>{{cite conference |last=Jouppi |first=Norman P. 
|date=May 1990 |title=Improving direct-mapped cache performance by the addition of a small {{Sic|hide=y|fully|-}}associative cache and prefetch buffers |pages=364–373 |book-title=Conference Proceedings of the 17th Annual International Symposium on Computer Architecture |conference=17th Annual International Symposium on Computer Architecture, May 28-31, 1990 |location=Seattle, WA, USA |doi=10.1109/ISCA.1990.134547 }}</ref> Intel's ''[[Crystalwell]]''<ref name="intel-ark-crystal-well">{{cite web | url = http://ark.intel.com/products/codename/51802/Crystal-Well | title = Products (Formerly Crystal Well) | publisher = [[Intel]] | access-date = 2013-09-15 }}</ref> variant of its [[Haswell (microarchitecture)|Haswell]] processors introduced an on-package 128 MiB [[eDRAM]] Level 4 cache which serves as a victim cache to the processors' Level 3 cache.<ref name="anandtech-i74950hq">{{cite web | url = http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3 | title = Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested | publisher = [[AnandTech]] | access-date = 2013-09-16 }}</ref> In the [[Skylake (microarchitecture)|Skylake]] microarchitecture, the Level 4 cache no longer works as a victim cache.<ref>{{cite web |author=Cutress |first=Ian |date=September 2, 2015 |title=The Intel Skylake Mobile and Desktop Launch, with Architecture Analysis |url=http://www.anandtech.com/show/9582/intel-skylake-mobile-desktop-launch-architecture-analysis/5 |publisher=AnandTech}}</ref> ===={{Anchor|TRACE-CACHE}}Trace cache==== {{Main article|Trace cache}} One of the more extreme examples of cache specialization is the '''trace cache''' (also known as ''execution trace cache'') found in the [[Intel]] [[Pentium 4]] microprocessors. A trace cache is a mechanism for increasing the instruction fetch bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces of [[instruction (computer science)|instruction]]s that have already been fetched and decoded.<ref>{{cite web |author=Shimpi |first=Anand Lal |date=2000-11-20 |title=The Pentium 4's Cache – Intel Pentium 4 1.4 GHz & 1.5 GHz |url=http://www.anandtech.com/show/661/5 |access-date=2015-11-30 |publisher=[[AnandTech]]}}</ref> A trace cache stores instructions either after they have been decoded, or as they are retired. Generally, instructions are added to trace caches in groups representing either individual [[basic block]]s or dynamic instruction traces. The Pentium 4's trace cache stores [[micro-operations]] resulting from decoding x86 instructions, also providing the functionality of a micro-operation cache. As a result, the next time an instruction is needed, it does not have to be decoded into micro-ops again.<ref name="agner.org" />{{rp|63–68}} ====Write Coalescing Cache (WCC)==== The Write Coalescing Cache<ref>{{cite web |author=Kanter |first=David |date=August 26, 2010 |title=AMD's Bulldozer Microarchitecture – Memory Subsystem Continued |url=http://www.realworldtech.com/bulldozer/9/ |website=Real World Technologies}}</ref> is a special cache that is part of the L2 cache in [[AMD]]'s [[Bulldozer (microarchitecture)|Bulldozer microarchitecture]]. Stores from both L1D caches in the module go through the WCC, where they are buffered and coalesced. The WCC's task is to reduce the number of writes to the L2 cache.
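As a rough software analogy of the coalescing idea (an illustrative sketch only, not AMD's actual hardware design; the buffer depth and all identifiers are assumptions), stores that fall within the same 64-byte line can be merged in a small buffer so that only one write per line reaches the next level:
<syntaxhighlight lang="c">
/* Illustrative write-coalescing sketch (hypothetical names and sizes; not
 * AMD's actual hardware). Stores to the same 64-byte line are merged in a
 * small buffer so only one write per line reaches the next cache level. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_SIZE   64
#define WCC_ENTRIES 4                     /* assumed buffer depth */

struct wcc_entry {
    int      valid;
    uint64_t line_addr;                   /* base address of the 64-byte line */
    uint8_t  data[LINE_SIZE];
};

static struct wcc_entry wcc[WCC_ENTRIES];
static unsigned l2_writes;                /* counts writes reaching "L2" */

static void write_line_to_l2(const struct wcc_entry *e) {
    (void)e;                              /* stand-in for a real L2 write */
    l2_writes++;
}

/* Buffer a one-byte store, coalescing it if its line is already present. */
static void wcc_store(uint64_t addr, uint8_t value) {
    uint64_t line = addr & ~(uint64_t)(LINE_SIZE - 1);
    for (int i = 0; i < WCC_ENTRIES; i++)
        if (wcc[i].valid && wcc[i].line_addr == line) {
            wcc[i].data[addr % LINE_SIZE] = value;     /* coalesced */
            return;
        }
    for (int i = 0; i < WCC_ENTRIES; i++)
        if (!wcc[i].valid) {                           /* allocate an entry */
            wcc[i].valid = 1;
            wcc[i].line_addr = line;
            wcc[i].data[addr % LINE_SIZE] = value;
            return;
        }
    write_line_to_l2(&wcc[0]);                         /* buffer full: evict */
    wcc[0].line_addr = line;
    memset(wcc[0].data, 0, LINE_SIZE);
    wcc[0].data[addr % LINE_SIZE] = value;
}

int main(void) {
    for (uint64_t a = 0x1000; a < 0x1100; a++)         /* 256 stores, 4 lines */
        wcc_store(a, (uint8_t)a);
    for (int i = 0; i < WCC_ENTRIES; i++)              /* final flush */
        if (wcc[i].valid)
            write_line_to_l2(&wcc[i]);
    printf("writes reaching L2: %u\n", l2_writes);     /* prints 4, not 256 */
    return 0;
}
</syntaxhighlight>
With the assumed four-entry buffer, the 256 byte-sized stores in the example collapse into four line-sized writes to the next level.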
===={{Anchor|UOP-CACHE}}Micro-operation (μop or uop) cache==== A '''micro-operation cache''' ('''μop cache''', '''uop cache''' or '''UC''')<ref>{{cite web |author=Kanter |first=David |date=September 25, 2010 |title=Intel's Sandy Bridge Microarchitecture – Instruction Decode and uop Cache |url=http://www.realworldtech.com/sandy-bridge/4/ |website=Real World Technologies}}</ref> is a specialized cache that stores [[micro-operation]]s of decoded instructions, as received directly from the [[instruction decoder]]s or from the instruction cache. When an instruction needs to be decoded, the μop cache is checked for its decoded form which is re-used if cached; if it is not available, the instruction is decoded and then cached. One of the early works describing μop cache as an alternative frontend for the Intel [[P6 (microarchitecture)|P6 processor family]] is the 2001 paper ''"Micro-Operation Cache: A Power Aware Frontend for Variable Instruction Length ISA"''.<ref name="uop-intel">{{cite conference |conference=2001 International Symposium on Low Power Electronics and Design (ISLPED'01), August 6-7, 2001 |location=Huntington Beach, CA, USA |last1=Solomon |first1=Baruch |book-title=ISLPED'01: Proceedings of the 2001 International Symposium on Low Power Electronics and Design |last2=Mendelson |first2=Avi |last3=Orenstein |first3=Doron |last4=Almog |first4=Yoav |last5=Ronen |first5=Ronny |date=August 2001 |publisher=[[Association for Computing Machinery]] |isbn=978-1-58113-371-4 |pages=4–9 |title=Micro-Operation Cache: A Power Aware Frontend for Variable Instruction Length ISA |doi=10.1109/LPE.2001.945363 |access-date=2013-10-06 |url=http://cecs.uci.edu/~papers/compendium94-03/papers/2001/islped01/pdffiles/p004.pdf |s2cid=195859085}}</ref> Later, Intel included μop caches in its [[Sandy Bridge]] processors and in successive microarchitectures like [[Ivy Bridge (microarchitecture)|Ivy Bridge]] and [[Haswell (microarchitecture)|Haswell]].<ref name="agner.org">{{cite web |author=Fog |first=Agner |author-link=Agner Fog |date=2014-02-19 |title=The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers |url=http://www.agner.org/optimize/microarchitecture.pdf |access-date=2014-03-21 |website=agner.org}}</ref>{{rp|121–123}}<ref name="anandtech-haswell">{{cite web |author=Shimpi |first=Anand Lal |date=2012-10-05 |title=Intel's Haswell Architecture Analyzed |url=http://www.anandtech.com/show/6355/intels-haswell-architecture/6 |access-date=2013-10-20 |publisher=[[AnandTech]]}}</ref> AMD implemented a μop cache in their [[Zen (microarchitecture)|Zen microarchitecture]].<ref>{{cite web |author=Cutress |first=Ian |date=2016-08-18 |title=AMD Zen Microarchitecture: Dual Schedulers, Micro-Op Cache and Memory Hierarchy Revealed |url=http://www.anandtech.com/show/10578/amd-zen-microarchitecture-dual-schedulers-micro-op-cache-memory-hierarchy-revealed |access-date=2017-04-03 |publisher=AnandTech}}</ref> Fetching complete pre-decoded instructions eliminates the need to repeatedly decode variable length complex instructions into simpler fixed-length micro-operations, and simplifies the process of predicting, fetching, rotating and aligning fetched instructions. A μop cache effectively offloads the fetch and decode hardware, thus decreasing [[power consumption]] and improving the frontend supply of decoded micro-operations. 
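The check-then-decode behaviour described above can be sketched as a simple memoization table (an illustrative sketch only; the table size, entry layout and dummy decoder are assumptions, not any processor's actual structures):
<syntaxhighlight lang="c">
/* Memoization-style sketch of a μop cache's check-then-decode behaviour.
 * The table size, entry layout and dummy decoder are assumptions made for
 * this example; they do not describe any processor's actual structures. */
#include <stdint.h>
#include <stdio.h>

#define UOP_CACHE_ENTRIES 256

struct uop_entry {
    int      valid;
    uint64_t pc;                     /* address of the decoded instruction */
    uint32_t uops[4];                /* its decoded micro-operations */
};

static struct uop_entry uop_cache[UOP_CACHE_ENTRIES];
static unsigned decode_count;

/* Stand-in for the (expensive, power-hungry) instruction decoders. */
static void decode(uint64_t pc, uint32_t out[4]) {
    decode_count++;
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)(pc + i);             /* dummy micro-ops */
}

/* Return micro-ops for pc, decoding only when they are not already cached. */
static const uint32_t *fetch_uops(uint64_t pc) {
    struct uop_entry *e = &uop_cache[pc % UOP_CACHE_ENTRIES];
    if (!(e->valid && e->pc == pc)) {            /* miss: decode, then cache */
        decode(pc, e->uops);
        e->pc = pc;
        e->valid = 1;
    }
    return e->uops;                              /* hit, or freshly filled */
}

int main(void) {
    for (int iter = 0; iter < 100; iter++)       /* a hot loop, executed 100x */
        for (uint64_t pc = 0x400000; pc < 0x400020; pc += 4)
            fetch_uops(pc);
    printf("decode passes: %u\n", decode_count); /* prints 8, not 800 */
    return 0;
}
</syntaxhighlight>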
The μop cache also increases performance by more consistently delivering decoded micro-operations to the backend and eliminating various bottlenecks in the CPU's fetch and decode logic.<ref name="uop-intel" /><ref name="anandtech-haswell" /> A μop cache has many similarities with a trace cache, although a μop cache is much simpler thus providing better power efficiency; this makes it better suited for implementations on battery-powered devices. The main disadvantage of the trace cache, leading to its power inefficiency, is the hardware complexity required for its [[heuristic]] deciding on caching and reusing dynamically created instruction traces.<ref name="tc-slides">{{cite web |last1=Gu |first1=Leon |last2=Motiani |first2=Dipti |date=October 2003 |title=Trace Cache |url=https://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/lectures/TraceCache_slides.pdf |access-date=2013-10-06}}</ref> ====Branch target instruction cache==== A '''branch target cache''' or '''branch target instruction cache''', the name used on [[ARM microprocessors]],<ref>{{cite web |author=Niu |first=Kun |date=28 May 2015 |title=How does the BTIC (branch target instruction cache) work? |url=https://community.arm.com/processors/f/discussions/5320/how-does-the-btic-branch-target-instruction-cache-works |access-date=7 April 2018}}</ref> is a specialized cache which holds the first few instructions at the destination of a taken branch. This is used by low-powered processors which do not need a normal instruction cache because the memory system is capable of delivering instructions fast enough to satisfy the CPU without one. However, this only applies to consecutive instructions in sequence; it still takes several cycles of latency to restart instruction fetch at a new address, causing a few cycles of pipeline bubble after a control transfer. A branch target cache provides instructions for those few cycles avoiding a delay after most taken branches. This allows full-speed operation with a much smaller cache than a traditional full-time instruction cache. ====Smart cache==== '''Smart cache''' is a [[#MULTILEVEL|level 2]] or [[#MULTILEVEL|level 3]] caching method for multiple execution cores, developed by [[Intel]]. Smart Cache shares the actual cache memory between the cores of a [[multi-core processor]]. In comparison to a dedicated per-core cache, the overall [[cache miss]] rate decreases when cores do not require equal parts of the cache space. Consequently, a single core can use the full level 2 or level 3 cache while the other cores are inactive.<ref>{{cite web|url=http://www.intel.com/content/www/us/en/architecture-and-technology/intel-smart-cache.html|title=Intel Smart Cache: Demo|publisher=[[Intel]]|access-date=2012-01-26}}</ref> Furthermore, the shared cache makes it faster to share memory among different execution cores.<ref>{{cite web |url=http://software.intel.com/file/18374/ |archive-url=https://web.archive.org/web/20111229193036/http://software.intel.com/file/18374/ |title=Inside Intel Core Microarchitecture and Smart Memory Access |format=PDF |page=5 |publisher=[[Intel]] |year=2006 |access-date=2012-01-26 |archive-date=2011-12-29 |url-status=dead}}</ref> ==={{Anchor|MULTILEVEL}}Multi-level caches=== {{See also|Cache hierarchy}} Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger, slower caches. 
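The benefit of such a hierarchy can be estimated with the usual average memory access time (AMAT) formula; the latencies and miss rates below are assumed, illustrative values rather than measurements of any particular processor:
<syntaxhighlight lang="c">
/* Average memory access time (AMAT) for a two-level hierarchy:
 *   AMAT = L1_hit_time + L1_miss_rate * (L2_hit_time + L2_miss_rate * memory_time)
 * All numbers are assumed, illustrative values, not measurements. */
#include <stdio.h>

int main(void) {
    double l1_hit_time  = 1.0;     /* cycles */
    double l1_miss_rate = 0.05;    /* 5% of accesses miss in L1 */
    double l2_hit_time  = 12.0;    /* cycles */
    double l2_miss_rate = 0.20;    /* 20% of L1 misses also miss in L2 */
    double memory_time  = 200.0;   /* cycles to reach main memory */

    double amat = l1_hit_time
                + l1_miss_rate * (l2_hit_time + l2_miss_rate * memory_time);
    printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05 * (12 + 0.2 * 200) = 3.6 */
    return 0;
}
</syntaxhighlight>
With these assumed numbers, adding a second cache level cuts the average access time from 1 + 0.05 × 200 = 11 cycles (an L1 backed directly by main memory) to 3.6 cycles.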
Multi-level caches generally operate by checking the fastest but smallest cache, ''level 1'' ('''L1'''), first; if it hits, the processor proceeds at high speed. If that cache misses, the slower but larger next level cache, ''level 2'' ('''L2'''), is checked, and so on, before accessing external memory. As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache. Price-sensitive designs used this to pull the entire cache hierarchy on-chip, but by the 2010s some of the highest-performance designs returned to having large off-chip caches, often implemented in [[eDRAM]] and mounted on a [[multi-chip module]], as a fourth cache level. In rare cases, such as in the mainframe CPU [[IBM z15 (microprocessor)|IBM z15]] (2019), all cache levels down to L1 are implemented with eDRAM, replacing [[static random-access memory|SRAM]] entirely (SRAM is still used for registers{{cn|date=May 2025}}). [[Apple Inc|Apple's]] [[ARM architecture family|ARM-based]] [[Apple silicon]] series, starting with the [[Apple A14|A14]] and [[Apple M1|M1]], have a 192 KiB L1i cache for each of the high-performance cores, an unusually large amount; however, the high-efficiency cores have only 128 KiB. Since then, other processors such as [[Intel]]'s [[Lunar Lake]] and [[Qualcomm]]'s [[Oryon]] have also implemented similar L1i cache sizes. The benefits of L3 and L4 caches depend on the application's access patterns. Examples of products incorporating L3 and L4 caches include the following: * [[Alpha 21164]] (1995) had 1 to 64 MiB off-chip L3 cache. * [[AMD K6-III]] (1999) had motherboard-based L3 cache. * IBM [[POWER4]] (2001) had off-chip L3 caches of 32 MiB per processor, shared among several processors. * [[Itanium 2]] (2003) had a 6 MiB [[unified cache|unified]] level 3 (L3) cache on-die; the [[Itanium 2]] (2003) MX 2 module incorporated two Itanium 2 processors along with a shared 64 MiB L4 cache on a [[multi-chip module]] that was pin-compatible with a Madison processor. * Intel's [[Xeon]] MP product codenamed "Tulsa" (2006) features 16 MiB of on-die L3 cache shared between two processor cores. * [[AMD Phenom]] (2007) had 2 MiB of L3 cache. * AMD [[Phenom II]] (2008) has up to 6 MiB on-die unified L3 cache. * [[List of Intel Core i7 processors|Intel Core i7]] (2008) has an 8 MiB on-die unified L3 cache that is inclusive, shared by all cores. * Intel [[Haswell (microarchitecture)|Haswell]] CPUs with integrated [[Intel Iris Pro Graphics]] have 128 MiB of eDRAM acting essentially as an L4 cache.<ref>{{cite web|url=http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3 |title=Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested |publisher=AnandTech |access-date=2014-02-25}}</ref> Finally, at the other end of the memory hierarchy, the CPU [[register file]] itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software—typically by a compiler, as it allocates registers to hold values retrieved from main memory, for example during [[loop nest optimization]]. However, with [[register renaming]], most compiler register assignments are reallocated dynamically by hardware at runtime into a register bank, allowing the CPU to break false data dependencies and thus easing pipeline hazards.
Register files sometimes also have hierarchy: the [[Cray-1]] (circa 1976) had eight address "A" and eight scalar data "S" registers that were generally usable. There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a data cache. (The Cray-1 did, however, have an instruction cache.) ===={{Anchor|LLC}}Multi-core chips==== When considering a chip with [[Multi-core processor|multiple cores]], there is a question of whether the caches should be shared or local to each core. Implementing a shared cache inevitably introduces more wiring and complexity. On the other hand, having one cache per ''chip'' rather than per ''core'' greatly reduces the amount of space needed, and thus allows a larger cache. Typically, sharing the L1 cache is undesirable because the resulting increase in latency would make each core run considerably slower than a single-core chip. However, for the highest-level cache, the last one called before accessing memory, having a global cache is desirable for several reasons, such as allowing a single core to use the whole cache, reducing data redundancy by making it possible for different processes or threads to share cached data, and reducing the complexity of utilized cache coherency protocols.<ref>{{cite web |last1=Tian |first1=Tian |last2=Shih |first2=Chiu-Pi |date=2012-03-08 |title=Software Techniques for Shared-Cache Multi-Core Systems |url=https://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems |access-date=2015-11-24 |publisher=[[Intel]]}}</ref> For example, an eight-core chip with three levels may include an L1 cache for each core, one intermediate L2 cache for each pair of cores, and one L3 cache shared between all cores. A shared highest-level cache, which is called before accessing memory, is usually referred to as a ''last level cache'' (LLC). Additional techniques are used to increase the level of parallelism when the LLC is shared between multiple cores, including slicing it into multiple pieces, each addressing a certain range of memory addresses and accessible independently.<ref>{{cite web |author=Lempel |first=Oded |date=2013-07-28 |title=2nd Generation Intel Core Processor Family: Intel Core i7, i5 and i3 |url=http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.911-Sandy-Bridge-Lempel-Intel-Rev%207.pdf |url-status=dead |archive-url=https://web.archive.org/web/20200729000210/http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.911-Sandy-Bridge-Lempel-Intel-Rev%207.pdf |archive-date=2020-07-29 |access-date=2014-01-21 |website=hotchips.org |pages=7–10, 31–45}}</ref> ====Separate versus unified==== In a separate cache structure, instructions and data are cached separately, meaning that a cache line is used to cache either instructions or data, but not both; various benefits have been demonstrated with separate data and instruction [[translation lookaside buffer]]s.<ref>{{cite journal |author1=Chen, J. Bradley |author2=Borg, Anita |author3=Jouppi, Norman P. |title=A Simulation Based Study of TLB Performance |journal=ACM SIGARCH Computer Architecture News |volume=20 |issue=2 |year=1992 |pages=114–123 |doi=10.1145/146628.139708|doi-access=free }}</ref> In a unified structure, this constraint is not present, and cache lines can be used to cache both instructions and data.
===={{Anchor|INCLUSIVE|EXCLUSIVE}}Exclusive versus inclusive==== Multi-level caches introduce new design decisions. For instance, in some processors, all data in the L1 cache must also be somewhere in the L2 cache. These caches are called ''strictly inclusive''. Other processors (like the [[AMD Athlon]]) have ''exclusive'' caches: data are guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors (like the Intel [[Pentium II]], [[Pentium III|III]], and [[Pentium 4|4]]) do not require that data in the L1 cache also reside in the L2 cache, although they may often do so. There is no universally accepted name for this intermediate policy;<ref>{{cite web | url = http://www.amecomputers.com/explanation-of-the-l1-and-l2-cache.html | title = Explanation of the L1 and L2 Cache | access-date = 2014-06-09 | website = amecomputers.com | archive-date = 2014-07-14 | archive-url = https://web.archive.org/web/20140714181050/http://www.amecomputers.com/explanation-of-the-l1-and-l2-cache.html | url-status = dead }}</ref><ref name="ispass04">{{cite conference |last1=Zheng |first1=Ying |last2=Davis |first2=Brian T. |last3=Jordan |first3=Matthew |date=10–12 March 2004 |title=Performance Evaluation of Exclusive Cache Hierarchies |url=http://mercury.pr.erau.edu/~davisb22/papers/ispass04.pdf |conference=IEEE International Symposium on Performance Analysis of Systems and Software |location=Austin, Texas, USA |pages=89–96 |doi=10.1109/ISPASS.2004.1291359 |isbn=0-7803-8385-0 |archive-url=https://web.archive.org/web/20120813003941/http://mercury.pr.erau.edu/~davisb22/papers/ispass04.pdf |archive-date=2012-08-13 |access-date=2014-06-09 |url-status=dead}}</ref> two common names are "non-exclusive" and "partially-inclusive". The advantage of exclusive caches is that they store more data. This advantage is larger when the exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than just copying a line from L2 to L1, which is what an inclusive cache does.<ref name="ispass04" /> One advantage of strictly inclusive caches is that when external devices or other processors in a multiprocessor system wish to remove a cache line from the processor, they need only have the processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache must be checked as well. As a drawback, there is a correlation between the associativities of L1 and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the effective associativity of the L1 caches is restricted. Another disadvantage of inclusive caches is that whenever there is an eviction in the L2 cache, the (possibly) corresponding lines in L1 also have to be evicted in order to maintain inclusiveness. This is quite a bit of work, and would result in a higher L1 miss rate.<ref name="ispass04" /> Another advantage of inclusive caches is that the larger cache can use larger cache lines, which reduces the size of the secondary cache tags. (Exclusive caches require both caches to have the same size cache lines, so that cache lines can be swapped on an L1 miss, L2 hit.)
If the secondary cache is an order of magnitude larger than the primary, and the cache data are an order of magnitude larger than the cache tags, this tag area saved can be comparable to the incremental area needed to store the L1 cache data in the L2.<ref>{{cite web |author1=Jaleel |first=Aamer |author2=Eric Borch |last3=Bhandaru |first3=Malini |last4=Steely Jr. |first4=Simon C. |last5=Emer |first5=Joel |date=2010-09-27 |title=Achieving Non-Inclusive Cache Performance with Inclusive Caches |url=http://www.jaleels.org/ajaleel/publications/micro2010-tla.pdf |access-date=2014-06-09 |website=jaleels.org}}</ref> ===Scratchpad memory=== {{Main|Scratchpad memory}} [[Scratchpad memory]] (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. ===Example: the K8=== To illustrate both specialization and multi-level caching, here is the cache hierarchy of the K8 core in the AMD [[Athlon 64]] CPU.<ref>{{cite web|url=http://www.sandpile.org/impl/k8.htm |title=AMD K8 |access-date=2007-06-02 |website=Sandpile.org |url-status=dead |archive-url=https://web.archive.org/web/20070515052223/http://www.sandpile.org/impl/k8.htm |archive-date=2007-05-15 }}</ref> [[File:Cache,hierarchy-example.svg|thumb|center|upright=1.6|Cache hierarchy of the K8 core in the AMD Athlon 64 CPU]] The K8 has four specialized caches: an instruction cache, an instruction [[translation lookaside buffer|TLB]], a data TLB, and a data cache. Each of these caches is specialized: * The instruction cache keeps copies of 64-byte lines of memory, and fetches 16 bytes each cycle. Each byte in this cache is stored in ten bits rather than eight, with the extra bits marking the boundaries of instructions (this is an example of predecoding). The cache has only [[parity bit|parity]] protection rather than [[Error-correcting code|ECC]], because parity is smaller and any damaged data can be replaced by fresh data fetched from memory (which always has an up-to-date copy of instructions). * The instruction TLB keeps copies of page table entries (PTEs). Each cycle's instruction fetch has its virtual address translated through this TLB into a physical address. Each entry is either four or eight bytes in memory. Because the K8 has a variable page size, each of the TLBs is split into two sections, one to keep PTEs that map 4 KiB pages, and one to keep PTEs that map 4 MiB or 2 MiB pages. The split allows the fully associative match circuitry in each section to be simpler. The operating system maps different sections of the virtual address space with different size PTEs. * The data TLB has two copies which keep identical entries. The two copies allow two data accesses per cycle to translate virtual addresses to physical addresses. Like the instruction TLB, this TLB is split into two kinds of entries. * The data cache keeps copies of 64-byte lines of memory. It is split into 8 banks (each storing 8 KiB of data), and can fetch two 8-byte data each cycle so long as those data are in different banks. There are two copies of the tags, because each 64-byte line is spread among all eight banks. Each tag copy handles one of the two accesses per cycle. The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which store only PTEs mapping 4 KiB. Both instruction and data caches, and the various TLBs, can fill from the large '''unified''' L2 cache. 
This cache is exclusive to both the L1 instruction and data caches, which means that any 8-byte line can only be in one of the L1 instruction cache, the L1 data cache, or the L2 cache. It is, however, possible for a line in the data cache to have a PTE which is also in one of the TLBs—the operating system is responsible for keeping the TLBs coherent by flushing portions of them when the page tables in memory are updated. The K8 also caches information that is never stored in memory—prediction information. These caches are not shown in the above diagram. As is usual for this class of CPU, the K8 has fairly complex [[branch prediction]], with tables that help predict whether branches are taken and other tables which predict the targets of branches and jumps. Some of this information is associated with instructions, in both the level 1 instruction cache and the unified secondary cache. The K8 uses an interesting trick to store prediction information with instructions in the secondary cache. Lines in the secondary cache are protected from accidental data corruption (e.g. by an [[alpha particle]] strike) by either [[Error-correcting code|ECC]] or [[parity (telecommunication)|parity]], depending on whether those lines were evicted from the data or instruction primary caches. Since the parity code takes fewer bits than the ECC code, lines from the instruction cache have a few spare bits. These bits are used to cache branch prediction information associated with those instructions. The net result is that the branch predictor has a larger effective history table, and so has better accuracy. ===More hierarchies=== <!-- (This section should be rewritten.) --> Other processors have other kinds of predictors (e.g., the store-to-load bypass predictor in the [[Digital Equipment Corporation|DEC]] [[Alpha 21264]]), and various specialized predictors are likely to flourish in future processors. These predictors are caches in that they store information that is costly to compute. Some of the terminology used when discussing predictors is the same as that for caches (one speaks of a '''hit''' in a branch predictor), but predictors are not generally thought of as part of the cache hierarchy. The K8 keeps the instruction and data caches '''[[cache coherency|coherent]]''' in hardware, which means that a store into an instruction that closely follows the store instruction will change that following instruction. Other processors, like those in the Alpha and MIPS families, have relied on software to keep the instruction cache coherent. Stores are not guaranteed to show up in the instruction stream until a program calls an operating system facility to ensure coherency. ===Tag RAM=== [[File:Medion 9901 - Intel Pentium III SL35E - TagRAM chip SL3F5-1387.jpg|thumb|Tag RAM on the board of an [[Pentium III|Intel Pentium III]] ]] In computer engineering, a ''tag RAM'' is used to specify which of the possible memory locations is currently stored in a CPU cache.<ref>{{cite web | url = http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0363g/Chdijaed.html | title = Cortex-R4 and Cortex-R4F Technical Reference Manual | access-date = 2013-09-28 | publisher = arm.com }}</ref><ref>{{cite web | url = http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0284g/Ebddefci.html | title = L210 Cache Controller Technical Reference Manual | access-date = 2013-09-28 | publisher = arm.com }}</ref> For a simple, direct-mapped design, fast [[static random-access memory|SRAM]] can be used.
Higher [[#Associativity|associative caches]] usually employ [[content-addressable memory]].
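As a rough illustration of how the tag RAM is consulted in a simple direct-mapped cache, the following sketch splits an address into tag, index and offset fields and compares the stored tag on each lookup (the cache geometry and all identifiers are assumptions chosen for the example, not any particular design):
<syntaxhighlight lang="c">
/* Direct-mapped tag lookup sketch. The geometry (32 KiB of data, 64-byte
 * lines, 512 sets) and all identifiers are assumptions for illustration. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64
#define NUM_SETS  512                          /* 32 KiB / 64 B = 512 lines */

struct tag_entry {
    bool     valid;
    uint64_t tag;
};

static struct tag_entry tag_ram[NUM_SETS];     /* the "tag RAM" itself */

/* A hit means the tag stored for this address's set matches the address. */
static bool cache_lookup(uint64_t addr) {
    uint64_t index = (addr / LINE_SIZE) % NUM_SETS;
    uint64_t tag   = addr / (LINE_SIZE * (uint64_t)NUM_SETS);
    return tag_ram[index].valid && tag_ram[index].tag == tag;
}

/* Record that the line containing addr is now cached. */
static void cache_fill(uint64_t addr) {
    uint64_t index = (addr / LINE_SIZE) % NUM_SETS;
    tag_ram[index].valid = true;
    tag_ram[index].tag   = addr / (LINE_SIZE * (uint64_t)NUM_SETS);
}

int main(void) {
    cache_fill(0x12345678);
    printf("same line: %s\n", cache_lookup(0x12345678) ? "hit" : "miss");
    /* An address 32 KiB away maps to the same set but has a different tag,
     * so a direct-mapped cache reports a (conflict) miss. */
    printf("same set:  %s\n",
           cache_lookup(0x12345678 + (uint64_t)LINE_SIZE * NUM_SETS) ? "hit" : "miss");
    return 0;
}
</syntaxhighlight>
A set-associative cache would store several tags per index and compare the address tag against each of them in parallel, which is the comparison that content-addressable memory performs in hardware.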