==={{Anchor|MULTILEVEL}}Multi-level caches===
{{See also|Cache hierarchy}}
Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger, slower caches. Multi-level caches generally operate by checking the fastest and smallest cache, ''level 1'' ('''L1'''), first; if it hits, the processor proceeds at high speed. If that cache misses, the slower but larger next-level cache, ''level 2'' ('''L2'''), is checked, and so on, before external memory is accessed (see the sketch below).

As the latency difference between main memory and the fastest cache has grown, some processors have begun to use as many as three levels of on-chip cache. Price-sensitive designs used this to pull the entire cache hierarchy on-chip, but by the 2010s some of the highest-performance designs returned to having large off-chip caches, often implemented in [[eDRAM]] and mounted on a [[multi-chip module]], as a fourth cache level. In rare cases, such as the mainframe CPU [[IBM z15 (microprocessor)|IBM z15]] (2019), all levels down to L1 are implemented with eDRAM, replacing [[static random-access memory|SRAM]] entirely for cache (SRAM is still used for registers{{cn|date=May 2025}}). [[Apple Inc|Apple's]] [[ARM architecture family|ARM-based]] [[Apple silicon]] series, starting with the [[Apple A14|A14]] and [[Apple M1|M1]], has a 192 KiB L1i cache for each of the high-performance cores, an unusually large amount; the high-efficiency cores, however, have only 128 KiB. Since then, other processors such as [[Intel]]'s [[Lunar Lake]] and [[Qualcomm]]'s [[Oryon]] have also implemented similar L1i cache sizes.

The benefits of L3 and L4 caches depend on the application's access patterns. Examples of products incorporating L3 and L4 caches include the following:
* [[Alpha 21164]] (1995) had 1 to 64 MiB of off-chip L3 cache.
* [[AMD K6-III]] (1999) had motherboard-based L3 cache.
* IBM [[POWER4]] (2001) had off-chip L3 caches of 32 MiB per processor, shared among several processors.
* [[Itanium 2]] (2003) had a 6 MiB [[unified cache|unified]] level 3 (L3) cache on-die; the Itanium 2 MX 2 module incorporated two Itanium 2 processors along with a shared 64 MiB L4 cache on a [[multi-chip module]] that was pin-compatible with a Madison processor.
* Intel's [[Xeon]] MP product codenamed "Tulsa" (2006) featured 16 MiB of on-die L3 cache shared between two processor cores.
* [[AMD Phenom]] (2007) had 2 MiB of L3 cache.
* AMD [[Phenom II]] (2008) had up to 6 MiB of on-die unified L3 cache.
* [[List of Intel Core i7 processors|Intel Core i7]] (2008) has an 8 MiB on-die unified L3 cache that is inclusive, shared by all cores.
* Intel [[Haswell (microarchitecture)|Haswell]] CPUs with integrated [[Intel Iris Pro Graphics]] have 128 MiB of eDRAM acting essentially as an L4 cache.<ref>{{cite web|url=http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3 |title=Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested |publisher=AnandTech |access-date=2014-02-25}}</ref>
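The lookup order described above can be illustrated with a minimal sketch (a toy model only: the class, fill policy, and latency figures are illustrative assumptions rather than the behavior of any particular processor, and real hardware overlaps these checks rather than performing them one after another):

<syntaxhighlight lang="python">
# Toy model of a multi-level cache lookup: check the fastest level first,
# fall through to larger but slower levels, and finally to main memory.
# All names and latency figures are illustrative assumptions.

class Cache:
    def __init__(self, name, latency_cycles):
        self.name = name
        self.latency = latency_cycles
        self.lines = set()            # addresses of currently cached lines

DRAM_LATENCY = 200                    # assumed main-memory latency, in cycles

def access(line_address, levels):
    """Return the total latency of one access through the hierarchy."""
    total = 0
    for cache in levels:              # levels ordered fastest to slowest
        total += cache.latency
        if line_address in cache.lines:
            return total              # hit: stop at this level
    for cache in levels:              # missed every level: fill on the way back
        cache.lines.add(line_address)
    return total + DRAM_LATENCY

l1, l2, l3 = Cache("L1", 4), Cache("L2", 12), Cache("L3", 40)
print(access(0x40, [l1, l2, l3]))     # cold miss: 4 + 12 + 40 + 200 = 256
print(access(0x40, [l1, l2, l3]))     # L1 hit on the second access: 4
</syntaxhighlight>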
Finally, at the other end of the memory hierarchy, the CPU [[register file]] itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software, typically by a compiler as it allocates registers to hold values retrieved from main memory, as in [[loop nest optimization]]. However, with [[register renaming]], most compiler register assignments are reallocated dynamically by hardware at runtime into a register bank, allowing the CPU to break false data dependencies and thus easing pipeline hazards.

Register files sometimes also have hierarchy: the [[Cray-1]] (circa 1976) had eight address "A" and eight scalar data "S" registers that were generally usable. There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a data cache. (The Cray-1 did, however, have an instruction cache.)

===={{Anchor|LLC}}Multi-core chips====
When considering a chip with [[Multi-core processor|multiple cores]], there is a question of whether the caches should be shared or local to each core. Implementing a shared cache inevitably introduces more wiring and complexity. However, having one cache per ''chip'' rather than per ''core'' greatly reduces the amount of space needed, so a larger cache can be included. Typically, sharing the L1 cache is undesirable because the resulting increase in latency would make each core run considerably slower than a single-core chip. However, for the highest-level cache, the last one called before accessing memory, having a global cache is desirable for several reasons, such as allowing a single core to use the whole cache, reducing data redundancy by making it possible for different processes or threads to share cached data, and reducing the complexity of the cache coherency protocols used.<ref>{{cite web |last1=Tian |first1=Tian |last2=Shih |first2=Chiu-Pi |date=2012-03-08 |title=Software Techniques for Shared-Cache Multi-Core Systems |url=https://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems |access-date=2015-11-24 |publisher=[[Intel]]}}</ref> For example, an eight-core chip with three levels may include an L1 cache for each core, one intermediate L2 cache for each pair of cores, and one L3 cache shared between all cores. A shared highest-level cache, which is called before accessing memory, is usually referred to as a ''last level cache'' (LLC). When the LLC is shared between multiple cores, additional techniques are used to increase the level of parallelism, such as slicing it into multiple pieces, each handling a certain range of memory addresses and accessible independently (see the sketch below).<ref>{{cite web |last=Lempel |first=Oded |date=2013-07-28 |title=2nd Generation Intel Core Processor Family: Intel Core i7, i5 and i3 |url=http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.911-Sandy-Bridge-Lempel-Intel-Rev%207.pdf |url-status=dead |archive-url=https://web.archive.org/web/20200729000210/http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.911-Sandy-Bridge-Lempel-Intel-Rev%207.pdf |archive-date=2020-07-29 |access-date=2014-01-21 |website=hotchips.org |pages=7–10, 31–45}}</ref>
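The slicing idea can be sketched as follows (a toy model: the XOR-fold hash, slice count, and line size are assumptions chosen for illustration; production slice-hash functions are generally undocumented and considerably more elaborate):

<syntaxhighlight lang="python">
# Toy model of a sliced last-level cache: each physical line address is
# hashed to one of several independently accessible slices.
# The XOR-fold hash and all parameters are illustrative assumptions.

NUM_SLICES = 8        # assumed: e.g. one slice per core
SLICE_BITS = 3        # log2(NUM_SLICES)
LINE_BITS = 6         # 64-byte cache lines

def slice_index(address):
    line = address >> LINE_BITS            # discard the offset within a line
    folded = 0
    while line:
        folded ^= line & (NUM_SLICES - 1)  # XOR-fold successive address chunks
        line >>= SLICE_BITS
    return folded

# Consecutive lines spread across slices, so different cores can often
# access different slices in parallel.
print([slice_index(addr) for addr in range(0, 64 * 12, 64)])
</syntaxhighlight>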
====Separate versus unified====
In a separate cache structure, instructions and data are cached separately, meaning that a cache line is used to cache either instructions or data, but not both; various benefits have been demonstrated with separate data and instruction [[translation lookaside buffer]]s.<ref>{{cite journal |author1=Chen, J. Bradley |author2=Borg, Anita |author3=Jouppi, Norman P. |title=A Simulation Based Study of TLB Performance |journal=ACM SIGARCH Computer Architecture News |volume=20 |issue=2 |year=1992 |pages=114–123 |doi=10.1145/146628.139708 |doi-access=free}}</ref> In a unified structure, this constraint is not present, and cache lines can be used to cache both instructions and data.

===={{Anchor|INCLUSIVE|EXCLUSIVE}}Exclusive versus inclusive====
Multi-level caches introduce new design decisions. For instance, in some processors, all data in the L1 cache must also be somewhere in the L2 cache. These caches are called ''strictly inclusive''. Other processors (like the [[AMD Athlon]]) have ''exclusive'' caches: data are guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors (like the Intel [[Pentium II]], [[Pentium III|III]], and [[Pentium 4|4]]) do not require that data in the L1 cache also reside in the L2 cache, although it may often do so. There is no universally accepted name for this intermediate policy;<ref>{{cite web | url = http://www.amecomputers.com/explanation-of-the-l1-and-l2-cache.html | title = Explanation of the L1 and L2 Cache | access-date = 2014-06-09 | website = amecomputers.com | archive-date = 2014-07-14 | archive-url = https://web.archive.org/web/20140714181050/http://www.amecomputers.com/explanation-of-the-l1-and-l2-cache.html | url-status = dead }}</ref><ref name="ispass04">{{cite conference |last1=Zheng |first1=Ying |last2=Davis |first2=Brian T. |last3=Jordan |first3=Matthew |date=10–12 March 2004 |title=Performance Evaluation of Exclusive Cache Hierarchies |url=http://mercury.pr.erau.edu/~davisb22/papers/ispass04.pdf |conference=IEEE International Symposium on Performance Analysis of Systems and Software |location=Austin, Texas, USA |pages=89–96 |doi=10.1109/ISPASS.2004.1291359 |isbn=0-7803-8385-0 |archive-url=https://web.archive.org/web/20120813003941/http://mercury.pr.erau.edu/~davisb22/papers/ispass04.pdf |archive-date=2012-08-13 |access-date=2014-06-09 |url-status=dead}}</ref> two common names are "non-exclusive" and "partially-inclusive".

The advantage of exclusive caches is that they store more data. This advantage is larger when the exclusive L1 cache is comparable in size to the L2 cache, and diminishes if the L2 cache is many times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than just copying a line from L2 to L1, which is what an inclusive cache does (both paths are sketched below).<ref name="ispass04" />

One advantage of strictly inclusive caches is that when external devices or other processors in a multiprocessor system wish to remove a cache line from the processor, they need only have the processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache must be checked as well. As a drawback, there is a correlation between the associativities of the L1 and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the effective associativity of the L1 caches is restricted. Another disadvantage of inclusive caches is that whenever there is an eviction in the L2 cache, the (possibly) corresponding lines in L1 also have to be evicted in order to maintain inclusiveness. This is quite a bit of work, and results in a higher L1 miss rate.<ref name="ispass04" />
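The differing fill and eviction behavior can be sketched as a toy model (Python sets stand in for the caches; victim selection, coherence traffic, and the intermediate non-exclusive policy are all omitted, and the function names are invented for illustration):

<syntaxhighlight lang="python">
# Toy model of the L1-miss / L2-hit fill paths and of inclusive eviction.
# Sets stand in for caches; the victim choice here is arbitrary.

L1_CAPACITY = 4   # illustrative assumption: L1 holds four lines

def inclusive_fill(l1, l2, line):
    """Inclusive hierarchy: copy the hitting line from L2 into L1.
    L2 keeps its copy, so everything in L1 stays present in L2."""
    if len(l1) >= L1_CAPACITY:
        l1.pop()                     # evict an L1 victim (still in L2)
    l1.add(line)

def exclusive_fill(l1, l2, line):
    """Exclusive hierarchy: move the line from L2 into L1 and push an
    L1 victim back into L2 -- the swap that makes this path costlier."""
    l2.remove(line)                  # the line now lives only in L1
    if len(l1) >= L1_CAPACITY:
        l2.add(l1.pop())             # victim migrates down to L2
    l1.add(line)

def inclusive_l2_eviction(l1, l2, line):
    """Back-invalidation: evicting a line from L2 must also evict any
    copy from L1, otherwise inclusion would be violated."""
    l2.discard(line)
    l1.discard(line)
</syntaxhighlight>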
Another advantage of inclusive caches is that the larger cache can use larger cache lines, which reduces the size of the secondary cache tags. (Exclusive caches require both caches to have cache lines of the same size, so that lines can be swapped on an L1 miss and L2 hit.) If the secondary cache is an order of magnitude larger than the primary, and the cache data are an order of magnitude larger than the cache tags, the tag area saved can be comparable to the incremental area needed to store the L1 cache data in the L2.<ref>{{cite web |last1=Jaleel |first1=Aamer |first2=Eric |last2=Borch |last3=Bhandaru |first3=Malini |last4=Steely Jr. |first4=Simon C. |last5=Emer |first5=Joel |date=2010-09-27 |title=Achieving Non-Inclusive Cache Performance with Inclusive Caches |url=http://www.jaleels.org/ajaleel/publications/micro2010-tla.pdf |access-date=2014-06-09 |website=jaleels.org}}</ref>
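A rough back-of-the-envelope check of that area argument (all sizes are assumed example values, not measurements of any real design):

<syntaxhighlight lang="python">
# Back-of-the-envelope check of the tag-area argument above.
# All sizes are assumed example values.

l1_data  = 32 * 1024      # 32 KiB of L1 data
l2_data  = 10 * l1_data   # L2 an order of magnitude larger
tag_frac = 0.1            # tags an order of magnitude smaller than data

l2_tags_small_lines = l2_data * tag_frac        # L2 tag area, L1-sized lines
l2_tags_large_lines = l2_tags_small_lines / 2   # doubling the line size
                                                # halves the number of tags

tag_area_saved = l2_tags_small_lines - l2_tags_large_lines
inclusion_overhead = l1_data    # duplicate copy of L1 data held in L2

# The two quantities come out within a small factor of each other.
print(tag_area_saved, inclusion_overhead)       # 16384.0 vs 32768
</syntaxhighlight>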