===Specialized caches===
Pipelined CPUs access memory from multiple points in the [[Instruction pipeline|pipeline]]: instruction fetch, [[virtual memory|virtual-to-physical]] address translation, and data fetch (see [[classic RISC pipeline]]). The natural design is to use different physical caches for each of these points, so that no one physical resource has to be scheduled to service two points in the pipeline. Thus the pipeline naturally ends up with at least three separate caches (instruction, [[translation lookaside buffer|TLB]], and data), each specialized to its particular role.

====Victim cache====
{{Main article|Victim cache}}
A '''victim cache''' is a cache used to hold blocks evicted from a CPU cache upon replacement. It lies between the main cache and its refill path, holding only those blocks of data that were evicted from the main cache. The victim cache is usually fully associative and is intended to reduce the number of conflict misses. In many commonly used programs, only a small fraction of memory accesses require high associativity; the victim cache exploits this property by providing high associativity for just those accesses. It was introduced by [[Norman Jouppi]] from DEC in 1990.<ref name=Jouppi1990>{{cite conference |last=Jouppi |first=Norman P. |date=May 1990 |title=Improving direct-mapped cache performance by the addition of a small {{Sic|hide=y|fully|-}}associative cache and prefetch buffers |pages=364–373 |book-title=Conference Proceedings of the 17th Annual International Symposium on Computer Architecture |conference=17th Annual International Symposium on Computer Architecture, May 28-31, 1990 |location=Seattle, WA, USA |doi=10.1109/ISCA.1990.134547 }}</ref>

Intel's ''[[Crystalwell]]''<ref name="intel-ark-crystal-well">{{cite web | url = http://ark.intel.com/products/codename/51802/Crystal-Well | title = Products (Formerly Crystal Well) | publisher = [[Intel]] | access-date = 2013-09-15 }}</ref> variant of its [[Haswell (microarchitecture)|Haswell]] processors introduced an on-package 128 MiB [[eDRAM]] Level 4 cache which serves as a victim cache for the processors' Level 3 cache.<ref name="anandtech-i74950hq">{{cite web | url = http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3 | title = Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested | publisher = [[AnandTech]] | access-date = 2013-09-16 }}</ref> In the [[Skylake (microarchitecture)|Skylake]] microarchitecture the Level 4 cache no longer works as a victim cache.<ref>{{cite web |author=Cutress |first=Ian |date=September 2, 2015 |title=The Intel Skylake Mobile and Desktop Launch, with Architecture Analysis |url=http://www.anandtech.com/show/9582/intel-skylake-mobile-desktop-launch-architecture-analysis/5 |publisher=AnandTech}}</ref>

===={{Anchor|TRACE-CACHE}}Trace cache====
{{Main article|Trace cache}}
One of the more extreme examples of cache specialization is the '''trace cache''' (also known as ''execution trace cache'') found in the [[Intel]] [[Pentium 4]] microprocessors.
A trace cache is a mechanism for increasing instruction fetch bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces of [[instruction (computer science)|instruction]]s that have already been fetched and decoded.<ref>{{cite web |author=Shimpi |first=Anand Lal |date=2000-11-20 |title=The Pentium 4's Cache – Intel Pentium 4 1.4 GHz & 1.5 GHz |url=http://www.anandtech.com/show/661/5 |access-date=2015-11-30 |publisher=[[AnandTech]]}}</ref> A trace cache stores instructions either after they have been decoded or as they are retired. Generally, instructions are added to trace caches in groups representing either individual [[basic block]]s or dynamic instruction traces. The Pentium 4's trace cache stores [[micro-operations]] resulting from decoding x86 instructions, also providing the functionality of a micro-operation cache; as a result, the next time an instruction is needed, it does not have to be decoded into micro-ops again.<ref name="agner.org" />{{rp|63–68}}

====Write Coalescing Cache (WCC)====
The Write Coalescing Cache<ref>{{cite web |author=Kanter |first=David |date=August 26, 2010 |title=AMD's Bulldozer Microarchitecture – Memory Subsystem Continued |url=http://www.realworldtech.com/bulldozer/9/ |website=Real World Technologies}}</ref> is a special cache that is part of the L2 cache in [[AMD]]'s [[Bulldozer (microarchitecture)|Bulldozer microarchitecture]]. Stores from both L1D caches in the module go through the WCC, where they are buffered and coalesced. The WCC's task is to reduce the number of writes to the L2 cache.
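The general idea of write coalescing can be illustrated with a short sketch: stores are buffered per cache line, and repeated stores to the same line merge into a single downstream write. This is a minimal model of the technique in general, not AMD's actual design; the class and its behavior are invented for illustration.

```python
LINE_SIZE = 64  # bytes per cache line (a typical value)

class CoalescingWriteBuffer:
    """Toy model of a write-coalescing buffer in front of a lower-level cache."""

    def __init__(self):
        self.pending = {}   # line address -> {offset: byte value}
        self.l2_writes = 0  # writes actually sent downstream

    def store(self, addr, value):
        """Buffer a one-byte store; stores to the same line coalesce."""
        line = addr - (addr % LINE_SIZE)
        self.pending.setdefault(line, {})[addr % LINE_SIZE] = value

    def drain(self):
        """Flush the buffer: one downstream write per dirty line."""
        self.l2_writes += len(self.pending)
        self.pending.clear()

buf = CoalescingWriteBuffer()
for offset in range(8):       # eight stores to the same line...
    buf.store(0x1000 + offset, offset)
buf.store(0x2000, 0xFF)       # ...and one store to a different line
buf.drain()
print(buf.l2_writes)          # 2: nine stores coalesced into two line writes
```

Without coalescing, the nine stores above would each reach the next cache level; with it, only one write per dirty line goes downstream.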
===={{Anchor|UOP-CACHE}}Micro-operation (μop or uop) cache====
A '''micro-operation cache''' ('''μop cache''', '''uop cache''' or '''UC''')<ref>{{cite web |author=Kanter |first=David |date=September 25, 2010 |title=Intel's Sandy Bridge Microarchitecture – Instruction Decode and uop Cache |url=http://www.realworldtech.com/sandy-bridge/4/ |website=Real World Technologies}}</ref> is a specialized cache that stores [[micro-operation]]s of decoded instructions, as received directly from the [[instruction decoder]]s or from the instruction cache. When an instruction needs to be decoded, the μop cache is checked for its decoded form, which is re-used if cached; if it is not available, the instruction is decoded and then cached. One of the early works describing the μop cache as an alternative frontend for the Intel [[P6 (microarchitecture)|P6 processor family]] is the 2001 paper ''"Micro-Operation Cache: A Power Aware Frontend for Variable Instruction Length ISA"''.<ref name="uop-intel">{{cite conference |conference=2001 International Symposium on Low Power Electronics and Design (ISLPED'01), August 6-7, 2001 |location=Huntington Beach, CA, USA |last1=Solomon |first1=Baruch |book-title=ISLPED'01: Proceedings of the 2001 International Symposium on Low Power Electronics and Design |last2=Mendelson |first2=Avi |last3=Orenstein |first3=Doron |last4=Almog |first4=Yoav |last5=Ronen |first5=Ronny |date=August 2001 |publisher=[[Association for Computing Machinery]] |isbn=978-1-58113-371-4 |pages=4–9 |title=Micro-Operation Cache: A Power Aware Frontend for Variable Instruction Length ISA |doi=10.1109/LPE.2001.945363 |access-date=2013-10-06 |url=http://cecs.uci.edu/~papers/compendium94-03/papers/2001/islped01/pdffiles/p004.pdf |s2cid=195859085}}</ref> Later, Intel included μop caches in its [[Sandy Bridge]] processors and in successive microarchitectures like [[Ivy Bridge (microarchitecture)|Ivy Bridge]] and [[Haswell (microarchitecture)|Haswell]].<ref name="agner.org">{{cite web |author=Fog |first=Agner |author-link=Agner Fog |date=2014-02-19 |title=The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers |url=http://www.agner.org/optimize/microarchitecture.pdf |access-date=2014-03-21 |website=agner.org}}</ref>{{rp|121–123}}<ref name="anandtech-haswell">{{cite web |author=Shimpi |first=Anand Lal |date=2012-10-05 |title=Intel's Haswell Architecture Analyzed |url=http://www.anandtech.com/show/6355/intels-haswell-architecture/6 |access-date=2013-10-20 |publisher=[[AnandTech]]}}</ref> AMD implemented a μop cache in their [[Zen (microarchitecture)|Zen microarchitecture]].<ref>{{cite web |author=Cutress |first=Ian |date=2016-08-18 |title=AMD Zen Microarchitecture: Dual Schedulers, Micro-Op Cache and Memory Hierarchy Revealed |url=http://www.anandtech.com/show/10578/amd-zen-microarchitecture-dual-schedulers-micro-op-cache-memory-hierarchy-revealed |access-date=2017-04-03 |publisher=AnandTech}}</ref>

Fetching complete pre-decoded instructions eliminates the need to repeatedly decode variable-length complex instructions into simpler fixed-length micro-operations, and simplifies the process of predicting, fetching, rotating and aligning fetched instructions. A μop cache effectively offloads the fetch and decode hardware, thus decreasing [[power consumption]] and improving the frontend supply of decoded micro-operations. The μop cache also increases performance by more consistently delivering decoded micro-operations to the backend and eliminating various bottlenecks in the CPU's fetch and decode logic.<ref name="uop-intel" /><ref name="anandtech-haswell" /> A μop cache has many similarities with a trace cache, although a μop cache is much simpler, thus providing better power efficiency; this makes it better suited for implementation in battery-powered devices.
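The check-then-decode behavior described above amounts to memoizing the decoder. The following sketch models it in software; the instruction encoding, the two-μop split, and all names here are invented for the example and do not reflect any real decoder.

```python
class UopCache:
    """Toy model of a μop cache: a lookup table keyed by instruction address."""

    def __init__(self):
        self.cache = {}   # instruction address -> decoded micro-ops
        self.decodes = 0  # how many times the (expensive) decoder ran

    def decode_to_uops(self, insn):
        """Stand-in for the hardware decoder; costly in real frontends."""
        self.decodes += 1
        # Pretend every instruction cracks into two micro-ops.
        return tuple(f"{insn}.uop{i}" for i in range(2))

    def fetch(self, addr, insn):
        """Return micro-ops, running the decoder only on a μop-cache miss."""
        if addr not in self.cache:          # miss: decode and fill
            self.cache[addr] = self.decode_to_uops(insn)
        return self.cache[addr]             # hit: decoder is bypassed

uc = UopCache()
for _ in range(100):   # a hot loop re-executes the same instruction
    uops = uc.fetch(0x401000, "add [mem], eax")
print(uc.decodes)      # 1: decoded once, served from the μop cache thereafter
```

The energy argument follows directly: in the loop above the decode work is done once instead of a hundred times, which is why a hit in the μop cache lets the fetch and decode hardware idle.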
The main disadvantage of the trace cache, leading to its power inefficiency, is the hardware complexity required for its [[heuristic]] for deciding which dynamically created instruction traces to cache and reuse.<ref name="tc-slides">{{cite web |last1=Gu |first1=Leon |last2=Motiani |first2=Dipti |date=October 2003 |title=Trace Cache |url=https://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/lectures/TraceCache_slides.pdf |access-date=2013-10-06}}</ref>

====Branch target instruction cache====
A '''branch target cache''' or '''branch target instruction cache''', the name used on [[ARM microprocessors]],<ref>{{cite web |author=Niu |first=Kun |date=28 May 2015 |title=How does the BTIC (branch target instruction cache) work? |url=https://community.arm.com/processors/f/discussions/5320/how-does-the-btic-branch-target-instruction-cache-works |access-date=7 April 2018}}</ref> is a specialized cache which holds the first few instructions at the destination of a taken branch. This is used by low-powered processors which do not need a normal instruction cache, because the memory system is capable of delivering instructions fast enough to satisfy the CPU without one. However, this only applies to consecutive instructions in sequence; it still takes several cycles of latency to restart instruction fetch at a new address, causing a pipeline bubble of a few cycles after a control transfer. A branch target cache provides instructions for those few cycles, avoiding a delay after most taken branches. This allows full-speed operation with a much smaller cache than a traditional full-time instruction cache.

====Smart cache====
'''Smart cache''' is a [[#MULTILEVEL|level 2]] or [[#MULTILEVEL|level 3]] caching method for multiple execution cores, developed by [[Intel]]. Smart Cache shares the actual cache memory between the cores of a [[multi-core processor]].
In comparison to a dedicated per-core cache, the overall [[cache miss]] rate decreases when cores do not require equal parts of the cache space. Consequently, a single core can use the full level 2 or level 3 cache while the other cores are inactive.<ref>{{cite web|url=http://www.intel.com/content/www/us/en/architecture-and-technology/intel-smart-cache.html|title=Intel Smart Cache: Demo|publisher=[[Intel]]|access-date=2012-01-26}}</ref> Furthermore, the shared cache makes it faster to share memory among different execution cores.<ref>{{cite web |url=http://software.intel.com/file/18374/ |archive-url=https://web.archive.org/web/20111229193036/http://software.intel.com/file/18374/ |title=Inside Intel Core Microarchitecture and Smart Memory Access |format=PDF |page=5 |publisher=[[Intel]] |year=2006 |access-date=2012-01-26 |archive-date=2011-12-29 |url-status=dead}}</ref>
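The victim cache discussed earlier in this section is perhaps the easiest of these specialized caches to demonstrate concretely: two addresses that map to the same set of a direct-mapped cache would otherwise evict each other on every access, but a tiny fully associative victim cache absorbs the conflict. The sizes and class below are illustrative only, a toy model of Jouppi's scheme rather than any real hardware.

```python
from collections import deque

NUM_SETS = 4        # direct-mapped main cache with 4 lines
VICTIM_ENTRIES = 2  # tiny fully associative victim cache

class CacheWithVictim:
    """Toy direct-mapped cache backed by a small FIFO victim cache."""

    def __init__(self):
        self.main = [None] * NUM_SETS
        self.victim = deque(maxlen=VICTIM_ENTRIES)  # oldest entry drops off
        self.misses = 0

    def access(self, line_addr):
        idx = line_addr % NUM_SETS
        if self.main[idx] == line_addr:
            return "hit"
        if line_addr in self.victim:      # victim hit: swap the two lines
            self.victim.remove(line_addr)
            if self.main[idx] is not None:
                self.victim.append(self.main[idx])
            self.main[idx] = line_addr
            return "victim hit"
        self.misses += 1                  # true miss: fill from below
        if self.main[idx] is not None:
            self.victim.append(self.main[idx])  # evictee goes to victim cache
        self.main[idx] = line_addr
        return "miss"

cache = CacheWithVictim()
# Line addresses 0 and 4 conflict: both map to set 0 of the main cache.
results = [cache.access(a) for a in (0, 4, 0, 4, 0, 4)]
print(results)  # two cold misses, then the victim cache absorbs every conflict
```

Without the victim cache, all six accesses after the first would be conflict misses; with it, only the two cold misses reach the next level, which is exactly the small-fraction-of-accesses argument made above.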