CPU cache
==Implementation==
{{Main article|Cache algorithms}}
Cache '''reads''' are the most common CPU operation that takes more than a single cycle. Program execution time tends to be very sensitive to the latency of a level-1 data cache hit. A great deal of design effort, and often power and silicon area, is expended to make the caches as fast as possible.

The simplest cache is a virtually indexed direct-mapped cache. The virtual address is calculated with an adder, the relevant portion of the address is extracted and used to index an SRAM, which returns the loaded data. The data are byte-aligned in a byte shifter, and from there are bypassed to the next operation. There is no need for any tag checking in the inner loop{{snd}}in fact, the tags need not even be read. Later in the pipeline, but before the load instruction is retired, the tag for the loaded data must be read and checked against the virtual address to make sure there was a cache hit. On a miss, the cache is updated with the requested cache line and the pipeline is restarted.

An associative cache is more complicated, because some form of tag must be read to determine which entry of the cache to select. An N-way set-associative level-1 cache usually reads all N possible tags and N data in parallel, and then chooses the data associated with the matching tag. Level-2 caches sometimes save power by reading the tags first, so that only one data element is read from the data SRAM.

[[Image:Cache,associative-read.svg|thumb|right|upright=1.6|Read path for a 2-way associative cache]]

The adjacent diagram is intended to clarify the manner in which the various fields of the address are used. Address bit 31 is most significant, bit 0 is least significant. The diagram shows the SRAMs, indexing, and [[multiplexing]] for a 4 KiB, 2-way set-associative, virtually indexed and virtually tagged cache with 64-byte (B) lines, a 32-bit read width and a 32-bit virtual address.
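The address decomposition and read path for this cache can be sketched behaviorally. The following Python model is illustrative only (its names and structure are not taken from any particular design); it derives the field widths from the parameters in the text and performs a tag-matched read:

```python
# Behavioral sketch of the 2-way set-associative read path described
# above: a 4 KiB, virtually indexed, virtually tagged cache with
# 64 B lines and 32-bit virtual addresses. Illustrative model only.
CACHE_BYTES = 4 * 1024
LINE_BYTES = 64
WAYS = 2
ADDR_BITS = 32

OFFSET_BITS = (LINE_BYTES - 1).bit_length()       # 6: address bits [5:0]
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)         # 32 sets (64 lines, 2 ways)
INDEX_BITS = (SETS - 1).bit_length()              # 5: address bits [10:6]
TAG_BITS = ADDR_BITS - INDEX_BITS - OFFSET_BITS   # 21: address bits [31:11]

# Tag SRAM: one row per set, a pair of tags per row.
# Data SRAM: one line per way per set.
tags = [[None] * WAYS for _ in range(SETS)]
data = [[bytes(LINE_BYTES)] * WAYS for _ in range(SETS)]

def split(vaddr):
    """Split a virtual address into (tag, index, offset) fields."""
    offset = vaddr & (LINE_BYTES - 1)
    index = (vaddr >> OFFSET_BITS) & (SETS - 1)
    tag = vaddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def read_byte(vaddr):
    """Read both ways in parallel, then select the way whose tag matches."""
    tag, index, offset = split(vaddr)
    for way in range(WAYS):
        if tags[index][way] == tag:           # tag comparison drives the mux
            return data[index][way][offset]   # hit
    return None                               # miss: refill the line, restart
```

The derived widths match the figures given below: 32 tag rows holding pairs of 21-bit tags, and a 6-bit byte offset within the line.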
Because the cache is 4 KiB and has 64 B lines, there are just 64 lines in the cache, and we read two at a time from a Tag SRAM which has 32 rows, each with a pair of 21-bit tags. Although any function of virtual address bits 31 through 6 could be used to index the tag and data SRAMs, it is simplest to use the least significant bits. Similarly, because the cache is 4 KiB, has a 4 B read path, and reads two ways for each access, the Data SRAM is 512 rows by 8 bytes wide.

A more modern cache might be 16 KiB, 4-way set-associative, virtually indexed, virtually hinted, and physically tagged, with 32 B lines, a 32-bit read width and 36-bit physical addresses. The read path recurrence for such a cache looks very similar to the path above. Instead of tags, virtual hints are read and matched against a subset of the virtual address. Later in the pipeline, the virtual address is translated into a physical address by the TLB, and the physical tag is read (just one, as the virtual hint determines which way of the cache to read). Finally, the physical address is compared to the physical tag to determine whether a hit has occurred.

Some SPARC designs have improved the speed of their L1 caches by a few gate delays by collapsing the virtual address adder into the SRAM decoders. See [[sum-addressed decoder]].

===History===
The early history of cache technology is closely tied to the invention and use of virtual memory.{{Citation needed|date=March 2008}} Because of the scarcity and cost of semiconductor memories, early mainframe computers in the 1960s used a complex hierarchy of physical memory, mapped onto a flat virtual memory space used by programs. The memory technologies spanned semiconductor, magnetic core, drum and disc.
Virtual memory as seen and used by programs would be flat, and caching would be used to fetch data and instructions into the fastest memory ahead of processor access. Extensive studies were done to optimize cache sizes. Optimal values were found to depend greatly on the programming language used, with Algol needing the smallest and Fortran and Cobol needing the largest cache sizes.{{Disputed inline|Talk:CPU cache#Dispute sequence of events for paging|reason=sequence of events wrong|date=December 2010}}

In the early days of microcomputer technology, memory access was only slightly slower than [[processor register|register]] access. But since the 1980s<ref>{{cite journal | url=https://epic.hpi.uni-potsdam.de/pub/Home/TrendsAndConceptsII2010/HW_Trends_The_Processor-Memory_bottleneck___Problems_and_Solutions..pdf | title=The processor-memory bottleneck: problems and solutions | journal=Crossroads | volume=5 | issue=3es | pages=2–es | first1=Nihar R. | last1=Mahapatra | first2=Balakrishna | last2=Venkatrao | access-date=2013-03-05 | doi=10.1145/357783.331677 | year=1999 | s2cid=11557476 | url-status=dead | archive-url=https://web.archive.org/web/20140305193233/https://epic.hpi.uni-potsdam.de/pub/Home/TrendsAndConceptsII2010/HW_Trends_The_Processor-Memory_bottleneck___Problems_and_Solutions..pdf | archive-date=2014-03-05 }}</ref> the performance gap between processor and memory has been growing. Microprocessors have advanced much faster than memory, especially in terms of their operating [[frequency]], so memory became a performance [[Von Neumann architecture#Von Neumann bottleneck|bottleneck]]. While it was technically possible to have all the main memory as fast as the CPU, a more economically viable path was taken: use plenty of low-speed memory, but also introduce a small high-speed cache memory to alleviate the performance gap. This provided an order of magnitude more capacity for the same price, with only slightly reduced combined performance.
====First TLB implementations====
The first documented uses of a TLB were on the [[GE 645]]<ref>{{cite book | publisher = [[General Electric]] | title = GE-645 System Manual | date = January 1968 | url = http://bitsavers.org/pdf/ge/GE-645/LSB0468_GE-645_System_Manual_Jan1968.pdf | access-date = 2020-07-10 }}</ref> and the [[IBM]] [[IBM System/360 Model 67|360/67]],<ref>{{cite book |publisher = [[IBM]] |title = IBM System/360 Model 67 Functional Characteristics |id = GA27-2719-2 |url = http://www.bitsavers.org/pdf/ibm/360/functional_characteristics/GA27-2719-2_360-67_funcChar.pdf |version = Third Edition |date = February 1972 }}</ref> both of which used an associative memory as a TLB.

====First instruction cache====
The first documented use of an instruction cache was on the [[CDC 6600]].<ref>{{cite conference |author=Thornton |first=James E. |book-title=Proceedings of the October 27–29, 1964, fall joint computer conference, part II: very high speed computer systems |date=October 1964 |title=Parallel operation in the control data 6600 |url=https://cs.uwaterloo.ca/~mashti/cs850-f18/papers/cdc6600.pdf}}</ref>

====First data cache====
The first documented use of a data cache was on the [[IBM]] System/360 Model 85.<ref>{{cite book |author=IBM |url=http://www.bitsavers.org/pdf/ibm/360/functional_characteristics/A22-6916-1_360-85_funcChar_Jun68.pdf |title=IBM System/360 Model 85 Functional Characteristics |date=June 1968 |edition=2nd |language=en-us |id=A22-6916-1}}</ref>

====In 68k microprocessors====
The [[68010]], released in 1982, has a "loop mode" which can be considered a tiny and special-case instruction cache that accelerates loops that consist of only two instructions. The [[68020]], released in 1984, replaced that with a typical instruction cache of 256 bytes, being the first 68k series processor to feature true on-chip cache memory.
The [[68030]], released in 1987, is basically a 68020 core with an additional 256-byte data cache, an on-chip [[memory management unit]] (MMU), a process shrink, and added burst mode for the caches. The [[Motorola 68040|68040]], released in 1990, has split instruction and data caches of four kilobytes each. The [[68060]], released in 1994, has the following: 8 KiB data cache (four-way associative), 8 KiB instruction cache (four-way associative), 96-byte FIFO instruction buffer, 256-entry branch cache, and 64-entry address translation cache MMU buffer (four-way associative).

====In x86 microprocessors====
[[File:Motherboard Intel 386.jpg|thumb|upright=1.2|Example of a motherboard with an [[i386]] microprocessor (33 MHz), 64 KiB cache (25 ns; 8 chips in the bottom left corner), 2 MiB DRAM (70 ns; 8 [[SIMM]]s to the right of the cache), and a cache controller ([[Austek Microsystems|Austek]] A38202; to the right of the processor)]]
As the [[x86]] microprocessors reached clock rates of 20 MHz and above in the [[Intel 80386|386]], small amounts of fast cache memory began to be featured in systems to improve performance. This was because the [[DRAM]] used for main memory had significant latency, up to 120 ns, as well as refresh cycles. The cache was constructed from more expensive, but significantly faster, [[static random-access memory|SRAM]] [[Memory cell (computing)|memory cells]], which at the time had latencies around 10–25 ns. The early caches were external to the processor and typically located on the motherboard in the form of eight or nine [[Dual in-line package|DIP]] devices placed in sockets to enable the cache as an optional extra or upgrade feature.

Some versions of the Intel 386 processor could support 16 to 256 KiB of external cache. With the [[Intel 80486|486]] processor, an 8 KiB cache was integrated directly into the CPU die. This cache was termed Level 1 or L1 cache to differentiate it from the slower on-motherboard, or Level 2 (L2), cache.
These on-motherboard caches were much larger, with the most common size being 256 KiB. Some system boards contained sockets for the Intel 485Turbocache [[Expansion_card#Daughterboard|daughtercard]], which had either 64 or 128 KiB of cache memory.<ref>Chen, Allan, "The 486 CPU: ON A High-Performance Flight Vector", Intel Corporation, Microcomputer Solutions, November/December 1990, p. 2</ref><ref>Reilly, James, Kheradpir, Shervin, "An Overview of High-performance Hardware Design Using the 486 CPU", Intel Corporation, Microcomputer Solutions, November/December 1990, p. 20</ref> The popularity of on-motherboard cache continued through the [[Intel P5|Pentium MMX]] era, but it was made obsolete by the introduction of [[SDRAM]] and the growing disparity between bus clock rates and CPU clock rates, which left on-motherboard cache only slightly faster than main memory.

The next development in cache implementation in the x86 microprocessors began with the [[Pentium Pro]], which brought the secondary cache onto the same package as the microprocessor, clocked at the same frequency as the microprocessor.

On-motherboard caches enjoyed prolonged popularity thanks to the [[AMD K6-2]] and [[AMD K6-III]] processors that still used [[Socket 7]], which was previously used by Intel with on-motherboard caches. The K6-III included 256 KiB of on-die L2 cache and took advantage of the on-board cache as a third-level cache, named L3 (motherboards with up to 2 MiB of on-board cache were produced). After Socket 7 became obsolete, on-motherboard cache disappeared from x86 systems. Three-level caches were used again with the introduction of multiple processor cores, where the L3 cache was added to the CPU die.
Total cache sizes have tended to grow in newer processor generations, and as of 2011 it was not uncommon to find Level 3 cache sizes of tens of megabytes.<ref>{{cite web | url = http://ark.intel.com/products/family/59139/Intel-Xeon-Processor-E7-Family/server | title = Intel Xeon Processor E7 Family | work = Intel® ARK (Product Specs) | access-date = 2013-10-10 | publisher = [[Intel]] }}</ref>

[[Intel]] introduced a Level 4 on-package cache with the [[Haswell (microarchitecture)|Haswell]] [[microarchitecture]]. ''[[Crystalwell]]''<ref name="intel-ark-crystal-well" /> Haswell CPUs, equipped with the [[GT3e]] variant of Intel's integrated Iris Pro graphics, effectively feature 128 MiB of embedded DRAM ([[eDRAM]]) on the same package. This L4 cache is shared dynamically between the on-die GPU and CPU, and serves as a [[victim cache]] for the CPU's L3 cache.<ref name="anandtech-i74950hq" />

====In ARM microprocessors====
The [[Apple M1]] CPU has a 128 or 192 KiB L1 instruction cache for each core, depending on core type; L1 capacity is important for latency and single-thread performance. This is an unusually large L1 cache for any CPU type, not just for a laptop, while the total cache memory size is not unusually large for a laptop (total capacity matters more for throughput); much larger total (e.g. L3 or L4) sizes are available in IBM's mainframes.

====Current research====
Early cache designs focused entirely on the direct cost of cache and [[random-access memory|RAM]] and average execution speed.
More recent cache designs also consider [[low-power electronics|energy efficiency]], fault tolerance, and other goals.<ref>{{cite journal |url=https://spectrum.ieee.org/chip-design-thwarts-sneak-attack-on-data |title=Chip Design Thwarts Sneak Attack on Data |author=Sally Adee |date=November 2009 |journal=[[IEEE Spectrum]] |volume=46 |issue=11 |page=16 |doi=10.1109/MSPEC.2009.5292036 |s2cid=43892134 |url-access=subscription }}</ref><ref>{{cite conference |last1=Wang |first1=Zhenghong |last2=Lee |first2=Ruby B. |date=November 8–12, 2008 |title=A novel cache architecture with enhanced performance and security |url=http://palms.princeton.edu/system/files/Micro08_Newcache.pdf |conference=41st annual IEEE/ACM International Symposium on Microarchitecture |pages=83–93 |archive-url=https://web.archive.org/web/20120306225926/http://palms.princeton.edu/system/files/Micro08_Newcache.pdf |archive-date=March 6, 2012 |url-status=live}}</ref>

Several tools are available to computer architects to help explore tradeoffs between cache cycle time, energy, and area; the CACTI cache simulator<ref>{{cite web|url=https://www.hpl.hp.com/research/cacti/ |title=CACTI |website=HP Labs |access-date=2023-01-29}}</ref> and the SimpleScalar instruction set simulator are two open-source options.

===Multi-ported cache===
A multi-ported cache is a cache which can serve more than one request at a time. When accessing a traditional cache we normally use a single memory address, whereas in a multi-ported cache we may request N addresses at a time{{snd}}where N is the number of ports connecting the processor and the cache. One benefit is that a pipelined processor may access memory from different phases in its pipeline. Another benefit is that it supports superscalar processors, which may issue more than one memory access per cycle.
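The idea of serving N requests per cycle can be sketched with a toy behavioral model in Python. Everything here is hypothetical and purely illustrative: a real design provides extra ports by multi-porting, replicating, or banking the SRAM arrays, which this model does not attempt to show.

```python
# Toy behavioral model of an N-ported cache: up to N independent
# read requests are serviced in the same "cycle". Illustrative only;
# the class and member names are hypothetical.
class MultiPortedCache:
    def __init__(self, ports):
        self.ports = ports
        self.lines = {}          # address -> data (idealized: contents preloaded)

    def cycle(self, addresses):
        """Service up to `ports` read requests concurrently."""
        if len(addresses) > self.ports:
            raise ValueError("more requests than ports")
        return [self.lines.get(a) for a in addresses]
```

With two ports, for example, a load in the memory stage of the pipeline and an instruction fetch could be served in the same cycle.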