CPU cache
===History===
The early history of cache technology is closely tied to the invention and use of virtual memory.{{Citation needed|date=March 2008}} <!-- this *is* a truly interesting observation, but are there any sources? Also why was the "CPU-mam speed" part deleted? ~~~~ --> Because of the scarcity and cost of semiconductor memory, early mainframe computers in the 1960s used a complex hierarchy of physical memory, mapped onto a flat virtual memory space used by programs. The memory technologies spanned semiconductor, magnetic core, drum and disc. The virtual memory seen and used by programs was flat, and caching was used to fetch data and instructions into the fastest memory ahead of processor access. Extensive studies were done to optimize the cache sizes. Optimal values were found to depend greatly on the programming language used, with Algol needing the smallest and Fortran and Cobol needing the largest cache sizes.{{Disputed inline|Talk:CPU cache#Dispute sequence of events for paging|reason=sequence of events wrong|date=December 2010}}

In the early days of microcomputer technology, memory access was only slightly slower than [[processor register|register]] access. But since the 1980s<ref>{{cite journal | url=https://epic.hpi.uni-potsdam.de/pub/Home/TrendsAndConceptsII2010/HW_Trends_The_Processor-Memory_bottleneck___Problems_and_Solutions..pdf | title=The processor-memory bottleneck: problems and solutions | journal=Crossroads | volume=5 | issue=3es | pages=2–es | first1=Nihar R. | last1=Mahapatra | first2=Balakrishna | last2=Venkatrao | access-date=2013-03-05 | doi=10.1145/357783.331677 | year=1999 | s2cid=11557476 | url-status=dead | archive-url=https://web.archive.org/web/20140305193233/https://epic.hpi.uni-potsdam.de/pub/Home/TrendsAndConceptsII2010/HW_Trends_The_Processor-Memory_bottleneck___Problems_and_Solutions..pdf | archive-date=2014-03-05 }}</ref> the performance gap between processor and memory has been growing.
Microprocessors have advanced much faster than memory, especially in terms of their operating [[frequency]], so memory became a performance [[Von Neumann architecture#Von Neumann bottleneck|bottleneck]]. While it was technically possible to make all the main memory as fast as the CPU, a more economically viable path was taken: use plenty of low-speed memory, but also introduce a small high-speed cache memory to alleviate the performance gap. This provided an order of magnitude more capacity for the same price, with only slightly reduced combined performance.

====First TLB implementations====
The first documented uses of a TLB were on the [[GE 645]]<ref>{{cite book | publisher = [[General Electric]] | title = GE-645 System Manual | date = January 1968 | url = http://bitsavers.org/pdf/ge/GE-645/LSB0468_GE-645_System_Manual_Jan1968.pdf | access-date = 2020-07-10 }}</ref> and the [[IBM]] [[IBM System/360 Model 67|360/67]],<ref>{{cite book |publisher = [[IBM]] |title = IBM System/360 Model 67 Functional Characteristics |id = GA27-2719-2 |url = http://www.bitsavers.org/pdf/ibm/360/functional_characteristics/GA27-2719-2_360-67_funcChar.pdf |version = Third Edition |date = February 1972 }}</ref> both of which used an associative memory as a TLB.

====First instruction cache====
The first documented use of an instruction cache was on the [[CDC 6600]].<ref>{{cite conference |last=Thornton |first=James E. |book-title=Proceedings of the October 27–29, 1964, fall joint computer conference, part II: very high speed computer systems |date=October 1964 |title=Parallel operation in the control data 6600 |url=https://cs.uwaterloo.ca/~mashti/cs850-f18/papers/cdc6600.pdf}}</ref>

====First data cache====
The first documented use of a data cache was on the [[IBM]] System/360 Model 85.<ref>{{cite book |author=IBM |url=http://www.bitsavers.org/pdf/ibm/360/functional_characteristics/A22-6916-1_360-85_funcChar_Jun68.pdf |title=IBM System/360 Model 85 Functional Characteristics |date=June 1968 |edition=2nd |language=en-us |id=A22-6916-1}}</ref>

====In 68k microprocessors====
The [[68010]], released in 1982, has a "loop mode" which can be considered a tiny, special-case instruction cache that accelerates loops consisting of only two instructions. The [[68020]], released in 1984, replaced that with a typical instruction cache of 256 bytes, making it the first 68k-series processor to feature true on-chip cache memory. The [[68030]], released in 1987, is basically a 68020 core with an additional 256-byte data cache, an on-chip [[memory management unit]] (MMU), a process shrink, and added burst mode for the caches. The [[Motorola 68040|68040]], released in 1990, has split instruction and data caches of four kilobytes each. The [[68060]], released in 1994, has the following: 8 KiB data cache (four-way associative), 8 KiB instruction cache (four-way associative), 96-byte FIFO instruction buffer, 256-entry branch cache, and 64-entry address translation cache MMU buffer (four-way associative).
====In x86 microprocessors====
[[File:Motherboard Intel 386.jpg|thumb|upright=1.2|Example of a motherboard with an [[i386]] microprocessor (33 MHz), 64 KiB cache (25 ns; 8 chips in the bottom left corner), 2 MiB DRAM (70 ns; 8 [[SIMM]]s to the right of the cache), and a cache controller ([[Austek Microsystems|Austek]] A38202; to the right of the processor)]]
As the [[x86]] microprocessors reached clock rates of 20 MHz and above in the [[Intel 80386|386]], small amounts of fast cache memory began to be featured in systems to improve performance. This was because the [[DRAM]] used for main memory had significant latency, up to 120 ns, as well as refresh cycles. The cache was constructed from more expensive, but significantly faster, [[static random-access memory|SRAM]] [[Memory cell (computing)|memory cells]], which at the time had latencies of around 10–25 ns. The early caches were external to the processor and typically located on the motherboard in the form of eight or nine [[Dual in-line package|DIP]] devices placed in sockets, so that the cache could be offered as an optional extra or upgrade feature. Some versions of the Intel 386 processor could support 16 to 256 KiB of external cache.

With the [[Intel 80486|486]] processor, an 8 KiB cache was integrated directly into the CPU die. This cache was termed Level 1 (L1) cache to differentiate it from the slower on-motherboard, or Level 2 (L2), cache. These on-motherboard caches were much larger, with the most common size being 256 KiB. Some system boards contained sockets for the Intel 485Turbocache [[Expansion_card#Daughterboard|daughtercard]], which had either 64 or 128 KiB of cache memory.<ref>Chen, Allan, "The 486 CPU: ON A High-Performance Flight Vector", Intel Corporation, Microcomputer Solutions, November/December 1990, p. 2</ref><ref>Reilly, James; Kheradpir, Shervin, "An Overview of High-performance Hardware Design Using the 486 CPU", Intel Corporation, Microcomputer Solutions, November/December 1990, p. 20</ref> The popularity of on-motherboard cache continued through the [[Intel P5|Pentium MMX]] era but was made obsolete by the introduction of [[SDRAM]] and the growing disparity between bus clock rates and CPU clock rates, which left on-motherboard cache only slightly faster than main memory.

The next development in cache implementation in the x86 microprocessors began with the [[Pentium Pro]], which brought the secondary cache onto the same package as the microprocessor, clocked at the same frequency as the microprocessor.

On-motherboard caches enjoyed prolonged popularity thanks to the [[AMD K6-2]] and [[AMD K6-III]] processors that still used [[Socket 7]], which Intel had previously used with on-motherboard caches. The K6-III included 256 KiB of on-die L2 cache and took advantage of the on-board cache as a third-level cache, named L3 (motherboards with up to 2 MiB of on-board cache were produced). After Socket 7 became obsolete, on-motherboard cache disappeared from x86 systems.

Three-level caches were used again with the introduction of multiple processor cores, where the L3 cache was added to the CPU die. Total cache sizes have become increasingly larger in newer processor generations, and recently (as of 2011) it is not uncommon to find Level 3 cache sizes of tens of megabytes.<ref>{{cite web | url = http://ark.intel.com/products/family/59139/Intel-Xeon-Processor-E7-Family/server | title = Intel Xeon Processor E7 Family | work = Intel® ARK (Product Specs) | access-date = 2013-10-10 | publisher = [[Intel]] }}</ref>

[[Intel]] introduced a Level 4 on-package cache with the [[Haswell (microarchitecture)|Haswell]] [[microarchitecture]].
''[[Crystalwell]]''<ref name="intel-ark-crystal-well" /> Haswell CPUs, equipped with the [[GT3e]] variant of Intel's integrated Iris Pro graphics, effectively feature 128 MiB of embedded DRAM ([[eDRAM]]) on the same package. This L4 cache is shared dynamically between the on-die GPU and CPU, and serves as a [[victim cache]] for the CPU's L3 cache.<ref name="anandtech-i74950hq" />

====In ARM microprocessors====
The [[Apple M1]] CPU has a 128 or 192 KiB L1 instruction cache for each core (important for latency and single-thread performance), depending on core type. This is an unusually large L1 cache for any CPU type, not just for a laptop; however, the total cache size is not unusually large for a laptop (the total is more important for throughput), and much larger total (e.g. L3 or L4) caches are available in IBM's mainframes.

====Current research====
Early cache designs focused entirely on the direct cost of cache and [[random-access memory|RAM]] and on average execution speed. More recent cache designs also consider [[low-power electronics|energy efficiency]], fault tolerance, and other goals.<ref>{{cite journal |url=https://spectrum.ieee.org/chip-design-thwarts-sneak-attack-on-data |title=Chip Design Thwarts Sneak Attack on Data |author=Sally Adee |date=November 2009 |journal=[[IEEE Spectrum]] |volume=46 |issue=11 |page=16 |doi=10.1109/MSPEC.2009.5292036 |s2cid=43892134 |url-access=subscription }}</ref><ref>{{cite conference |last1=Wang |first1=Zhenghong |last2=Lee |first2=Ruby B. |date=November 8–12, 2008 |title=A novel cache architecture with enhanced performance and security |url=http://palms.princeton.edu/system/files/Micro08_Newcache.pdf |conference=41st annual IEEE/ACM International Symposium on Microarchitecture |pages=83–93 |archive-url=https://web.archive.org/web/20120306225926/http://palms.princeton.edu/system/files/Micro08_Newcache.pdf |archive-date=March 6, 2012 |url-status=live}}</ref>

Several tools are available to computer architects to help explore tradeoffs between cache cycle time, energy, and area; the CACTI cache simulator<ref>{{cite web|url=https://www.hpl.hp.com/research/cacti/ |title=CACTI |website=HP Labs |access-date=2023-01-29}}</ref> and the SimpleScalar instruction set simulator are two open-source options.
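The economic tradeoff described in this section (a small fast SRAM cache in front of plentiful slow DRAM) can be illustrated with the classic average-memory-access-time (AMAT) model. The following sketch is not from the article: the latencies are the illustrative 1980s figures quoted above (SRAM ~25 ns, DRAM ~120 ns), and the 95% hit rate is an assumed, typical value.

```python
# Sketch of the average-memory-access-time (AMAT) model:
# AMAT = hit time + miss rate * miss penalty.
# Latencies follow the 1980s figures quoted in the text (SRAM ~25 ns,
# DRAM ~120 ns); the 95% hit rate is assumed for illustration.

def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: hit time plus the expected miss cost."""
    return hit_time_ns + miss_rate * miss_penalty_ns

with_cache = amat(25.0, 0.05, 120.0)   # SRAM hit; 5% of accesses go to DRAM
without_cache = 120.0                  # every access pays full DRAM latency

print(with_cache, without_cache)       # roughly 31 ns vs 120 ns
```

Even with most of the memory being slow DRAM, the system behaves nearly as fast as the expensive SRAM, which is the "order of magnitude more capacity for the same price" argument made earlier in the section.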