=={{Anchor|ADDRTRANS}}Address translation==

Most general-purpose CPUs implement some form of [[virtual memory]]. To summarize, either each program running on the machine sees its own simplified [[address space]], which contains code and data for that program only, or all programs run in a common virtual address space. A program executes by calculating, comparing, reading and writing to addresses of its virtual address space, rather than addresses of the physical address space, making programs simpler and thus easier to write.

Virtual memory requires the processor to translate virtual addresses generated by the program into physical addresses in main memory. The portion of the processor that does this translation is known as the [[memory management unit]] (MMU). The fast path through the MMU can perform those translations stored in the [[translation lookaside buffer]] (TLB), which is a cache of mappings from the operating system's [[page table]], segment table, or both.

For the purposes of the present discussion, there are three important features of address translation:
* ''Latency:'' The physical address is available from the MMU some time, perhaps a few cycles, after the virtual address is available from the address generator.
* {{Anchor|Virtual aliasing}}''Aliasing:'' Multiple virtual addresses can map to a single physical address. Most processors guarantee that all updates to that single physical address will happen in program order. To deliver on that guarantee, the processor must ensure that only one copy of a physical address resides in the cache at any given time.
* ''Granularity:'' The virtual address space is broken up into pages. For instance, a 4 [[Gibibyte|GiB]] virtual address space might be cut up into 1,048,576 pages of 4 KiB size, each of which can be independently mapped. There may be multiple page sizes supported; see [[virtual memory]] for elaboration. (A worked split of an address into page number and offset appears in the sketch below.)

One early virtual memory system, the [[IBM M44/44X]], required an access to a mapping table held in [[core memory]] before every programmed access to main memory.<ref>{{cite conference |last=O'Neill |first=R. W. |title=Experience using a time sharing multiprogramming system with dynamic address relocation hardware |conference=Proc. AFIPS Computer Conference 30 (Spring Joint Computer Conference, 1967) |pages=611–621 |doi=10.1145/1465482.1465581}}</ref>{{refn|The very first paging machine, the [[Ferranti]] [[Atlas Computer (Manchester)|Atlas]],<ref name=AtlasCPU/><ref name=AtlasSup/> had no page tables in main memory; there was an associative memory with one entry for every 512-word page frame of core.|group=NB}} With no caches, and with the mapping-table memory running at the same speed as main memory, this effectively cut the speed of memory access in half. Two early machines that used a [[page table]] in main memory for mapping, the [[IBM System/360 Model 67]] and the [[GE 645]], both had a small associative memory as a cache for accesses to the in-memory page table. Both machines predated the first machine with a cache for main memory, the [[IBM System/360 Model 85]], so the first hardware cache used in a computer system was not a data or instruction cache, but rather a TLB.
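As a deliberately simplified illustration of the granularity feature above, the following C sketch splits a 32-bit virtual address into a virtual page number and a page offset for 4 KiB pages. The translation step is faked with an invented page-frame number; in real hardware that lookup is what the page table provides and the TLB caches.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

/* With 4 KiB pages, the low 12 bits of an address (the page offset)
 * pass through translation unchanged; only the page number is mapped. */
#define PAGE_SHIFT 12u
#define PAGE_SIZE  (1u << PAGE_SHIFT)          /* 4096 bytes */
#define NUM_PAGES  (1u << (32u - PAGE_SHIFT))  /* 1,048,576 pages in 4 GiB */

int main(void) {
    uint32_t vaddr  = 0x00402ABCu;             /* example virtual address */
    uint32_t vpn    = vaddr >> PAGE_SHIFT;     /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1); /* byte within the page */

    /* Invented mapping for the example; a real MMU would consult the
     * TLB and, on a miss, the in-memory page table. */
    uint32_t pfn   = vpn + 0x100u;             /* pretend page-frame number */
    uint32_t paddr = (pfn << PAGE_SHIFT) | offset;

    printf("vpn=0x%05x offset=0x%03x -> paddr=0x%08x (%u pages total)\n",
           vpn, offset, paddr, NUM_PAGES);
    return 0;
}
</syntaxhighlight>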
Caches can be divided into four types, based on whether the index and tag correspond to physical or virtual addresses:
* ''Physically indexed, physically tagged'' (PIPT) caches use the physical address for both the index and the tag. While this is simple and avoids problems with aliasing, it is also slow, as the physical address must be looked up (which could involve a TLB miss and access to main memory) before that address can be looked up in the cache.
* ''Virtually indexed, virtually tagged'' (VIVT) caches use the virtual address for both the index and the tag. This caching scheme can result in much faster lookups, since the MMU does not need to be consulted first to determine the physical address for a given virtual address. However, VIVT suffers from aliasing problems, where several different virtual addresses may refer to the same physical address. The result is that such addresses would be cached separately despite referring to the same memory, causing coherency problems. Although solutions to this problem exist,<ref>{{cite conference|last1=Kaxiras|first1=Stefanos|last2=Ros|first2=Alberto|title=A New Perspective for Efficient Virtual-Cache Coherence|conference=40th International Symposium on Computer Architecture (ISCA)|date=2013|pages=535–547|doi=10.1145/2485922.2485968|isbn=9781450320795|citeseerx=10.1.1.307.9125|s2cid=15434231}}</ref> they do not work for standard coherence protocols. Another problem is homonyms, where the same virtual address maps to several different physical addresses. It is not possible to distinguish these mappings merely by looking at the virtual index itself, though potential solutions include: flushing the cache after a [[context switch]], forcing address spaces to be non-overlapping, or tagging the virtual address with an address space ID (ASID). Additionally, there is a problem that virtual-to-physical mappings can change, which would require flushing cache lines, as the VAs would no longer be valid. All these issues are absent if tags use physical addresses (VIPT).
* ''Virtually indexed, physically tagged'' (VIPT) caches use the virtual address for the index and the physical address in the tag. The advantage over PIPT is lower latency, as the cache line can be looked up in parallel with the TLB translation; however, the tag cannot be compared until the physical address is available. The advantage over VIVT is that since the tag has the physical address, the cache can detect homonyms. Theoretically, VIPT requires more tag bits because some of the index bits could differ between the virtual and physical addresses (for example bit 12 and above for 4 KiB pages) and would have to be included both in the virtual index and in the physical tag. In practice this is not an issue because, in order to avoid coherency problems, VIPT caches are designed to have no such index bits (e.g., by limiting the total number of bits for the index and the block offset to 12 for 4 KiB pages); this limits the size of VIPT caches to the page size times the associativity of the cache (see the sketch after this list).
* ''Physically indexed, virtually tagged'' (PIVT) caches are often claimed in the literature to be useless and non-existent.<ref>{{cite magazine |url=http://www.linuxjournal.com/article/7105 |last=Bottomley |first=James |title=Understanding Caching |magazine=Linux Journal |date=2004 |access-date=2010-05-02}}</ref> However, the [[MIPS architecture|MIPS]] [[R6000]] uses this cache type as the sole known implementation.<ref>{{cite journal|first1=George|last1=Taylor|first2=Peter|last2=Davies|first3=Michael|last3=Farmwald|title=The TLB Slice – A Low-Cost High-Speed Address Translation Mechanism|journal=ACM SIGARCH Computer Architecture News|volume=18|issue=2SI|pages=355–363|doi=10.1145/325096.325161|year=1990}}</ref> The R6000 is implemented in [[emitter-coupled logic]], an extremely fast technology not suitable for large memories such as a [[Translation lookaside buffer|TLB]]. The R6000 solves the issue by putting the TLB memory into a reserved part of the second-level cache and keeping a tiny, high-speed TLB "slice" on chip. The cache is indexed by the physical address obtained from the TLB slice. However, since the TLB slice translates only those virtual address bits that are necessary to index the cache and does not use any tags, false cache hits may occur, which is solved by tagging with the virtual address.
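The size limit in the VIPT item above is simple arithmetic: the cache is alias-free exactly when the set index and block offset together fit inside the page offset, so the cache size per way cannot exceed the page size. A minimal C sketch, using example parameters rather than any particular CPU's:

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

/* A VIPT cache avoids index aliasing when every index bit is also a
 * page-offset bit, i.e. cache_size / ways <= page_size. The parameter
 * values below are examples only. */
static int vipt_alias_free(uint32_t cache_bytes, uint32_t ways,
                           uint32_t page_bytes) {
    return cache_bytes / ways <= page_bytes;   /* bytes per way vs. page */
}

int main(void) {
    /* 32 KiB, 8-way, 4 KiB pages: 32 KiB / 8 = 4 KiB per way, so the
     * index and block offset fit in the 12 page-offset bits. */
    printf("32 KiB 8-way: %s\n",
           vipt_alias_free(32u * 1024u, 8u, 4096u) ? "alias-free" : "aliased");
    /* 64 KiB 8-way would need index bit 12, which translation can change. */
    printf("64 KiB 8-way: %s\n",
           vipt_alias_free(64u * 1024u, 8u, 4096u) ? "alias-free" : "aliased");
    return 0;
}
</syntaxhighlight>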
The speed of this recurrence (the ''load latency'') is crucial to CPU performance, and so most modern level-1 caches are virtually indexed, which at least allows the MMU's TLB lookup to proceed in parallel with fetching the data from the cache RAM.

But virtual indexing is not the best choice for all cache levels. The cost of dealing with virtual aliases grows with cache size, and as a result most level-2 and larger caches are physically indexed.

Caches have historically used both virtual and physical addresses for the cache tags, although virtual tagging is now uncommon. If the TLB lookup can finish before the cache RAM lookup, then the physical address is available in time for tag compare, and there is no need for virtual tagging. Large caches, then, tend to be physically tagged, and only small, very low-latency caches are virtually tagged. In recent general-purpose CPUs, virtual tagging has been superseded by virtual hints, as described below.

===Homonym and synonym problems===

A cache that relies on virtual indexing and tagging becomes inconsistent after the same virtual address is mapped into different physical addresses ([[homonym]]), which can be solved by using the physical address for tagging, or by storing the address space identifier in the cache line. However, the latter approach does not help against the [[synonym]] problem, in which several cache lines end up storing data for the same physical address. Writing to such locations may update only one location in the cache, leaving the others with inconsistent data. This issue may be solved by using non-overlapping memory layouts for different address spaces; otherwise, the cache (or a part of it) must be flushed when the mapping changes.<ref>{{cite web |last1=Roscoe |first1=Timothy |last2=Baumann |first2=Andrew |date=2009-03-03 |title=Advanced Operating Systems Caches and TLBs (263-3800-00L) |url=http://www.systems.ethz.ch/education/courses/fs09/aos/lectures/wk3-print.pdf |url-status=dead |archive-url=https://web.archive.org/web/20111007150424/http://www.systems.ethz.ch/education/past-courses/fs09/aos/lectures/wk3-print.pdf |archive-date=2011-10-07 |access-date=2016-02-14 |website=systems.ethz.ch}}</ref>
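The synonym problem can be made concrete with a toy simulation. In the C sketch below, every 64-byte virtual page is mapped to one shared physical page, and the virtually indexed, virtually tagged cache is larger than a page, so two synonymous virtual addresses occupy different cache lines; a write through one alias is then invisible through the other. All sizes and the mapping are invented for the example.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy VIVT direct-mapped cache: 8 lines of 16 bytes (128 bytes total),
 * larger than the 64-byte page, so synonyms can land in distinct lines. */
enum { NLINES = 8, LINE = 16, PAGE = 64 };

struct cline { int valid; uint32_t vtag; uint8_t data[LINE]; };
static struct cline cache[NLINES];
static uint8_t phys_mem[PAGE];

/* Invented translation: every virtual page maps to the one physical page. */
static uint32_t translate(uint32_t vaddr) { return vaddr % PAGE; }

static uint8_t *access_byte(uint32_t vaddr) {
    uint32_t idx  = (vaddr / LINE) % NLINES;   /* virtual index */
    uint32_t vtag = vaddr / (LINE * NLINES);   /* virtual tag */
    struct cline *l = &cache[idx];
    if (!l->valid || l->vtag != vtag) {        /* miss: fill from memory */
        l->valid = 1;
        l->vtag  = vtag;
        memcpy(l->data, &phys_mem[translate(vaddr) & ~(uint32_t)(LINE - 1)],
               LINE);
    }
    return &l->data[vaddr % LINE];
}

int main(void) {
    uint32_t a = 0x10, b = 0x50;   /* synonyms: both map to physical 0x10 */
    *access_byte(a) = 42;          /* write lands in the line indexed by a */
    /* b indexes a different line, so the write through a is invisible: */
    printf("read via synonym b: %d\n", *access_byte(b));   /* prints 0 */
    return 0;
}
</syntaxhighlight>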
===Virtual tags and hints===

The great advantage of virtual tags is that, for associative caches, they allow the tag match to proceed before the virtual-to-physical translation is done. However, coherence probes and evictions present a physical address for action. The hardware must have some means of converting the physical addresses into a cache index, generally by storing physical tags as well as virtual tags. For comparison, a physically tagged cache does not need to keep virtual tags, which is simpler.

When a virtual-to-physical mapping is deleted from the TLB, cache entries with those virtual addresses will have to be flushed somehow. Alternatively, if cache entries are allowed on pages not mapped by the TLB, then those entries will have to be flushed when the access rights on those pages are changed in the page table.

It is also possible for the operating system to ensure that no virtual aliases are simultaneously resident in the cache. The operating system makes this guarantee by enforcing page coloring, which is described below. Some early RISC processors (SPARC, RS/6000) took this approach. It has not been used recently, as the hardware cost of detecting and evicting virtual aliases has fallen and the software complexity and performance penalty of perfect page coloring has risen.

It can be useful to distinguish the two functions of tags in an associative cache: they are used to determine which way of the entry set to select, and they are used to determine if the cache hit or missed. The second function must always be correct, but it is permissible for the first function to guess and get the wrong answer occasionally.

Some processors (e.g. early SPARCs) have caches with both virtual and physical tags. The virtual tags are used for way selection, and the physical tags are used for determining hit or miss. This kind of cache enjoys the latency advantage of a virtually tagged cache and the simple software interface of a physically tagged cache. It bears the added cost of duplicated tags, however. Also, during miss processing, the alternate ways of the cache line indexed have to be probed for virtual aliases and any matches evicted.

The extra area (and some latency) can be mitigated by keeping ''virtual hints'' with each cache entry instead of virtual tags. These hints are a subset or hash of the virtual tag, and are used for selecting the way of the cache from which to get data and a physical tag. As with a virtually tagged cache, there may be a virtual hint match but a physical tag mismatch, in which case the cache entry with the matching hint must be evicted so that cache accesses after the cache fill at this address will have just one hint match. Since virtual hints have fewer bits than virtual tags distinguishing them from one another, a virtually hinted cache suffers more conflict misses than a virtually tagged cache.
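The division of labor described above (a hint picks the way early; the physical tag decides hit or miss once translation completes) can be sketched as follows. The 4-way set, the 2-bit hint, and the hash are illustrative choices for the example, not any shipping design:

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

enum { WAYS = 4 };

struct way { int valid; uint8_t vhint; uint32_t ptag; };

/* Hypothetical 2-bit hint: a subset of the virtual tag bits. */
static uint8_t hint_of(uint32_t vtag) { return vtag & 0x3; }

/* Returns the matching way, or -1 on miss. The fill policy is assumed
 * to evict any prior entry with the same hint, so at most one way per
 * set can match a given hint. */
static int lookup(const struct way set[WAYS], uint32_t vtag, uint32_t ptag) {
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].vhint == hint_of(vtag)) {
            /* Way selected early by the hint; the physical tag check
             * confirms or rejects the hit once the TLB answers. */
            return set[w].ptag == ptag ? w : -1;
        }
    }
    return -1;
}

int main(void) {
    struct way set[WAYS] = {
        { 1, 0x1, 0xABCDu }, { 1, 0x2, 0x1234u }, { 0, 0, 0 }, { 0, 0, 0 },
    };
    /* vtag 0x5 hashes to hint 0x1, selecting way 0; ptag must still match: */
    printf("way = %d\n", lookup(set, 0x5u, 0xABCDu));   /* prints 0 (hit) */
    printf("way = %d\n", lookup(set, 0x5u, 0xFFFFu));   /* prints -1 (miss) */
    return 0;
}
</syntaxhighlight>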
Perhaps the ultimate reduction of virtual hints can be found in the Pentium 4 (Willamette and Northwood cores). In these processors the virtual hint is effectively two bits, and the cache is four-way set associative. Effectively, the hardware maintains a simple permutation from virtual address to cache index, so that no [[content-addressable memory]] (CAM) is necessary to select the right one of the four ways fetched.

===Page coloring===
{{Main article|Cache coloring}}

Large physically indexed caches (usually secondary caches) run into a problem: the operating system rather than the application controls which pages collide with one another in the cache. Differences in page allocation from one program run to the next lead to differences in the cache collision patterns, which can lead to very large differences in program performance. These differences can make it very difficult to get a consistent and repeatable timing for a benchmark run.

To understand the problem, consider a CPU with a 1 MiB physically indexed direct-mapped level-2 cache and 4 KiB virtual memory pages. Sequential physical pages map to sequential locations in the cache until after 256 pages the pattern wraps around. We can label each physical page with a color of 0–255 to denote where in the cache it can go. Locations within physical pages with different colors cannot conflict in the cache.

Programmers attempting to make maximum use of the cache may arrange their programs' access patterns so that only 1 MiB of data need be cached at any given time, thus avoiding capacity misses. But they should also ensure that the access patterns do not have conflict misses. One way to think about this problem is to divide up the virtual pages the program uses and assign them virtual colors in the same way as physical colors were assigned to physical pages before. Programmers can then arrange the access patterns of their code so that no two pages with the same virtual color are in use at the same time. There is a wide literature on such optimizations (e.g. [[loop nest optimization]]), largely coming from the [[High Performance Computing|High Performance Computing (HPC)]] community.

The snag is that while all the pages in use at any given moment may have different virtual colors, some may have the same physical colors. In fact, if the operating system assigns physical pages to virtual pages randomly and uniformly, it is extremely likely that some pages will have the same physical color, and then locations from those pages will collide in the cache (this is the [[birthday paradox]]).

The solution is to have the operating system attempt to assign different physical color pages to different virtual colors, a technique called ''page coloring''. Although the actual mapping from virtual to physical color is irrelevant to system performance, odd mappings are difficult to keep track of and have little benefit, so most approaches to page coloring simply try to keep physical and virtual page colors the same.

If the operating system can guarantee that each physical page maps to only one virtual color, then there are no virtual aliases, and the processor can use virtually indexed caches with no need for extra virtual alias probes during miss handling. Alternatively, the OS can flush a page from the cache whenever it changes from one virtual color to another. As mentioned above, this approach was used for some early SPARC and RS/6000 designs.
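For the 1 MiB direct-mapped example above, a page's color is just a bit field of its physical address, as this short C sketch shows:

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

/* 1 MiB cache / 4 KiB pages = 256 colors; pages with equal colors
 * contend for the same region of a direct-mapped cache. */
#define PAGE_SHIFT  12u                            /* 4 KiB pages */
#define CACHE_BYTES (1u << 20)                     /* 1 MiB */
#define NUM_COLORS  (CACHE_BYTES >> PAGE_SHIFT)    /* 256 */

static uint32_t page_color(uint32_t paddr) {
    return (paddr >> PAGE_SHIFT) & (NUM_COLORS - 1);
}

int main(void) {
    /* Physical pages 1 MiB apart share a color and collide in the cache: */
    printf("%u %u %u\n",
           page_color(0x00000000u),    /* color 0 */
           page_color(0x00100000u),    /* color 0 again, 1 MiB later */
           page_color(0x00001000u));   /* color 1 */
    return 0;
}
</syntaxhighlight>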
The software page coloring technique has been used to effectively partition the shared last-level cache (LLC) in multicore processors.<ref>{{Cite conference |last1=Lin |first1=Jiang |last2=Lu |first2=Qingda |last3=Ding |first3=Xiaoning |last4=Zhang |first4=Zhao |last5=Zhang |first5=Xiaodong |last6=Sadayappan |first6=P. |date=2008 |title=Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems |conference=IEEE 14th International Symposium on High Performance Computer Architecture |location=Salt Lake City, Utah |pages=367–378 |doi=10.1109/HPCA.2008.4658653}}</ref> This operating-system-based LLC management in multicore processors has been adopted by Intel.<ref>{{Cite web|url=http://web.cse.ohio-state.edu/~zhang.574/OS-cache-software_intel_2010.pdf|title=Letter to Jiang Lin}}</ref>