==== Caching and memory hierarchy ====
{{Further|Memory hierarchy}}
Modern computers can have relatively large amounts of memory (often gigabytes), so fitting an algorithm into a confined amount of memory is less of a problem than it used to be. However, the different types of memory and their relative access speeds can significantly affect performance:

* [[Processor register]]s are the fastest memory and have the least capacity. Most direct computation on modern computers occurs with source and destination operands in registers, before results are written back to the cache, main memory and virtual memory if needed. On a [[CPU core|processor core]], there are typically on the order of hundreds of bytes or fewer of register space available, although a [[register file]] may contain more physical registers than the [[Instruction set architecture|architectural]] registers defined in the instruction set architecture.
* [[CPU cache|Cache memory]] is the second-fastest and second-smallest memory in the hierarchy. Caches are present in processors such as CPUs or GPUs, where they are typically implemented in [[Static random-access memory|static RAM]], though they can also be found in peripherals such as disk drives. Processor caches often have their own [[Cache hierarchy|multi-level hierarchy]]; lower levels are larger, slower and typically [[shared cache|shared]] between [[processor core]]s in [[multi-core processor]]s. In order to process operands in cache memory, a [[processor (computing)|processing unit]] must fetch the data from the cache, perform the operation in registers and write the data back to the cache. If the data is in the [[L1 cache]], this operates at speeds comparable to (about 2–10 times slower than) the CPU or GPU's [[arithmetic logic unit]] or [[floating-point unit]].<ref name="CompArc:QuantApp">{{cite book |last1=Hennessy |first1=John L |last2=Patterson |first2=David A |last3=Asanović |first3=Krste |author-link3=Krste Asanović |last4=Bakos |first4=Jason D |last5=Colwell |first5=Robert P |last6=Bhattacharjee |first6=Abhishek |last7=Conte |first7=Thomas M |last8=Duato |first8=José |last9=Franklin |first9=Diana |last10=Goldberg |first10=David |author-link11=Norman Jouppi|last11=Jouppi |first11=Norman P |last12=Li |first12=Sheng |last13=Muralimanohar |first13=Naveen |last14=Peterson |first14=Gregory D |last15=Pinkston |first15=Timothy Mark |last16=Ranganathan |first16=Prakash |last17=Wood |first17=David Allen |last18=Young |first18=Clifford |last19=Zaky |first19=Amr |title=Computer Architecture: a Quantitative Approach |date=2011 |publisher=Elsevier Science |isbn=978-0128119051 |edition=Sixth |language=en|oclc=983459758 }}</ref> Access is about 10 times slower if there is an L1 [[cache miss]] and the data must be retrieved from and written to the [[L2 cache]], and a further 10 times slower if there is an L2 cache miss and the data must be retrieved from an [[L3 cache]], if present.
* [[Main memory|Main physical memory]] is most often implemented in [[dynamic RAM]] (DRAM). Main memory is much larger than an L3 CPU cache (typically [[gigabyte]]s compared to ≈8 [[megabyte]]s), with read and write latencies typically 10–100 times higher.<ref name="CompArc:QuantApp" /> {{As of|2018}}, RAM is increasingly integrated [[system-on-chip|on-chip]] with processors, as CPU or [[GPU memory]].{{Citation needed |date=July 2024}}
* [[Paged memory]], often used for [[virtual memory]] management, is memory stored in [[secondary storage]] such as a [[hard disk]]. It extends the [[memory hierarchy]] to allow use of a potentially larger storage space, at the cost of much higher latency, typically around 1000 times slower than a [[cache miss]] for a value in RAM.<ref name="CompArc:QuantApp" /> Although originally intended to create the impression that more memory was available than was physically present, virtual memory is more important in contemporary usage for its [[time-space tradeoff]] and for enabling the use of [[virtual machine]]s.<ref name="CompArc:QuantApp" /> Accesses that miss in main memory and must be served from secondary storage are called [[page fault]]s, and they incur very large performance penalties.
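
The effect of these relative latencies on a program can be estimated with an average access time calculation. The following is a minimal sketch, not a model of any particular processor: the function name <code>average_access_time</code>, the latency units and the hit rates are hypothetical values chosen for illustration, using the roughly tenfold steps between levels described above.

```python
def average_access_time(levels, memory_latency):
    """Estimate the expected cost of one memory access.

    levels: list of (hit_latency, hit_rate) pairs, ordered from the
            fastest cache (e.g. L1) to the slowest (e.g. L3).
            These values are illustrative assumptions, not measurements.
    memory_latency: cost of an access served from main memory.
    """
    # Work backwards from main memory: an access to a level always pays
    # that level's lookup latency, and on a miss it additionally pays
    # the expected cost of the next level down.
    expected = memory_latency
    for hit_latency, hit_rate in reversed(levels):
        expected = hit_latency + (1.0 - hit_rate) * expected
    return expected
```

With hypothetical latencies of 1, 10 and 100 time units and hit rates of 90% and 80%, <code>average_access_time([(1, 0.9), (10, 0.8)], 100)</code> evaluates to about 4 units, compared with 100 units if every access went to main memory. Lowering the first hit rate to 50% raises the estimate to about 16 units, which suggests how strongly the result depends on how often data is found in the fastest level.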

An algorithm whose memory needs fit in cache memory will be much faster than one whose needs fit only in main memory, which in turn will be very much faster than one that has to resort to paging. Because of this, [[cache replacement policies]] are extremely important to high-performance computing, as are [[Cache-aware model|cache-aware programming]] and [[data alignment]]. To further complicate the issue, some systems have up to three levels of cache memory, with varying effective speeds. Different systems have different amounts of these various types of memory, so the effect of an algorithm's memory needs can vary greatly from one system to another.
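
One common form of cache-aware programming is blocking (also called tiling), in which a large problem is processed in pieces small enough to stay in cache. The sketch below applies the idea to a matrix transpose. It is an illustration under stated assumptions rather than a tuned implementation: the function name <code>transpose_blocked</code> and the default tile size of 32 are hypothetical, and in an interpreted language such as Python the interpreter overhead may hide most of the benefit that the same loop structure can give in a compiled language.

```python
from array import array


def transpose_blocked(src, n, block=32):
    """Transpose an n x n matrix stored row by row in a flat array.

    The matrix is processed in block x block tiles so that the rows of
    the source and the rows of the destination touched by one tile can
    remain in cache together. The tile size is an assumed tuning
    parameter; a suitable value depends on the cache size of the system.
    """
    dst = array('d', [0.0]) * (n * n)
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            # Copy one tile; min() handles tiles cut off at the edges.
            for i in range(i0, min(i0 + block, n)):
                for j in range(j0, min(j0 + block, n)):
                    dst[j * n + i] = src[i * n + j]
    return dst
```

The result is the same for any positive tile size; only the order in which memory is visited changes. Choosing the tile size from the known cache size is what makes the approach cache-aware.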

In the early days of electronic computing, if an algorithm and its data would not fit in main memory, the algorithm could not be used. Today, virtual memory appears to provide much more memory, but at the cost of performance. Much higher speed can be obtained if an algorithm and its data fit in cache memory; in this case minimizing space also helps minimize time. This benefit follows from the [[principle of locality]], also known as [[locality of reference]], which can be subdivided into [[spatial locality]] and [[temporal locality]]. An algorithm that does not fit completely in cache memory but exhibits locality of reference may still perform reasonably well.
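
Spatial locality can be illustrated by two ways of summing the same two-dimensional data. In the sketch below, the function names and the row-by-row storage layout are assumptions made for the example. Both functions compute the same total, but the first visits adjacent memory locations in turn while the second jumps ahead by a full row on every step.

```python
from array import array


def sum_row_major(data, rows, cols):
    """Sum a rows x cols matrix stored row by row, visiting elements
    in the order they are laid out in memory (good spatial locality)."""
    total = 0.0
    for r in range(rows):
        base = r * cols
        for c in range(cols):
            total += data[base + c]
    return total


def sum_column_major(data, rows, cols):
    """Sum the same matrix column by column, so that consecutive
    accesses are cols elements apart (poor spatial locality)."""
    total = 0.0
    for c in range(cols):
        for r in range(rows):
            total += data[r * cols + c]
    return total
```

For example, with <code>data = array('d', range(12))</code> treated as a 3 × 4 matrix, both functions return the same sum. Any difference in running time depends on the matrix size, the hardware and the language implementation, so it has to be measured rather than assumed.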