== Spatial and temporal locality usage ==

=== Hierarchical memory ===
{{main|Memory hierarchy}}

Hierarchical memory is a hardware optimization that exploits spatial and temporal locality, and it can be applied at several levels of the memory hierarchy. [[Paging]] benefits from both temporal and spatial locality. A cache is a simple example of exploiting temporal locality: it is a specially designed, faster but smaller memory area, generally used to keep recently referenced data and data near recently referenced data, which can improve performance. Data elements in a cache do not necessarily correspond to data elements that are spatially close in main memory; however, data elements are brought into the cache one [[cache line]] at a time. This makes spatial locality important again: if one element is referenced, a few neighboring elements are brought into the cache as well. Finally, temporal locality plays a role at the lowest level, since results that are referenced close together in time can be kept in the [[Processor register|machine registers]]. Some programming languages (such as [[C (programming language)|C]]) allow the programmer to suggest that certain variables be kept in registers.

Data locality is a typical memory-reference feature of regular programs (though many irregular memory access patterns exist), and it is what makes a hierarchical memory layout profitable. In computers, memory is divided into a hierarchy in order to speed up data accesses. The lower levels of the memory hierarchy tend to be slower but larger. Thus, a program achieves greater performance if it uses data while it is cached in the upper levels of the memory hierarchy and avoids bringing other data into those levels that would displace data needed shortly. This is an ideal, and sometimes cannot be achieved.

Typical memory hierarchy (access times and cache sizes are approximations of typical values used {{As of|2013|lc=on}} for the purpose of discussion; actual values and actual numbers of levels in the hierarchy vary):
* [[CPU register]]s (8–256 registers) – immediate access, with the speed of the innermost core of the processor
* L1 [[CPU cache]]s (32 KB to 512 [[kilobyte|KB]]) – fast access, with the speed of the innermost memory bus owned exclusively by each core
* L2 CPU caches (128 KB to 24 [[megabyte|MB]]) – slightly slower access, with the speed of the [[memory bus]] shared between pairs of cores
* L3 CPU caches (2 MB to 64 [[megabyte|MB]]) – even slower access, with the speed of the memory bus shared between even more cores of the same processor
* Main [[physical memory]] ([[random-access memory|RAM]]) (256 MB to 64 [[gigabyte|GB]]) – slow access, the speed of which is limited by the spatial distances and the hardware interfaces between the processor and the memory modules on the [[motherboard]]
* Disk ([[virtual memory]], [[file system]]) (1 GB to 256 [[terabyte|TB]]) – very slow, due to the narrower (in bit width), physically much longer data channel between the main board of the computer and the disk devices, and due to the additional software protocol needed on top of the slow hardware interface
* Remote memory (other computers or the cloud) (practically unlimited) – speed varies from very slow to extremely slow

Modern machines tend to read blocks of lower memory into the next level of the memory hierarchy. If this displaces used memory, the [[operating system]] tries to predict which data will be accessed least (or latest) and move it down the memory hierarchy. Prediction algorithms tend to be simple to reduce hardware complexity, though they are becoming somewhat more complicated.
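As an illustration of how cache lines reward spatial locality, the following C sketch contrasts two traversals of the same row-major array (the array size and the function names are chosen arbitrarily for this example, not taken from any particular library):

<syntaxhighlight lang="c" line="1">
#include <stddef.h>
#include <stdio.h>

#define N 2048

/* Row-major traversal: consecutive accesses touch adjacent addresses,
   so each cache line brought in is fully used (good spatial locality). */
double sum_row_major(double a[N][N]) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same array: consecutive accesses are
   N * sizeof(double) bytes apart, so almost every access touches a new
   cache line (poor spatial locality). */
double sum_col_major(double a[N][N]) {
    double sum = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

static double a[N][N];

int main(void) {
    /* Fill the array once so both traversals operate on the same data. */
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("row-major sum: %g\n", sum_row_major(a));
    printf("col-major sum: %g\n", sum_col_major(a));
    return 0;
}
</syntaxhighlight>

Both functions compute the same result; only the order of memory accesses differs, which is exactly the effect the matrix-multiplication example below exploits.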
=== Matrix multiplication ===
A common example is [[Matrix multiplication algorithm|matrix multiplication]]:

<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for j in 0..m
    for k in 0..p
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
</syntaxhighlight>

By switching the looping order for <code>j</code> and <code>k</code>, the speedup in large matrix multiplications becomes dramatic, at least for languages that put contiguous array elements in the last dimension. This does not change the mathematical result, but it improves efficiency. In this case, "large" means, approximately, more than 100,000 elements in each matrix, or enough addressable memory that the matrices do not fit in the L1 and L2 caches.

<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for k in 0..p
    for j in 0..m
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
</syntaxhighlight>

The reason for this speedup is that in the first case, the reads of <code>A[i][k]</code> are in cache (since the <code>k</code> index is the contiguous, last dimension), but <code>B[k][j]</code> is not, so there is a cache miss penalty on <code>B[k][j]</code>. <code>C[i][j]</code> is irrelevant, because it can be [[Loop-invariant code motion|hoisted]] out of the inner loop; the loop variable there is <code>k</code>:

<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for j in 0..m
    temp = C[i][j]
    for k in 0..p
      temp = temp + A[i][k] * B[k][j];
    C[i][j] = temp
</syntaxhighlight>

In the second case, the reads and writes of <code>C[i][j]</code> are both in cache, the reads of <code>B[k][j]</code> are in cache, and the read of <code>A[i][k]</code> can be hoisted out of the inner loop:

<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for k in 0..p
    temp = A[i][k]
    for j in 0..m
      C[i][j] = C[i][j] + temp * B[k][j];
</syntaxhighlight>

Thus, the second example has no cache miss penalty in the inner loop, while the first example has a cache penalty. On a year-2014 processor, the second case is approximately five times faster than the first, when written in [[C (programming language)|C]] and compiled with <code>gcc -O3</code>. (A careful examination of the disassembled code shows that in the first case, [[GNU Compiler Collection|GCC]] uses [[SIMD]] instructions and in the second case it does not, but the cache penalty is much worse than the SIMD gain.){{Citation needed|date=September 2014}}
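A minimal C harness for observing the effect is sketched below; the matrix size, the <code>seconds()</code> helper, and the use of <code>clock_gettime</code> are illustrative choices, not part of the measurement cited above:

<syntaxhighlight lang="c" line="1">
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024  /* large enough that the matrices exceed typical L1/L2 caches */

static double A[N][N], B[N][N], C[N][N];

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = rand() / (double)RAND_MAX;
            B[i][j] = rand() / (double)RAND_MAX;
        }

    double t0 = seconds();
    /* i-j-k order: the inner loop reads B[k][j] with a stride of N doubles */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
    double t1 = seconds();

    /* i-k-j order: the inner loop walks B and C contiguously
       (C is not reset between runs; only the timing matters here) */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
    double t2 = seconds();

    /* print a checksum so the compiler cannot discard the work */
    printf("checksum: %g\n", C[0][0]);
    printf("i-j-k: %.2f s, i-k-j: %.2f s\n", t1 - t0, t2 - t1);
    return 0;
}
</syntaxhighlight>

Compiled with <code>gcc -O3</code>, the i-k-j loop would be expected to run several times faster than the i-j-k loop on hardware of that era, in line with the figure above; exact ratios depend on the processor and cache sizes.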
Temporal locality can also be improved in the above example by using a technique called [[Loop blocking|blocking]]. The larger matrix can be divided into evenly sized sub-matrices, so that the smaller blocks can be referenced (multiplied) several times while in memory. Note that this example works for square matrices of dimensions SIZE x SIZE, but it can easily be extended for arbitrary matrices by substituting SIZE_I, SIZE_J and SIZE_K where appropriate.

<syntaxhighlight lang="pascal" line="1">
for (ii = 0; ii < SIZE; ii += BLOCK_SIZE)
  for (kk = 0; kk < SIZE; kk += BLOCK_SIZE)
    for (jj = 0; jj < SIZE; jj += BLOCK_SIZE) {
      maxi = min(ii + BLOCK_SIZE, SIZE);
      for (i = ii; i < maxi; i++) {
        maxk = min(kk + BLOCK_SIZE, SIZE);
        for (k = kk; k < maxk; k++) {
          maxj = min(jj + BLOCK_SIZE, SIZE);
          for (j = jj; j < maxj; j++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
        }
      }
    }
</syntaxhighlight>

The temporal locality of the above solution is provided because a block can be used several times before moving on, so that it is moved in and out of memory less often. Spatial locality is improved because elements with consecutive memory addresses tend to be pulled up the memory hierarchy together.
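A common rule of thumb for choosing BLOCK_SIZE (a heuristic, not part of the example above) is that the three active blocks of A, B and C should fit in the targeted cache together, i.e. 3 × BLOCK_SIZE² × sizeof(element) ≤ cache size. For a 32 KB L1 data cache and 8-byte doubles, this gives BLOCK_SIZE ≈ √(32768 / 24) ≈ 36, so a BLOCK_SIZE of 32 is a reasonable starting point; in practice the best value is usually found by measurement.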