
==Overview==
The L1 [[Cache (computing)|data cache]] should usually be the most critical CPU resource, because few things improve [[instructions per cycle]] (IPC) as directly as a larger data cache, yet a larger data cache takes longer to access, and [[Instruction pipeline|pipelining]] the data cache makes IPC worse. One way to reduce the latency of the L1 data cache access is to fuse the address generation sum operation with the decode operation in the cache SRAM.

The address generation sum operation still must be performed, because other units in the memory pipe will use the resulting virtual address. That sum will be performed in parallel with the fused add/decode described here.

The most profitable recurrence to accelerate is a load, followed by a use of that load in a chain of integer operations leading to another load. Assuming that load results are bypassed with the same priority as integer results, this recurrence can be summarized as a load followed by another load, as if the program were following a linked list.

The rest of this page assumes an [[instruction set architecture]] (ISA) with a single addressing mode (register+offset), a virtually indexed data cache, and sign-extending loads that may be variable-width. Most [[RISC]] ISAs fit this description. In ISAs such as the [[Intel]] [[x86]], three or four inputs are summed to generate the virtual address. Multiple-input additions can be reduced to a two-input addition with carry-save adders, and the remaining problem is as described below. The critical recurrence, then, is an [[Adder (electronics)|adder]], a [[Binary decoder|decoder]], the SRAM word line, the SRAM bit line(s), the sense amp(s), the byte steering [[multiplexer|muxes]], and the bypass muxes.
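
The reduction of a three-input address sum to a two-input one can be sketched in software. The following is a minimal illustration, not a description of any particular processor: the 32-bit width, the function name <code>carry_save_3to2</code>, and the operand values are assumptions chosen for the example. Each output bit depends only on the three input bits in the same position, so no carry ripples across the word; the carry propagation is deferred to the single two-input addition that follows.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit address width (illustrative only)

def carry_save_3to2(a, b, c, mask=MASK32):
    """Compress three addends into a (sum, carry) pair.

    Every bit position is handled independently, like a row of full
    adders, so there is no carry chain in this step.
    """
    partial_sum = (a ^ b ^ c) & mask                     # per-bit sum
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask  # per-bit majority, shifted left
    return partial_sum, carry

# Hypothetical x86-style operands: base + scaled index + displacement.
base, scaled_index, displacement = 0x10000F00, 0x000001F8, 0x00000010
s, c = carry_save_3to2(base, scaled_index, displacement)

# The remaining two-input addition yields the same address as the
# original three-input sum (modulo the address width).
assert (s + c) & MASK32 == (base + scaled_index + displacement) & MASK32
```

The pair <code>(s, c)</code> then plays the role of the two inputs to the address generation sum discussed in the rest of the page.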

For this example, a direct-mapped 16&nbsp;[[Kibibyte|KB]] data cache which returns doubleword (8-byte) aligned values is assumed. Each line of the SRAM is 8 bytes, so there are 16384/8 = 2048 lines, addressed by the 11 bits Addr[13:3]. The sum-addressed SRAM idea applies equally well to set-associative caches.
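
As a reference point, the conventional (unfused) path for this example cache can be modelled as an addition followed by a decode. The sketch below assumes the geometry stated above; the names <code>row_index</code> and <code>decode_one_hot</code> are hypothetical and introduced only for illustration. It shows the behaviour that a sum-addressed decoder must reproduce, not the fused circuit itself.

```python
LINE_BYTES = 8               # each SRAM line holds one doubleword
NUM_LINES = 2048             # 16 KB / 8 bytes per line
INDEX_SHIFT = 3              # Addr[2:0] selects a byte within the line
INDEX_MASK = NUM_LINES - 1   # keeps the 11 bits Addr[13:3]

def row_index(base, offset):
    """Add first, then extract Addr[13:3] from the virtual address."""
    virtual_address = base + offset
    return (virtual_address >> INDEX_SHIFT) & INDEX_MASK

def decode_one_hot(index):
    """Model the decoder: exactly one of the 2048 word lines is asserted."""
    return [1 if line == index else 0 for line in range(NUM_LINES)]

# Example: register value 0x1000 plus offset 8 selects line 513.
word_lines = decode_one_hot(row_index(0x1000, 8))
assert word_lines[513] == 1 and sum(word_lines) == 1
```

In this model the decode cannot begin until the full addition has finished, which is the serial dependency that fusing the two operations is meant to shorten.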