=== Correcting soft errors ===
{{see also|ECC memory}}

Designers can choose to accept that soft errors will occur and design systems with appropriate error detection and correction so that they recover gracefully. A semiconductor memory design typically uses [[forward error correction]], incorporating redundant data into each [[Word (computer architecture)|word]] to create an [[error correcting code]]. Alternatively, [[roll-back error correction]] can be used: the soft error is detected with an [[Error detection and correction|error-detecting code]] such as [[parity bit|parity]], and correct data is rewritten from another source. This technique is often used for [[write-through]] [[cache memory|cache memories]], where a valid copy of the data is always held in the next level of the memory hierarchy.
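As an illustration of forward error correction, the following Python sketch implements the small Hamming(7,4) code, which stores three check bits alongside four data bits and can correct any single flipped bit in the resulting seven-bit word. It is a minimal model only: the function names are chosen here for illustration, and practical memories generally use wider codes over 32-bit or 64-bit words.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (an int in 0..15) as a list of 7 code bits.

    List index i holds code position i + 1. Positions 1, 2 and 4 are
    check bits; positions 3, 5, 6 and 7 carry the data bits.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 is unused padding
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]      # even parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # even parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # even parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome) after correcting at most one flipped bit.

    The syndrome is 0 for a clean word; otherwise it is the position
    (1..7) of the bit that was flipped back.
    """
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1                    # undo the single-bit upset
    d = [code[3], code[5], code[6], code[7]]
    return sum(b << i for i, b in enumerate(d)), syndrome


if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    word[4] ^= 1                               # simulate a soft error at position 5
    print(hamming74_decode(word))
```

A code of this kind corrects one error per word. Two flipped bits in the same word would be miscorrected, which is why an additional overall parity bit is commonly added to detect double errors.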

Soft errors in [[logic circuits]] are sometimes detected and corrected using the techniques of [[fault tolerance|fault-tolerant design]]. These often rely on redundant circuitry or redundant computation, and typically come at the cost of circuit area, performance, or power consumption. [[Triple modular redundancy]] (TMR) can be employed to ensure very high soft-error reliability in logic circuits. In this technique, three identical copies of a circuit compute on the same data in parallel, and their outputs are fed into [[majority voting logic]], which returns the value produced by at least two of the three copies. The failure of one copy due to a soft error is thus masked, provided the other two copies operated correctly. In practice, however, few designers can afford the greater than 200% circuit area and power overhead required, so TMR is usually applied only selectively. Another common way to correct soft errors in logic circuits is temporal (or time) redundancy, in which one circuit operates on the same data multiple times and the successive results are compared for consistency. This approach often incurs performance overhead, area overhead (if copies of latches are used to store data), and power overhead, though it is considerably more area-efficient than modular redundancy.
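The voting step of triple modular redundancy can be sketched in a few lines of Python. The example below is a software model under simplifying assumptions, not a hardware description: <code>adder</code> is a hypothetical stand-in for the protected logic block, and the <code>upset_copy</code> and <code>upset_mask</code> parameters exist only to inject a simulated soft error into one copy's output.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three equally wide integer outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(circuit, x, upset_copy=None, upset_mask=0):
    """Evaluate three copies of `circuit` on x and vote on the results.

    If `upset_copy` is 0, 1 or 2, the bits in `upset_mask` are flipped
    in that copy's output to model a soft error.
    """
    outputs = [circuit(x) for _ in range(3)]
    if upset_copy is not None:
        outputs[upset_copy] ^= upset_mask
    return majority_vote(*outputs)


def adder(x):
    """Hypothetical 8-bit logic block used only for this illustration."""
    return (x + 1) & 0xFF


if __name__ == "__main__":
    # One corrupted copy is outvoted by the two correct ones.
    print(tmr(adder, 41, upset_copy=1, upset_mask=0b1000))
```

Because the vote is taken bit by bit, the result is correct as long as no bit position is corrupted in more than one copy. The voter itself is not protected in this sketch.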

Traditionally, [[Dynamic random access memory|DRAM]] has received the most attention in the quest to reduce or work around soft errors, because DRAM has made up the majority of the susceptible device surface area in desktop and server computer systems, as reflected in the prevalence of ECC RAM in server computers. Reliable figures for DRAM susceptibility are difficult to obtain and vary considerably across designs, fabrication processes, and manufacturers. In 256-kilobit DRAMs built with 1980s technology, a single [[alpha particle]] could flip clusters of five or six bits. Modern DRAMs have much smaller feature sizes, so the deposition of a similar amount of charge could easily cause many more bits to flip.

The design of error detection and correction circuits is made easier by the fact that soft errors are usually localised to a very small area of a chip. In most cases only one cell of a memory is affected, although high-energy events can cause a multi-cell upset. The term 'multi-cell' is used for upsets affecting multiple cells of a memory, whatever correction words those cells happen to fall in; 'multi-bit' is used when multiple bits in a single correction word are in error. Conventional memory layouts usually interleave the correction words, so that physically adjacent cells on the chip hold bits belonging to different words. As a result, even a ''multi-cell upset'' leads only to a number of separate ''[[Single event upset|single-bit upsets]]'' in multiple correction words, rather than a ''multi-bit upset'' in a single correction word. An error correcting code therefore needs to cope with only a single bit in error in each correction word in order to handle all likely soft errors.
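The effect of this layout can be illustrated with a short Python sketch. It assumes a simple modulo interleaving scheme, in which cell ''i'' of a physical row holds a bit of correction word ''i'' mod ''N''; real devices may map cells to words differently, and the function names are chosen here for illustration.

```python
def cell_to_word_bit(cell, num_words):
    """Map a physical cell index in a row to (word index, bit index).

    Assumes simple modulo interleaving of `num_words` correction words.
    """
    return cell % num_words, cell // num_words


def upsets_per_word(first_cell, cluster_size, num_words):
    """Count the bits each correction word loses to a run of adjacent upset cells."""
    counts = {}
    for cell in range(first_cell, first_cell + cluster_size):
        word, _ = cell_to_word_bit(cell, num_words)
        counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    # Four adjacent upset cells, eight interleaved words:
    # each affected word sees only a single-bit upset.
    print(upsets_per_word(first_cell=10, cluster_size=4, num_words=8))
```

Under this assumed mapping, a cluster of adjacent upset cells produces at most one error per correction word as long as the cluster is no wider than the number of interleaved words. A wider cluster wraps around and causes a multi-bit upset in at least one word.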