== Designing around soft errors ==

=== Soft error mitigation ===
A designer can attempt to minimize the rate of soft errors by judicious device design, choosing the right semiconductor, package and substrate materials, and the right device geometry. Often, however, this is limited by the need to reduce device size and voltage, to increase operating speed and to reduce power dissipation. The susceptibility of devices to upsets is described in the industry using the [[JEDEC]] [[JESD-89]] standard.

One technique that can be used to reduce the soft error rate in digital circuits is called [[radiation hardening]]. This involves increasing the capacitance at selected circuit nodes in order to increase their effective Q<sub>crit</sub> value, which reduces the range of particle energies that can upset the logic value of the node. Radiation hardening is often accomplished by increasing the size of transistors that share a drain/source region at the node. Since the area and power overhead of radiation hardening can be restrictive to design, the technique is often applied selectively to nodes which are predicted to have the highest probability of resulting in soft errors if struck. Tools and models that can predict which nodes are most vulnerable are the subject of past and current research in the area of soft errors.

=== Detecting soft errors ===
There has been work addressing soft errors in processor and memory resources using both hardware and software techniques. Several research efforts have addressed soft errors by proposing error detection and recovery via hardware-based redundant multi-threading.<ref name="ReinhardtMukherjee2000">{{cite journal |last1=Reinhardt |first1=Steven K. |last2=Mukherjee |first2=Shubhendu S. |title=Transient fault detection via simultaneous multithreading |journal=ACM SIGARCH Computer Architecture News |volume=28 |issue=2 |date=2000 |pages=25–36 |issn=0163-5964 |doi=10.1145/342001.339652 |citeseerx=10.1.1.112.37}}</ref><ref name="MukherjeeKontz2002">{{cite journal |last1=Mukherjee |first1=Shubhendu S. |last2=Kontz |first2=Michael |last3=Reinhardt |first3=Steven K. |title=Detailed design and evaluation of redundant multithreading alternatives |journal=ACM SIGARCH Computer Architecture News |volume=30 |issue=2 |date=2002 |pages=99 |issn=0163-5964 |doi=10.1145/545214.545227 |citeseerx=10.1.1.13.2922 |s2cid=1909214}}</ref><ref name="VijaykumarPomeranz2002">{{cite journal |last1=Vijaykumar |first1=T. N. |last2=Pomeranz |first2=Irith |author2-link=Irith Pomeranz |last3=Cheng |first3=Karl |title=Transient-fault recovery using simultaneous multithreading |journal=ACM SIGARCH Computer Architecture News |volume=30 |issue=2 |date=2002 |pages=87 |issn=0163-5964 |doi=10.1145/545214.545226 |s2cid=2270600}}</ref> These approaches used special hardware to replicate an application's execution in order to identify errors in the output, which increased hardware design complexity and cost and carried a high performance overhead. Software-based soft-error-tolerant schemes, on the other hand, are flexible and can be applied on commercial off-the-shelf microprocessors. Many works propose compiler-level instruction replication and result checking for soft error detection.<ref name="oh2002error">{{cite journal |last1=Oh |first1=Nahmsuk |last2=Shirvani |first2=Philip P. |last3=McCluskey |first3=Edward J. |title=Error detection by duplicated instructions in super-scalar processors |journal=IEEE Transactions on Reliability |volume=51 |date=2002 |pages=63–75 |doi=10.1109/24.994913}}</ref><ref name="reis2005swift">{{cite book |last1=Reis |first1=George A. |last2=Chang |first2=Jonathan |last3=Vachharajani |first3=Neil |last4=Rangan |first4=Ram |last5=August |first5=David I. |title=International Symposium on Code Generation and Optimization |chapter=SWIFT: Software implemented fault tolerance |location=Proceedings of the international symposium on Code generation and optimization |date=2005 |pages=243–254 |doi=10.1109/CGO.2005.34 |isbn=978-0-7695-2298-2 |citeseerx=10.1.1.472.4177 |s2cid=5746979}}</ref><ref name="Didehban2016nZDC">{{citation |last1=Didehban |first1=Moslem |last2=Shrivastava |first2=Aviral |title=Proceedings of the 53rd Annual Design Automation Conference |chapter=nZDC: A compiler technique for near zero silent data corruption |date=2016 |publisher=ACM |location=Proceedings of the 53rd Annual Design Automation Conference (DAC) |page=48 |doi=10.1145/2897937.2898054 |isbn=9781450342360 |s2cid=5618907}}</ref>

=== Correcting soft errors ===
{{see also|ECC memory}}
Designers can choose to accept that soft errors will occur, and design systems with appropriate error detection and correction to recover gracefully. Typically, a semiconductor memory design might use [[forward error correction]], incorporating redundant data into each [[Word (computer architecture)|word]] to create an [[error correcting code]]. Alternatively, [[roll-back error correction]] can be used, detecting the soft error with an [[Error detection and correction|error-detecting code]] such as [[parity bit|parity]] and rewriting correct data from another source. This technique is often used for [[write-through]] [[cache memory|cache memories]].

Soft errors in [[logic circuits]] are sometimes detected and corrected using the techniques of [[fault tolerance|fault-tolerant design]]. These often include the use of redundant circuitry or computation of data, and typically come at the cost of circuit area, decreased performance, and/or higher power consumption. The concept of [[triple modular redundancy]] (TMR) can be employed to ensure very high soft-error reliability in logic circuits. In this technique, three identical copies of a circuit compute on the same data in parallel and their outputs are fed into [[majority voting logic]], which returns the value that occurred in at least two of the three cases. In this way, a failure of one circuit due to a soft error is out-voted, provided the other two circuits operate correctly. In practice, however, few designers can afford the greater-than-200% circuit area and power overhead required, so TMR is usually applied only selectively. Another common technique for correcting soft errors in logic circuits is temporal (or time) redundancy, in which one circuit operates on the same data multiple times and compares subsequent evaluations for consistency. This approach often incurs performance overhead, area overhead (if copies of latches are used to store data), and power overhead, though it is considerably more area-efficient than modular redundancy.

Traditionally, [[Dynamic random access memory|DRAM]] has had the most attention in the quest to reduce or work around soft errors, because DRAM has comprised the majority share of susceptible device surface area in desktop and server computer systems (hence the prevalence of ECC RAM in server computers).
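The two-out-of-three decision at the heart of TMR, and the repeated-evaluation check used in temporal redundancy, can be expressed compactly in software. The following C sketch is purely illustrative; the names <code>tmr_vote</code> and <code>compute_protected</code> are hypothetical and not taken from any of the cited works:

<syntaxhighlight lang="c">
#include <stdint.h>

/* Bitwise majority vote over three redundant copies of a word:
 * a result bit is 1 only if it is 1 in at least two of the copies,
 * so an upset affecting any single copy is out-voted.              */
static inline uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Temporal-redundancy variant: evaluate the same computation three
 * times on the same input and vote on the three results.           */
uint32_t compute_protected(uint32_t (*compute)(uint32_t), uint32_t x)
{
    return tmr_vote(compute(x), compute(x), compute(x));
}
</syntaxhighlight>

In hardware TMR the same two-out-of-three function is realized by voting gates on the outputs of the three circuit copies; the temporal variant instead exercises a single circuit repeatedly on the same data.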
Reliable figures for DRAM susceptibility are hard to come by and vary considerably across designs, fabrication processes, and manufacturers. In 1980s technology, 256-kilobit DRAMs could have clusters of five or six bits flipped by a single [[alpha particle]]. Modern DRAMs have much smaller feature sizes, so the deposition of a similar amount of charge could easily cause many more bits to flip.

The design of error detection and correction circuits is helped by the fact that soft errors usually are localized to a very small area of a chip. Usually, only one cell of a memory is affected, although high-energy events can cause a multi-cell upset. Conventional memory layout usually places one bit of many different correction words adjacent on a chip, so even a ''multi-cell upset'' leads only to a number of separate ''[[Single event upset|single-bit upsets]]'' in multiple correction words, rather than a ''multi-bit upset'' in a single correction word. An error correcting code therefore needs to cope with only a single bit in error in each correction word in order to handle all likely soft errors. The term ''multi-cell'' is used for upsets affecting multiple cells of a memory, whatever correction words those cells happen to fall in; ''multi-bit'' is used when multiple bits in a single correction word are in error.
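As a minimal example of a single-error-correcting code of this kind, the following C sketch (illustrative only; the function names are hypothetical) implements a Hamming(7,4) encoder and decoder, in which any single flipped bit in a 7-bit correction word is located by the syndrome and corrected:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdint.h>

/* Encode 4 data bits into a 7-bit Hamming(7,4) codeword.
 * Bit positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4            */
static uint8_t hamming74_encode(uint8_t data)   /* 4 LSBs of data are used */
{
    uint8_t d1 = (data >> 0) & 1, d2 = (data >> 1) & 1;
    uint8_t d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;                  /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;                  /* covers positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;                  /* covers positions 4,5,6,7 */
    return (uint8_t)(p1 | p2 << 1 | d1 << 2 | p3 << 3 |
                     d2 << 4 | d3 << 5 | d4 << 6);
}

/* Decode a codeword, correcting any single-bit error. */
static uint8_t hamming74_decode(uint8_t cw)
{
    uint8_t b[8];
    for (int i = 1; i <= 7; i++) b[i] = (cw >> (i - 1)) & 1;
    uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
    int syndrome = s1 | s2 << 1 | s3 << 2;      /* 0 = no error, else position of flipped bit */
    if (syndrome) b[syndrome] ^= 1;             /* correct the single flipped bit             */
    return (uint8_t)(b[3] | b[5] << 1 | b[6] << 2 | b[7] << 3);
}

int main(void)
{
    uint8_t word = 0xB;                              /* 4-bit data value 1011   */
    uint8_t cw   = hamming74_encode(word);
    cw ^= 1 << 4;                                    /* simulate one soft error */
    printf("recovered %#x\n", hamming74_decode(cw)); /* prints 0xb              */
    return 0;
}
</syntaxhighlight>

Production ECC memory typically uses wider single-error-correct, double-error-detect (SEC-DED) codes such as (72,64), but the principle is the same: as long as the bits of different correction words are interleaved across the chip, each word sees at most one flipped bit and the code can correct it.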