Failure rate

Failure rate is the frequency with which any system or component fails, expressed in failures per unit of time. It thus depends on the system conditions, time interval, and total number of systems under study.<ref name='macdiarmid'>* Template:Cite book</ref> It can describe electronic, mechanical, or biological systems, in fields such as systems and reliability engineering, medicine and biology, or insurance and finance. It is usually denoted by the Greek letter <math>\lambda</math> (lambda).

In real-world applications, the failure probability of a system usually varies over time; failures occur more frequently in early life ("burning in"), or as a system ages ("wearing out"). This is known as the bathtub curve, where the middle region is called the "useful life period".

Mean time between failures (MTBF)

The mean time between failures (MTBF, <math>1/\lambda</math>) is often reported instead of the failure rate, as numbers such as "2,000 hours" are more intuitive than numbers such as "0.0005 per hour".

However, this is only valid if the failure rate <math>\lambda(t)</math> is actually constant over time, such as within the flat region of the bathtub curve. In many cases where MTBF is quoted, it refers only to this region; thus it cannot be used to give an accurate calculation of the average lifetime of a system, as it ignores the "burn-in" and "wear-out" regions.
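
As a minimal sketch of the conversion above, assuming a constant failure rate expressed per hour (the function names are illustrative and not taken from any library):

```python
def mtbf_from_failure_rate(failure_rate_per_hour: float) -> float:
    """MTBF in hours; only meaningful while the failure rate is constant."""
    if failure_rate_per_hour <= 0:
        raise ValueError("failure rate must be positive")
    return 1.0 / failure_rate_per_hour


def failure_rate_from_mtbf(mtbf_hours: float) -> float:
    """Constant failure rate (failures per hour) implied by an MTBF in hours."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 / mtbf_hours


# The figures quoted in the text: 0.0005 failures per hour and 2,000 hours.
print(mtbf_from_failure_rate(0.0005))
print(failure_rate_from_mtbf(2000.0))
```

Neither function says anything about the burn-in or wear-out regions, which is why the result should not be read as an average lifetime.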

MTBF appears frequently in engineering design requirements, and governs the frequency of required system maintenance and inspections. A similar ratio used in the transport industries, especially in railways and trucking, is "mean distance between failures", which allows maintenance to be scheduled based on distance travelled, rather than at regular time intervals.

Mathematical definition

The simplest definition of failure rate <math>\lambda</math> is the number of failures <math>\Delta n</math> per time interval <math>\Delta t</math>:

<math>\lambda = \frac{\Delta n}{\Delta t}</math>

which would depend on the number of systems under study, and the conditions over the time period.

Failures over time

Figure: Cumulative distribution function for the exponential distribution, often used as the cumulative failure function <math>F(t)</math>.

To accurately model failures over time, a cumulative failure distribution <math>F(t)</math> must be defined, which can be any cumulative distribution function (CDF) that increases from <math>0</math> to <math>1</math> without ever decreasing. In the case of many identical systems, this may be thought of as the fraction of systems that have failed by time <math>t</math>, after all starting operation at time <math>t=0</math>; or in the case of a single system, as the probability that its failure time <math>T</math> is no later than time <math>t</math>:

<math>F(t) = \operatorname{P}(T\le t).</math>

Figure: Exponential probability density functions, often used as the failure probability density <math>f(t)</math>.

As CDFs are defined by integrating a probability density function, the failure probability density <math>f(t)</math> is defined such that:

<math>F(t) = \int_{0}^{t} f(\tau)\, d\tau,</math>

where <math>\tau</math> is a dummy integration variable. Here <math>f(t)</math> can be thought of as the instantaneous failure rate, i.e. the fraction of failures per unit time, as the size of the time interval <math>\Delta t</math> tends towards <math>0</math>:

<math>f(t) = \lim_{\Delta t \to 0^+} \frac{P(t<T\leq t + \Delta t)}{\Delta t}. </math>
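
The limit above can be checked numerically. The following sketch assumes, purely for illustration, an exponential failure distribution with a rate of 0.0005 per hour, and compares the finite-difference ratio with the closed-form density; the function names and the chosen values are assumptions, not part of the definition.

```python
import math

RATE = 0.0005  # assumed constant rate (per hour), for illustration only


def cumulative_failure(t: float) -> float:
    """F(t) = P(T <= t) for the assumed exponential model."""
    return 1.0 - math.exp(-RATE * t)


def failure_density(t: float) -> float:
    """Closed-form f(t) for the same model."""
    return RATE * math.exp(-RATE * t)


def density_by_difference(t: float, dt: float) -> float:
    """P(t < T <= t + dt) / dt, which approaches f(t) as dt shrinks."""
    return (cumulative_failure(t + dt) - cumulative_failure(t)) / dt


t = 1000.0
for dt in (100.0, 1.0, 0.001):
    print(dt, density_by_difference(t, dt), failure_density(t))
```

As the interval shrinks, the ratio should settle towards the closed-form density.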

Hazard rate

A concept closely related to, but different from,<ref name="todinov">Template:Cite book</ref> the instantaneous failure rate <math>f(t)</math> is the hazard rate (or hazard function), <math>h(t)</math>.

In the many-system case, this is defined as the proportional failure rate of the systems still functioning at time <math>t</math> (as opposed to <math>f(t)</math>, which is expressed as a proportion of the initial number of systems).

For convenience we first define the reliability (or survival function) as:

<math>R(t) = 1 - F(t),</math>

then the hazard rate is simply the instantaneous failure rate, divided by the fraction of surviving systems at time <math>t</math>:

<math>h(t) = \frac{f(t)}{R(t)}.</math>

In the probabilistic sense, for a single system this can be interpreted as the conditional probability per unit time that the failure time <math>T</math> falls within the interval from <math>t</math> to <math>t + \Delta t</math>, given that the system or component has already survived to time <math>t</math>:

<math>h(t) = \lim_{\Delta t \to 0^+} \frac{P(t < T \leq t + \Delta t \mid T>t)}{\Delta t}.</math>
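
The difference between the two rates in the many-system case can be sketched with a discrete approximation. The failure counts below are hypothetical, chosen only to make the arithmetic easy to follow; the density divides by the initial population, whereas the hazard divides by the survivors at the start of each interval.

```python
def empirical_rates(failures_per_interval, initial_count, interval_hours):
    """Return (survivors, density, hazard) per interval.

    density ~ f(t): failures as a fraction of the initial population, per hour.
    hazard  ~ h(t): failures as a fraction of the survivors, per hour.
    """
    survivors = initial_count
    rows = []
    for failed in failures_per_interval:
        if survivors == 0:
            break  # nothing left to fail, so the hazard is undefined
        density = failed / (initial_count * interval_hours)
        hazard = failed / (survivors * interval_hours)
        rows.append((survivors, density, hazard))
        survivors -= failed
    return rows


# Hypothetical test: 1,000 systems observed over three 100-hour intervals.
rows = empirical_rates([100, 90, 81], initial_count=1000, interval_hours=100.0)
for survivors, density, hazard in rows:
    print(f"survivors={survivors}  f={density:.5f}  h={hazard:.5f}")
```

With these illustrative counts the density falls from one interval to the next while the hazard stays the same, because each interval removes the same fraction of the remaining systems.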

Conversion to cumulative failure rate

To convert between <math>h(t)</math> and <math>F(t)</math>, we can solve the differential equation

<math>h(t) = \frac{f(t)}{R(t)} = -\frac{R'(t)}{R(t)}</math>

with initial condition <math>R(0)=1</math>, which yields<ref name="todinov" />

<math>F(t) = 1 - \exp{\left(-\int_0^t h(\tau) d\tau \right)}.</math>

Thus for a collection of identical systems, only one of hazard rate <math>h(t)</math>, failure probability density <math>f(t)</math>, or cumulative failure distribution <math>F(t)</math> need be defined.

Confusion can occur as the notation <math>\lambda(t)</math> for "failure rate" often refers to the function <math>h(t)</math> rather than <math>f(t).</math><ref>Template:Cite book</ref>
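
When the integral of <math>h(t)</math> has no convenient closed form, the conversion from hazard rate to cumulative failure distribution can be carried out numerically. The sketch below uses a plain trapezoidal rule; the function name, the step count, and the linearly increasing example hazard are assumptions made for illustration.

```python
import math


def cumulative_failure_from_hazard(hazard, t, steps=1000):
    """Approximate F(t) = 1 - exp(-integral of hazard from 0 to t).

    `hazard` is any callable returning h(tau) for tau >= 0; the integral is
    evaluated with the trapezoidal rule on `steps` equal sub-intervals.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    if t == 0:
        return 0.0
    dt = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, steps):
        total += hazard(i * dt)
    return 1.0 - math.exp(-total * dt)


# Illustrative wear-out model: hazard grows linearly with age.
def linear_hazard(tau):
    return 2e-6 * tau


print(cumulative_failure_from_hazard(linear_hazard, 1000.0))
# Closed form for comparison: 1 - exp(-(2e-6) * t**2 / 2)
print(1.0 - math.exp(-1e-6 * 1000.0 ** 2))
```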

Constant hazard rate model

There are many possible functions that could be chosen to represent failure probability density <math>f(t)</math> or hazard rate <math>h(t)</math>, based on empirical or theoretical evidence, but the most common and most easily understood choice is to set

<math>f(t) = \lambda e^{-\lambda t}</math>,

an exponential function with scaling constant <math>\lambda</math>. As seen in the figures above, this represents a gradually decreasing failure probability density.

The CDF <math>F(t)</math> is then calculated as:

<math>F(t)=\int_{0}^{t} \lambda e^{-\lambda \tau}\, d\tau = 1 - e^{-\lambda t}, \!</math>

which can be seen to gradually approach <math>1</math> as <math>t \to \infty,</math> representing the fact that eventually all systems under study will fail.

The hazard rate function is then:

<math>h(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda .</math>

In other words, in this particular case only, the hazard rate is constant over time.

This illustrates the difference between hazard rate and failure probability density: as the number of systems surviving at time <math>t > 0</math> gradually reduces, the total failure rate also reduces, but the hazard rate remains constant. In other words, the probabilities of each individual system failing do not change over time as the systems age; they are "memoryless".
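
The memoryless property can be illustrated with a short sketch. It assumes the exponential model above with an arbitrary rate of 0.0005 per hour, and checks that the probability of surviving a further 500 hours is the same whatever age the system has already reached; the function names are illustrative.

```python
import math


def reliability(t: float, rate: float) -> float:
    """R(t) = exp(-rate * t) for the constant hazard rate model."""
    return math.exp(-rate * t)


def conditional_survival(age: float, extra: float, rate: float) -> float:
    """P(T > age + extra | T > age) = R(age + extra) / R(age)."""
    return reliability(age + extra, rate) / reliability(age, rate)


rate = 0.0005  # assumed value, per hour
for age in (0.0, 1000.0, 5000.0):
    print(age, conditional_survival(age, 500.0, rate))
print(reliability(500.0, rate))  # same value: age does not matter
```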

Other models

Figure: Hazard function <math>h(t)</math> plotted for a selection of log-logistic distributions, any of which could be used as a hazard rate, depending on the system under study.

For many systems, a constant hazard function may not be a realistic approximation; the chance of failure of an individual component may depend on its age. Therefore, other distributions are often used.

For example, the deterministic distribution increases hazard rate over time (for systems where wear-out is the most important factor), while the Pareto distribution decreases it (for systems where early-life failures are more common). The commonly used Weibull distribution covers both of these effects (its hazard rate decreases or increases depending on its shape parameter), as do the log-normal and hypertabastic distributions.

After modelling a given distribution and parameters for <math>h(t)</math>, the failure probability density <math>f(t)</math> and cumulative failure distribution <math>F(t)</math> can be predicted using the given equations.
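
As a sketch of how one such model behaves, the following uses the standard two-parameter Weibull hazard <math>h(t) = (k/\eta)(t/\eta)^{k-1}</math>. The symbols <math>k</math> (shape) and <math>\eta</math> (scale), along with the numerical values, are assumptions introduced here for illustration.

```python
import math


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_cumulative_failure(t: float, shape: float, scale: float) -> float:
    """F(t) = 1 - exp(-(t / scale) ** shape)."""
    return 1.0 - math.exp(-((t / scale) ** shape))


scale = 1000.0  # hours, illustrative
for shape in (0.5, 1.0, 2.0):
    hazards = [weibull_hazard(t, shape, scale) for t in (100.0, 1000.0, 5000.0)]
    print(shape, hazards)
```

A shape below 1 gives a decreasing hazard (early-life failures), a shape of exactly 1 reduces to the constant hazard rate model, and a shape above 1 gives an increasing hazard (wear-out).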

Measuring failure rate

Failure rate data can be obtained in several ways. The most common means are:

Estimation: From field failure rate reports, statistical analysis techniques can be used to estimate failure rates. For accurate failure rates the analyst must have a good understanding of equipment operation, procedures for data collection, the key environmental variables impacting failure rates, how the equipment is used at the system level, and how the failure data will be used by system designers.
Historical data about the device or system under consideration: Many organizations maintain internal databases of failure information on the devices or systems that they produce, which can be used to calculate failure rates for those devices or systems. For new devices or systems, the historical data for similar devices or systems can serve as a useful estimate.
Government and commercial failure rate data: Handbooks of failure rate data for various components are available from government and commercial sources. MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, is a military standard that provides failure rate data for many military electronic components. Several failure rate data sources are available commercially that focus on commercial components, including some non-electronic components.
Prediction: Time lag is one of the serious drawbacks of all failure rate estimations. Often by the time the failure rate data are available, the devices under study have become obsolete. Due to this drawback, failure-rate prediction methods have been developed. These methods may be used on newly designed devices to predict the device's failure rates and failure modes. Two approaches have become well known, Cycle Testing and FMEDA.
Life Testing: The most accurate source of data is testing samples of the actual devices or systems in order to generate failure data. This is often prohibitively expensive or impractical, so the previous data sources are often used instead.
Cycle Testing: Mechanical movement is the predominant failure mechanism causing mechanical and electromechanical devices to wear out. For many devices, the wear-out failure point is measured by the number of cycles performed before the device fails, and can be discovered by cycle testing. In cycle testing, a device is cycled as rapidly as practical until it fails. When a collection of these devices is tested, the test will run until 10% of the units fail dangerously.
FMEDA: Failure modes, effects, and diagnostic analysis (FMEDA) is a systematic analysis technique to obtain subsystem / product level failure rates, failure modes and design strength. The FMEDA technique considers:

All components of a design,
The functionality of each component,
The failure modes of each component,
The effect of each component failure mode on the product functionality,
The ability of any automatic diagnostics to detect the failure,
The design strength (de-rating, safety factors) and
The operational profile (environmental stress factors).

Given a component database calibrated with field failure data that is reasonably accurate,<ref>Template:Cite book</ref> the method can predict product level failure rate and failure mode data for a given application. The predictions have been shown to be more accurate<ref>Template:Cite book</ref> than field warranty return analysis or even typical field failure analysis, given that these methods depend on reports that typically do not have sufficiently detailed information in failure records.<ref>W. M. Goble, "Field Failure Data – the Good, the Bad and the Ugly," exida, Sellersville, PA</ref>

Examples

Decreasing failure rates

A decreasing failure rate describes cases where early-life failures are common<ref>Template:Cite book</ref> and corresponds to the situation where <math>h(t)</math> is a decreasing function.

This can describe, for example, the period of infant mortality in humans, or the early failure of transistors due to manufacturing defects.

Decreasing failure rates have been found in the lifetimes of spacecraft - Baker and Baker commenting that "those spacecraft that last, last on and on."<ref>Template:Cite journal</ref><ref>Template:Cite book</ref>

The hazard rate of aircraft air conditioning systems was found to have an exponentially decreasing distribution.<ref name="proschan">Template:Cite journal</ref>

Renewal processes

In special processes called renewal processes, where the time to recover from failure can be neglected, the likelihood of failure remains constant with respect to time.

For a renewal process with DFR renewal function, inter-renewal times are concave.<ref name="brown1980" /><ref name="shanthikumar">Template:Cite journal</ref> Brown conjectured the converse, that DFR is also necessary for the inter-renewal times to be concave;<ref>Template:Cite journal</ref> however, it has been shown that this conjecture holds neither in the discrete case<ref name="shanthikumar" /> nor in the continuous case.<ref>Template:Cite journal</ref>

Coefficient of variation

When the failure rate is decreasing, the coefficient of variation is ⩾ 1, and when the failure rate is increasing, the coefficient of variation is ⩽ 1.<ref>Template:Cite journal</ref> Note that this result only holds when the failure rate is defined for all t ⩾ 0<ref>Template:Cite book</ref> and that the converse result (coefficient of variation determining nature of failure rate) does not hold.
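
One way to see this relationship is with the Weibull distribution, whose hazard rate decreases for a shape parameter below 1 and increases for a shape parameter above 1. The sketch below computes its coefficient of variation (standard deviation divided by mean) from the standard gamma-function expressions for the Weibull moments; the choice of the Weibull family and of the shape values is an assumption made for illustration.

```python
import math


def weibull_coefficient_of_variation(shape: float) -> float:
    """Standard deviation / mean of a Weibull lifetime (independent of scale).

    mean     = scale * gamma(1 + 1/shape)
    variance = scale**2 * (gamma(1 + 2/shape) - gamma(1 + 1/shape)**2)
    """
    if shape <= 0:
        raise ValueError("shape must be positive")
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return math.sqrt(g2 / (g1 * g1) - 1.0)


for shape in (0.5, 1.0, 2.0):
    print(shape, weibull_coefficient_of_variation(shape))
```

The decreasing-hazard case (shape 0.5) gives a value above 1, the constant-hazard case (shape 1) gives a value of 1, and the increasing-hazard case (shape 2) gives a value below 1.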

Units

Failure rates can be expressed using any measure of time, but hours is the most common unit in practice. Other units, such as miles, revolutions, etc., can also be used in place of "time" units.

Failure rates are often expressed in engineering notation as failures per million hours, or 10⁻⁶ per hour, especially for individual components, since their failure rates are often very low.

The Failures In Time (FIT) rate of a device is the number of failures that can be expected in one billion (10⁹) device-hours of operation<ref> Xin Li; Michael C. Huang; Kai Shen; Lingkun Chu. "A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility". 2010. p. 6. </ref> (e.g. 1,000 devices for 1,000,000 hours, or 1,000,000 devices for 1,000 hours each, or some other combination). This term is used particularly by the semiconductor industry.
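
The unit conversions described above can be sketched as follows, assuming a constant failure rate expressed per device-hour; the function names are illustrative.

```python
DEVICE_HOURS_PER_FIT = 1e9  # one FIT = one expected failure per 10^9 device-hours


def fit_from_rate(failures_per_hour: float) -> float:
    """Convert a failure rate in failures per device-hour to FIT."""
    return failures_per_hour * DEVICE_HOURS_PER_FIT


def rate_from_fit(fit: float) -> float:
    """Convert FIT back to failures per device-hour."""
    return fit / DEVICE_HOURS_PER_FIT


def expected_failures(fit: float, devices: int, hours_each: float) -> float:
    """Expected number of failures for a population run for `hours_each` hours."""
    return rate_from_fit(fit) * devices * hours_each


# Both combinations from the text amount to 10^9 device-hours.
print(expected_failures(1.0, devices=1_000, hours_each=1_000_000))
print(expected_failures(1.0, devices=1_000_000, hours_each=1_000))
print(fit_from_rate(1e-6))  # one failure per million hours, expressed in FIT
```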

Combinations of failure types

If a complex system consists of many parts, and the failure of any single part means the failure of the entire system, then the total failure rate is simply the sum of the individual failure rates of its parts:

<math>\lambda_S = \lambda_{P1} + \lambda_{P2} + \ldots</math>

However, this assumes that the failure rate <math>\lambda(t)</math> is constant, and that the units are consistent (e.g. failures per million hours), and not expressed as a ratio or as probability densities. This is useful to estimate the failure rate of a system when individual components or subsystems have already been tested.<ref> "Reliability Basics". 2010. </ref><ref>Vita Faraci. "Calculating Failure Rates of Series/Parallel Networks". 2006.</ref>

Adding "redundant" components to eliminate a single point of failure may thus actually increase the failure rate, but it reduces the "mission failure" rate, and so improves the "mean time between critical failures" (MTBCF).<ref> "Mission Reliability and Logistics Reliability: A Design Paradox". </ref>
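
The summation rule for constant failure rates can be sketched as below. The component values are hypothetical and are all expressed in failures per million hours; the function name is illustrative.

```python
def total_failure_rate(component_rates):
    """Sum constant failure rates that share the same units.

    Every failure of any part is counted, so this is the overall
    (logistics) failure rate, not the mission failure rate.
    """
    rates = list(component_rates)
    if any(rate < 0 for rate in rates):
        raise ValueError("failure rates must be non-negative")
    return sum(rates)


# Hypothetical parts, in failures per million hours.
parts = [2.0, 5.0, 1.0]
print(total_failure_rate(parts))

# Duplicating the least reliable part for redundancy adds another term,
# so the total number of part failures per million hours goes up.
print(total_failure_rate(parts + [5.0]))
```

Modelling the resulting reduction in mission failures needs a separate parallel-system calculation, which this sketch does not attempt.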

Combining failure or hazard rates that are time-dependent is more complicated. For example, mixtures of Decreasing Failure Rate (DFR) variables are also DFR.<ref name="brown1980">Template:Cite journal</ref> Mixtures of exponentially distributed failure rates are hyperexponentially distributed.
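
A mixture of exponential lifetimes gives a concrete case of a decreasing hazard rate. In the sketch below a population is assumed to be an even mix of two sub-populations with different constant rates; the weights, rates, and function name are illustrative assumptions.

```python
import math


def mixture_hazard(t, weights, rates):
    """Hazard of a mixture of exponentials: f(t) / R(t).

    f(t) = sum(w * r * exp(-r * t)),  R(t) = sum(w * exp(-r * t))
    """
    density = sum(w * r * math.exp(-r * t) for w, r in zip(weights, rates))
    survival = sum(w * math.exp(-r * t) for w, r in zip(weights, rates))
    return density / survival


weights = [0.5, 0.5]   # assumed share of each sub-population
rates = [0.001, 0.01]  # assumed constant rates, per hour
for t in (0.0, 100.0, 500.0, 1000.0, 2000.0):
    print(t, mixture_hazard(t, weights, rates))
```

The hazard starts at the weighted average of the two rates and falls towards the lower rate, because the weaker sub-population fails first and the survivors are increasingly drawn from the more reliable one.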

Simple example

Suppose it is desired to estimate the failure rate of a certain component. Ten identical components are each tested until they either fail or reach 1,000 hours, at which time the test is terminated. A total of 7,502 component-hours of testing is performed, and 6 failures are recorded.

The estimated failure rate is:

<math>\frac{6\text{ failures}}{7502\text{ hours}} = 0.0007998\, \frac{\text{failures}}{\text{hour}} </math>

which could also be expressed as an MTBF of approximately 1,250 hours, or approximately 800 failures for every million hours of operation.
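
The arithmetic of this example can be reproduced with a few lines; the variable names are illustrative, and the MTBF figure assumes a constant failure rate.

```python
failures = 6
component_hours = 7502.0  # total accumulated test time across all ten units

failure_rate = failures / component_hours     # failures per hour
mtbf_hours = component_hours / failures       # valid only for a constant rate
failures_per_million_hours = failure_rate * 1e6

print(f"{failure_rate:.7f} failures per hour")
print(f"MTBF of about {mtbf_hours:.0f} hours")
print(f"about {failures_per_million_hours:.0f} failures per million hours")
```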

External links

Bathtub curve issues, ASQC
Fault Tolerant Computing in Industrial Automation by Hubert Kirrmann, ABB Research Center, Switzerland
