
Reliability Equations for Functional Safety

This is one for the math geeks who might even be able to provide additional insights and corrections. Even if you are not a math geek, I believe everybody in functional safety needs to know something about these equations. At some point, you will have to call upon one or more of them. I will try and give an intuitive feel for the equations so hopefully it won’t be too bad. I am sure there are errors in the strict mathematical sense in what I have below, but it is hopefully accurate enough to get the meaning across and leave you with a good feeling about the topic.

The equations shown here are obviously for random hardware failures as opposed to systematic failure modes since systematic failure modes will occur with a probability of 1 if the right conditions arise.

This blog is inspired by sub-clause 6.2.2 of IEC 62308:2006. However, the topic is also well covered in the book “Reliability, Maintainability and Risk” by David J. Smith, which is available on Amazon for around Euro 40 and contains lots of other interesting discussions.

Figure 1 - front page of IEC 62308

The scope of IEC 62308 states “This International Standard describes reliability assessment methods for items. It applies to mission, safety, and business critical, high integrity, and complex electronic items. It contains information on why reliability is required and how and where the results of the assessment would be used. Finally, it details how the method of reliability assessment would be chosen and the data required supporting the assessment.”

So, let’s get stuck in with equation (1) which looks the most intimidating of the lot. Who doesn’t like a mix of integrals and exponents?

Figure 2 - The first equation for the reliability over time of an item

This first equation gives the reliability of an item assuming the failure rate is time dependent. At time 0, R(t)=1, and as t goes to infinity, R(t) goes to 0. R(t) can be read as the fraction of devices still surviving at time t. That is why the integration limits run from t to infinity: they represent all those devices yet to fail, which reminds me a bit of the joke about good health being simply the slowest speed at which you can die. The complement of R(t) is F(t), the fraction of devices that have failed up to time t, given by F(t)=1-R(t), where the “F” stands for failure.
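The equation itself was an image and does not survive in the text. As a hedged reconstruction from the description above, and not a quotation from the standard, it plausibly reads as follows. The dummy integration variable x is an assumption, λ is the failure rate and f is the failure probability density, both of which are discussed further down.

```latex
R(t) = \exp\!\left(-\int_{0}^{t} \lambda(x)\,dx\right) = \int_{t}^{\infty} f(x)\,dx,
\qquad
F(t) = 1 - R(t)
```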

λ(t) represents the failure rate of the device, which may or may not be constant over time. For functional safety we normally assume λ(t) is a constant value λ, so let’s do that, which allows us to simplify and explore the equation. We then have R(t)=exp(-λt), which makes it much more obvious that at t=0 we have R(t)=exp(0)=1 and as t→∞ we have R(t)=exp(-∞)=0, so all good.
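To get a feel for the numbers, here is a minimal Python sketch of R(t) and F(t) for a constant failure rate. The function names and the rate of 2e-6/h are illustrative assumptions, not values taken from the standard.

```python
import math

HOURS_PER_YEAR = 8760
LAM = 2e-6  # constant failure rate per hour (illustrative assumption)


def reliability(t_hours, lam):
    """R(t) = exp(-lam * t): fraction expected to survive to time t."""
    return math.exp(-lam * t_hours)


def unreliability(t_hours, lam):
    """F(t) = 1 - R(t): fraction expected to have failed by time t."""
    return 1.0 - reliability(t_hours, lam)


for years in (0, 1, 10, 20):
    t = years * HOURS_PER_YEAR
    print(f"{years:2d} y: R = {reliability(t, LAM):.4f}, "
          f"F = {unreliability(t, LAM):.4f}")
```

R starts at 1 and decays toward 0, while F does the opposite, so the two always sum to 1.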

Figure 3 - a plot showing the shape of R(t) and F(t) over approx. 20 years for a constant failure rate

Based on its use in the equations above, λ(t) represents the chance of a device failing in the interval t to t+dt, given that it has survived to time t (strictly speaking, that probability is λ(t)·dt). The fact that it has to survive to time t means this is referred to as a conditional probability. The condition is that it must have survived until then in order to fail in that interval.

For functional safety, dt is usually 1 hour, so λ(8760) represents the probability of a device failing in the first hour of its second year of operation, provided it has managed to survive for at least 1 year (8760 hours).

Stating that the failure rate is constant then means that if a device has survived for 1 year it has a chance of failure in the next hour of λ, and if a device has survived for 20 years it still has a chance of failing in the next hour of λ, with λ being independent of how long it has already been in operation.
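This “no memory” behaviour can be checked numerically. The following is a minimal Python sketch, assuming R(t)=exp(-λt) and an illustrative λ of 2e-6/h; the function name is hypothetical.

```python
import math

HOURS_PER_YEAR = 8760
LAM = 2e-6  # constant failure rate per hour (illustrative assumption)


def conditional_failure_prob(t_hours, dt_hours, lam):
    """P(fail in (t, t+dt] | survived to t) when R(t) = exp(-lam * t)."""
    r_t = math.exp(-lam * t_hours)
    r_t_dt = math.exp(-lam * (t_hours + dt_hours))
    # Fraction failing in the interval, divided by the fraction still alive at t
    return (r_t - r_t_dt) / r_t


p_after_1_year = conditional_failure_prob(1 * HOURS_PER_YEAR, 1.0, LAM)
p_after_20_years = conditional_failure_prob(20 * HOURS_PER_YEAR, 1.0, LAM)
print(p_after_1_year, p_after_20_years)
```

Both values come out essentially equal to λ, regardless of how long the device has already been running.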

Note – for functional safety we are often more interested in the dangerous undetected failure rate than the overall failure rate. Devices failing to a safe state are mostly somebody else’s problem. Therefore, λ is more likely to actually be λD, the dangerous failure rate, or even λDU, the dangerous undetected failure rate.

An aside – under the constant failure rate assumption, this means that if you replace a fully working device after, say, 10 years with a perfectly new device of the same design, you have gained nothing, since both will have a failure rate of λ in the next hour.

Also, it’s worth reminding you that if a device has a failure rate of 1e-9/h it doesn’t mean it will last a billion years. It rather means that if you have a billion units operating for 1 hour, you can expect 1 of them to fail in that hour. Similarly, if you have only 100k units and they operate for a year (8760 hours), you can expect roughly one of them (0.88 on average) to fail at some point in that year.
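Both counts are simply λ × units × hours, which is a reasonable approximation as long as only a small fraction of the population is expected to fail. A minimal Python sketch of that arithmetic follows; the function name is hypothetical.

```python
def expected_failures(lam_per_hour, units, hours):
    """Expected number of failures, valid while lam_per_hour * hours is small."""
    return lam_per_hour * units * hours


billion_units_one_hour = expected_failures(1e-9, 1e9, 1)
hundred_k_units_one_year = expected_failures(1e-9, 100_000, 8760)
print(billion_units_one_hour, hundred_k_units_one_year)
```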

Let’s move on to the second equation. Below, f(t) is shown as the negative of the derivative of the reliability function. For a constant failure rate, R(t) has a smoothly declining shape (see Figure 3 above), so its derivative is negative and the negative of that derivative is positive (not sure what a negative failure rate would mean). In strict mathematical language, f(t) is the failure probability density function.

f(t) also represents the probability of failure in the interval t to t+dt unconditionally, i.e. with no requirement to have survived until time t. In contrast to λ(t), even with a constant failure rate, f(t) after 1 year will be higher than f(t) after 20 years, since some of the devices will have already failed before then. How much higher depends on λ. By the time a device gets to 20 years, it has had to survive all those other years, so the chance that it is still around to fail in the next hour is smaller.

If 0.9 is the fraction of devices expected to survive for 1 year, and the fraction expected to survive at 1 year + 1 hour is 0.89, then f(t)=(0.9-0.89)/1=0.01 per hour at 1 year, i.e. -ΔR/Δt. Mathematically this is shown below.

Figure 4 - The second equation for failure probability density function

Note – quick refresh. To differentiate y=exp(f(x)) use dy/dx = (d(f(x))/dx) * exp(f(x))

Let’s take the constant failure rate example so that R(t)=exp(-λt). Then d(R(t))/dt = -λexp(-λt) = -λR(t), so that f(t)=λR(t). Since R(t)=1 at time 0 (all devices are still surviving) and R(t)=0 at time infinity, f(t) goes from λ to 0 over time when λ(t) is a constant λ.
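The relationship f(t)=λR(t) can be sanity-checked with a one-hour finite difference, in the same spirit as the ΔR/Δt illustration earlier. This is a minimal Python sketch, with an illustrative λ of 2e-6/h and names that are assumptions.

```python
import math

LAM = 2e-6   # constant failure rate per hour (illustrative assumption)
T = 8760.0   # evaluate at one year, in hours
DT = 1.0     # one-hour interval


def reliability(t_hours, lam=LAM):
    """R(t) = exp(-lam * t) for a constant failure rate."""
    return math.exp(-lam * t_hours)


# Finite-difference estimate of f(t) = -dR/dt over one hour
f_numeric = (reliability(T) - reliability(T + DT)) / DT

# Closed form for a constant failure rate: f(t) = lam * R(t)
f_closed = LAM * reliability(T)

print(f_numeric, f_closed)
```

The two values agree to within about one part in a million, which is the error you would expect from using a finite one-hour step instead of a true derivative.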

The third equation below then follows from replacing λ with λ(t) and rearranging the terms.
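Written out, a form of the second and third equations consistent with the text would be as follows. This is a reconstruction from the surrounding description, since the equation images are not reproduced in the text.

```latex
f(t) = -\frac{dR(t)}{dt} = \frac{dF(t)}{dt},
\qquad
\lambda(t) = \frac{f(t)}{R(t)} = -\frac{1}{R(t)}\,\frac{dR(t)}{dt}
```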

Figure 5 - third equation for the instantaneous failure rate

The plot below shows the conditional failure rate, λ(t), and the unconditional failure rate, f(t), plotted over a time of approximately 20 years for a constant λ(t) of 2e-6/h. In theory λ(t) is constant to infinity but f(t) “quickly” falls to zero as no devices are left to fail (in a probabilistic sense).
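The values behind such a plot can be tabulated with a few lines of Python. This is a minimal sketch assuming the constant rate of 2e-6/h quoted above; the function name and the chosen years are illustrative.

```python
import math

HOURS_PER_YEAR = 8760
LAM = 2e-6  # constant conditional failure rate per hour


def density(t_hours, lam=LAM):
    """Unconditional failure density f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t_hours)


rows = []
for years in (0, 1, 5, 10, 20):
    t = years * HOURS_PER_YEAR
    f_t = density(t)
    hazard = f_t / math.exp(-LAM * t)  # lambda(t) = f(t) / R(t)
    rows.append((years, hazard, f_t))
    print(f"{years:2d} y  lambda(t) = {hazard:.3e}/h  f(t) = {f_t:.3e}/h")
```

With this particular λ, f(t) at 20 years is still roughly 70% of its starting value. The fall toward zero plays out over multiples of 1/λ, which here is 500,000 hours.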

Figure 6 - plot of the conditional and unconditional failure rates

Finally, we are left with the relationship between reliability and mean time to failure (MTTF). Looking at the Smith book for guidance, consider R(t) as follows. If there are N items and NS(t) is the number of devices still surviving at time t, then R(t)=NS(t)/N.

Note – NS stands for Number Surviving.

Suppose we want to know how many operating hours we can expect from the N units before all the devices have failed. In every time interval dt, the total operating hours increase by NS(t)*dt. Therefore, the total operating hours are given by the integral from 0 to ∞ of NS(t) with respect to t. The average expected time for one device is then that integral divided by N, and since NS(t)/N = R(t), this gives the equation below.
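In symbols, the argument above amounts to the following. This is a reconstruction from the derivation in the text, with N_S written as a subscripted form of NS.

```latex
\mathrm{MTTF} = \frac{1}{N}\int_{0}^{\infty} N_S(t)\,dt = \int_{0}^{\infty} R(t)\,dt
```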

Figure 7 - fourth equation showing the average mean time to failure

The derivation makes it clear what MTTF is. If you have 5 devices and they survive for 100 hours, 1000 hours, 2000 hours, 5000 hours, and 10000 hours then the total operating time for the 5 devices is 18,100 hours, and the MTTF = 18100/5 = 3620 hours.
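The same five-device example can be run in Python, both as a plain average and by integrating the number of survivors NS(t) as a step function, which is what the derivation describes. This is a minimal sketch; the function name is hypothetical.

```python
lifetimes_hours = [100, 1000, 2000, 5000, 10000]

# Plain average of the observed lifetimes
total_hours = sum(lifetimes_hours)
mttf = total_hours / len(lifetimes_hours)


def survivor_area(lifetimes):
    """Integrate N_S(t) dt as a step function between successive failures."""
    area = 0.0
    previous = 0.0
    surviving = len(lifetimes)
    for t_fail in sorted(lifetimes):
        area += surviving * (t_fail - previous)  # hours added by the survivors
        previous = t_fail
        surviving -= 1
    return area


print(total_hours, mttf, survivor_area(lifetimes_hours))
```

Both routes give the same total operating time, so dividing either by N yields the same MTTF.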

Finally, let's summarize the equations from IEC 62308 for the assumption of a constant failure rate as normally found in functional safety.
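For a constant λ, the relationships derived above reduce to the following set. This is a reconstruction that follows from the earlier derivations, not a copy of the figure.

```latex
R(t) = e^{-\lambda t}, \qquad
F(t) = 1 - e^{-\lambda t}, \qquad
f(t) = \lambda e^{-\lambda t}, \qquad
\lambda(t) = \lambda, \qquad
\mathrm{MTTF} = \int_{0}^{\infty} e^{-\lambda t}\,dt = \frac{1}{\lambda}
```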

Figure 8 - above equations if lambda is a constant

Not stated above, but derivable from the math: if the MTTF is X and the failure rate is constant, then 63.2% of the items will have failed after X hours of operation, because F(X)=1-exp(-1)=0.632. Also worth stating is that, for an assumed constant failure rate, 1 device operating for 1 million hours is the same as 1 million units operating for 1 hour in terms of expected failures.

You can also manipulate the equations in other ways. For instance, if an IC has a constant failure rate of 1000 FIT (1 FIT is one failure per billion device-hours, so 1000 FIT is 1e-6/h), what is the chance that it will survive for 20 years? R(t)=exp(-1e-6*20*8760)≈0.84.
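Both the 0.84 survival figure and the 63.2% figure can be reproduced with a short Python sketch. The function names are hypothetical, and a constant failure rate is assumed throughout.

```python
import math

HOURS_PER_YEAR = 8760


def fit_to_rate(fit):
    """Convert FIT (failures per 1e9 device-hours) to failures per hour."""
    return fit * 1e-9


def survival_probability(fit, years):
    """R(t) = exp(-lambda * t) for a constant failure rate given in FIT."""
    return math.exp(-fit_to_rate(fit) * years * HOURS_PER_YEAR)


def fraction_failed_at_mttf():
    """With constant lambda, MTTF = 1/lambda, so F(MTTF) = 1 - exp(-1)."""
    return 1.0 - math.exp(-1.0)


print(round(survival_probability(1000, 20), 2))  # 0.84
print(round(fraction_failed_at_mttf(), 3))       # 0.632
```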

Hopefully, this blog will give you the confidence not to be afraid of the maths around reliability. Hopefully, it will empower you to try the equations.

Hopefully, I didn’t mess up too badly in my above mathematical endeavors.

Hopefully, no real mathematician ever looks too closely at this blog.

Some Useful Links

For a previous blog on the reliability predictions for integrated circuits – see here

For a previous blog on random vs systematic failure modes – see here

The Analog Devices reliability handbook has some good stuff on semiconductor reliability – see here

Reliability data based on HTOL (high temperature operating life) testing in the lab for all ADI released products – see here

For the full set of blogs in this series – see here