The Math Behind Proven in Use

I’m going to start this blog with a Stephen Hawking quote: “Someone told me that each equation I included in the book would halve the sales.” Nevertheless, I decided to proceed with this topic; there is only one equation, so I will take just the one 50% hit on the number of readers. Stick with me, the math is not too bad.

IEC 61508 and ISO 26262 both offer “proven in use” as an alternative path to claiming compliance. In IEC 61508 terminology, proven in use is referred to as route 2S. The more common route 1S means the item was developed in compliance with all applicable requirements of the standard. Route 2S is used when the item wasn’t developed in compliance with IEC 61508 but has enough operating experience available to indicate its safety. It seems reasonable that something which has operated for a long time and never shown any problems is good enough to use in a safety system. That, at least, is the premise behind route 2S. The route is available for both hardware and software, but in this blog I will concentrate more on the hardware side.

Below are the most important tables from both IEC 61508 and ISO 26262. While IEC 61508 hints at the math behind its table, this blog might save you having to figure it out for yourself. I always like to have an understanding of anything I apply; otherwise it is very easy to misapply. For ISO 26262 little justification is given, so this blog could be valuable to anybody who wants to better understand that standard. It summarizes my experience of trying to justify the numbers and equations behind proven in use. It will not discuss the merits of applying route 2S to software versus hardware; it just looks at the numbers and the math.

Figure 1 - The relevant table from IEC 61508-7:2010 Annex D

The equivalent of this from ISO 26262 is given below.

Figure 2 - Table from ISO 26262-8:2018

Let’s start with the last column of the IEC 61508 table, which gives the numbers for a continuous or high demand safety function at a 95% confidence level. This is very similar to the math from a previous blog on doing reliability predictions at various confidence levels, see here.

For large values of k, the failure rate can be estimated as λ = k/T, where k is the number of failures and T is the total operating time. A large value might be 10 or more failures. So, if you operate for 100 million hours (roughly 10,000 devices running for a year) and get 10 failures, a good estimate of the failure rate is 10/1e8 = 100e-9 per hour, or 100 FIT. But what if you get 0 failures? Then you would estimate the failure rate as zero, which is an implausible value. You then need a statistical interpretation of the data.

Some people like to work with ϴ = 1/λ instead of λ, where ϴ = MTTF (mean time to failure). So for λ = 100e-9 per hour we have MTTF = 10 million hours. Please remember we don’t expect a single item to run for 10 million hours; rather, this expresses the expected cumulative operating time of a large batch of units before one of them fails, provided the stated lifetime of perhaps 20 years is not exceeded for any of the items. If you exceed the rated lifetime of the part, the failure rate will start to increase dramatically.

Now it has been shown, by someone with a lot more time to spend on math than me, that for a constant failure rate the quantity 2Tλ follows a chi-squared distribution, represented by χ², with 2k + 2 degrees of freedom when deriving a one-sided upper bound from k observed failures.
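Written out in the blog’s notation (this is the standard textbook result for a constant failure rate, not a quote from either standard), the one-sided upper bound on the failure rate after observing k failures in a total operating time T is:

λ_upper = χ²(CL, 2k + 2) / (2T)

where χ²(CL, 2k + 2) is the CL-th percentile of the chi-squared distribution with 2k + 2 degrees of freedom. Fix λ_upper at the failure rate you want to demonstrate and solve for T, and you get the required service time, which is exactly what the automotive formula below does.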

Our automotive colleagues put it best when they say that

Figure 3 - required service hours formula from ISO 26262-8:2018 clause 14
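Since figure 3 is an image, here is the formula written out; this reconstruction follows from the upper-bound expression above and matches the worked examples later in this blog:

tservice = ( χ²(CL, 2f + 2) / 2 ) × tMTTF

where χ²(CL, 2f + 2) is again the CL-th percentile of the chi-squared distribution, i.e. the value you read from a chi-squared table at an upper-tail probability of p = 1 − CL.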

In this formula

  • f is the number of observed failures which in this case we assume is 0
  • CL is the required confidence level which we assume is 95%
  • tMTTF is the mean time to failure we wish to demonstrate

Then a total operating time of tservice is required to demonstrate that MTTF at that confidence level.

Note – if we use a confidence level of 95%, then we are 95% confident that the true value of the MTTF is greater than the calculated MTTF.

So, let’s work through the calculation. Once you understand it you can use Excel to do the math for you and forget all about the details.

The table below represents the chi-squared distribution. The first column shows the degrees of freedom (2f + 2) and the first row the upper-tail probability p = 1 − CL corresponding to the required confidence level.

So, for f = 0 (zero failures) we have df = 2, and for 95% confidence we have p = 0.05 (1 − 95/100) and we read off 5.991 from the table.

The PFH range for SIL 2 allows a dangerous failure rate per hour of 1e-7 to 1e-6. Putting tMTTF = 1/1e-7 (remember the failure rate is 1/tMTTF, and we use the value of λ from the more demanding, lower end of the band) we get the required number of hours as (1/1e-7) × 5.991/2 ≈ 30 million hours, which agrees with the table in figure 1 above.
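As a quick sanity check on the 95% confidence claim (my own illustration, not anything from the standards): if an item’s true failure rate were exactly at the 1e-7/h limit, it would only pass a 30-million-hour zero-failure demonstration about 5% of the time. The short Python snippet below simulates many such demonstration campaigns:

    import numpy as np

    rng = np.random.default_rng(0)
    true_lambda = 1e-7                      # true failure rate per hour, exactly at the limit
    t_service = 5.991 / (2 * true_lambda)   # hours from the zero-failure formula at 95%

    # For a constant failure rate, the number of failures seen in t_service
    # hours is Poisson distributed; a campaign "passes" if it sees zero failures
    failures = rng.poisson(true_lambda * t_service, 100_000)
    print((failures == 0).mean())           # ~0.05: an item this bad passes only ~5% of the time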

The second-to-last column of figure 1 is then easy. Instead of 5.991 we read off the value for p = 0.01 (1 − 99/100), which is 9.21, to get a required service time of 46 million hours.

Figure 4 - table showing the chi-squared distribution
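If you would rather compute the quantiles than read them off a table, here is a minimal Python sketch using scipy (the function name and structure are mine, not from either standard):

    from scipy.stats import chi2

    def required_service_hours(mttf_hours, confidence=0.95, failures=0):
        """Cumulative operating hours needed to demonstrate mttf_hours at the
        given one-sided confidence level, with the stated number of failures."""
        dof = 2 * failures + 2
        # scipy's ppf takes the cumulative probability, so the 95% value is
        # chi2.ppf(0.95, 2) = 5.991, i.e. the table entry at p = 0.05
        return mttf_hours * chi2.ppf(confidence, dof) / 2

    # SIL 2 high demand/continuous: lambda = 1e-7/h, so tMTTF = 1e7 hours
    print(required_service_hours(1e7, 0.95))  # ~3.0e7 h (30 million, figure 1)
    print(required_service_hours(1e7, 0.99))  # ~4.6e7 h (46 million, figure 1)
    print(required_service_hours(1e7, 0.70))  # ~1.2e7 h (12 million, figure 2)

In Excel, the equivalent of chi2.ppf(0.95, 2) is =CHISQ.INV(0.95, 2).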

The math works for any number of observed failures; you would simply require a longer observation period to achieve the same confidence that the failure rate is sufficiently low. However, some people view any systematic failure as unacceptable and say you should fix the cause of the failure and start again. Such people would say the number of failures used should always be zero and the df therefore = 2. I think there are many issues with this attitude, including the difficulty of deciding whether a failure from the field is systematic or not, but I won’t get into that debate today as I am only trying to explain where the numbers come from. I do note, however, that while IEC 61508 uses this math to demonstrate sufficient systematic integrity, our automotive colleagues use it to demonstrate a sufficiently low failure rate covering both random and systematic failure modes (see for instance ISO 26262-5 5.8.3 and ISO 26262-8:2018 14.2).

The numbers from ISO 26262, as shown above in figure 2, are different from those in figure 1 for IEC 61508. This is because ISO 26262 only requires a 70% confidence level. So, once again reading the table for df = 2 and p = 1 − 70/100 = 0.3, we get a required service time of (1/1e-7) × 2.41/2 ≈ 12 million hours.

Automotive only has high demand or continuous mode operation. IEC 61508 also has a low demand mode, defined as a demand rate of no more than one per year. To get the values in column 5 of the table in figure 1 we now assume the demand rate is exactly one per year (which is the worst case, i.e. the highest allowed demand rate). On that assumption of roughly 10,000 hours per demand, you simply take the 95% and 99% confidence values for high demand/continuous mode and divide by 10,000. So, 30 million hours becomes 3,000 demands.
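As a quick check on that division (again just a sketch, using the approximation of one demand per 10,000 operating hours):

    from scipy.stats import chi2

    hours_per_demand = 10_000                    # worst case: exactly one demand per year
    t_service = 1e7 * chi2.ppf(0.95, 2) / 2      # 30 million hours, as calculated above
    print(round(t_service / hours_per_demand))   # 2996, i.e. roughly 3,000 demands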

You could argue that using the lower end of the PFH range for a given SIL is conservative, e.g. using 1e-7/h for SIL 2 when the range runs from 1e-7/h to 1e-6/h. You could argue that if you had done a quantitative SIL determination and determined that your PFH needs to be 5.3e-7/h (which is in the SIL 2 range), then you should use that value instead of 1e-7/h. However, machinery safety and ISO 26262 typically use risk graphs, and the assumption is that there is just an upper limit on the dangerous failure rate. Conversely, you could argue that the standard writers used the value from the bottom of the range for a given SIL because the element or component under evaluation is only part of a safety function, so roughly 10% of the budget is allocated to a specific item (10% of the SIL 2 upper limit of 1e-6/h gives the 1e-7/h used above).

Similar concepts to proven in use across the functional safety standards include:

  • Prior use from IEC 61511
  • Field experience from IEC 61508
  • Route 2H from IEC 61508
  • Product service experience from DO-254

Some problems with relying on proven in use alone include:

  • Software failures don’t really depend on time but rather on things like the number of errors in the code, where in the code the errors are, the way in which the code is used, and the sequence and variability of the input parameters
  • Calendar time vs operational time
  • Failures can be hidden by system level redundancy
  • Low consequence failures not reported
  • Are 1,000 items operating for 1,000 hours really the same as one item operating for a million hours?
  • Not all field failures are reported; unlike permanent hardware failures they can quickly disappear and be hard to reproduce, and we are used to tolerating software glitches
  • Shipped items could sit in storage as spares
  • It is hard to distinguish random hardware failures from systematic failures
  • Systematic failures might only emerge for a particular set of inputs, but if those situations do arise the failure will always occur
  • Is there an acceptable failure rate for systematic failures?
  • In Ireland at least, all adverts for investment products state “past returns are not evidence of future returns” or something like that. But that is exactly what you are doing with proven in use: using historical data to predict future performance
  • Using Chi-squared to calculate confidence intervals assumes the failure rate is constant and that times to failure are exponentially distributed

IEC 61508-2:2010 7.4.10 gives additional requirements when making a proven in use (route 2S) claim. These include, for instance, 7.4.10.3, which requires an impact analysis on any differences between the old and new operating environments. The standard doesn’t give any clues as to what might make an impact, but as a mostly hardware guy I suggest it might include the operating temperature range, a faster clock speed, a different power supply with a faster ramp rate, a different spread of input variables…

But remember, the safety standard contains the minimum you need to do to claim compliance with the standard. That sounds a bit negative, but it is not; it is simply a fact. So, a safety case for a proven in use candidate might include additional things like:

  • Details of the development process used to develop the software, even if it wasn’t an IEC 61508 compliant development process (if it was, you would claim route 1S rather than route 2S).
  • Information from the field experience argument showing the item was designed into 10 different applications (which means it was validated by 10 different teams and exposed to a greater combination of inputs).

Remember, your goal is to make it easier for your independent assessor to say yes. The more feel-good information you can provide (perhaps “additional mitigations” sounds more professional), the happier your assessor will feel in what is often a matter of engineering judgement.

Of course, the item for which you claim route 2S is typically being incorporated into a system developed to IEC 61508 and so will be verified and validated along with the rest of the system. This means, at the very least, that any obvious errors should be discovered during integration.

This blog ended up being far more technical than I had hoped. In my head I felt I understood the math better than I obviously do, because I struggled to describe it simply. Anyway, I hope you found this exploration of a dark corner of the functional safety standards useful. Perhaps, as I once did, you will read this, develop a kind of hand-wavy understanding of the topic, file it away as something you understand, and never really look at it again.

I started with a quote from Stephen Hawking, so I will finish with one from Einstein: “The hardest thing in the world to understand is the income tax.” Therefore, if you can do your taxes, you should be good for the above blog.

To learn more: