Does Reliability Prediction Uncertainty Justify Mandating Two Channel Safety?

I have often blogged about my dislike of standards that mandate two-channel safety. Thankfully this is becoming less common, but some people, mostly in machinery safety, still want mandatory redundancy, usually expressed as a requirement for CAT 3 architectures. In this blog, I will concentrate on one of the most frequent justifications given for such requirements: the uncertainty around reliability numbers. The concerns mostly relate to older mechanical technology, with most people accepting that reliability predictions for newer technology such as semiconductors are better.

Since many of the advocates for two-channel safety come from the machinery sector, let's use an argument based on ISO 13849. There is a figure in the standard that shows the combinations of CAT (category – one of 5 standard architectures), DC (diagnostic coverage), and MTTFd (mean time to dangerous failure) that can be used to achieve various PL (performance level – a measure of the safety required or achieved by a given design).

Figure 1 - Chart showing how to combine MTTFd, DC, and CAT to achieve the required PL

So, if our hazard analysis and risk assessment show we need a PL d safety function, the chart shows we could achieve this with either a) a CAT 2 architecture with high MTTFd (in the range of 30 to 100 years) and a DC of low (60%) to medium (90%), or b) a CAT 3 architecture with high MTTFd and a DC of low (60%) to medium (90%).

Annex K of ISO 13849-1 gives more granularity on the data than shown in the chart above. Annex K is based on Markov modeling (see links below) of the various architectures/categories. So, let's take a CAT 2 and a CAT 3 architecture which both give PL d.

So, for PL d we need a PFHd in the range of 1e-6/h to 1e-7/h.

Design solution 1 – A CAT 2 (single channel) architecture with a DC of 90% and an MTTFd of 75 years.

Design solution 2 – A CAT 3 (redundant) architecture with a DC of 60% and an MTTFd for each channel of 47 years.

Both solutions give a PFHd of approximately 3.4e-7/h, so they perform equivalently with respect to random hardware failures. Both are in the PL d range and so meet the PL d criteria for tolerance of random hardware failures.

Figure 2 - Comparing reliability uncertainties for non-redundant and redundant architectures based on Annex K of ISO 13849-1.

Now let’s suppose our reliability numbers are bad. Let’s say optimistic by a factor of 2.

So, for design solution 1, using a single channel, the MTTFd goes from 75 to 36 years (not exactly halved, but as close as the table will give), and the PFHd becomes 9.39e-7/h.

For design solution 2, using a redundant architecture, the MTTFd goes from 47 to 24 years, and the PFHd rises to 9.47e-7/h.

For both the redundant and non-redundant solutions we still have equivalent PFHd, and both still give PL d. So, if the reliability estimates for the hardware are out by a factor of 2, it doesn’t matter whether the solution is single channel or redundant: you get a similar degradation in PFHd. Advocating for a redundant architecture based on reliability uncertainties therefore makes little sense to me. Perhaps those in favor of CAT 3 would claim that it is less likely that the predictions for both channels in a two-channel system would be optimistic, but I’m not sure I believe that.
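As a quick sanity check on the comparison above, the following minimal Python sketch takes the four PFHd values quoted in this blog and computes how much each design degrades when the MTTFd is roughly halved. The dictionary layout and the helper names `in_pl_d` and `degradation` are my own for illustration, and treating the PL d band as including its lower limit and excluding its upper limit is an assumption.

```python
# PFHd values (per hour) as quoted in the text, read from Annex K of ISO 13849-1.
# The nominal value is "approximately 3.4e-7/h" for both designs.
PL_D_MIN = 1e-7  # lower edge of the PL d band (assumed inclusive)
PL_D_MAX = 1e-6  # upper edge of the PL d band (assumed exclusive)

designs = {
    "Design 1: CAT 2, DC 90%": {"nominal": 3.4e-7, "halved_mttfd": 9.39e-7},
    "Design 2: CAT 3, DC 60%": {"nominal": 3.4e-7, "halved_mttfd": 9.47e-7},
}


def in_pl_d(pfhd):
    """Return True if the PFHd falls inside the PL d band."""
    return PL_D_MIN <= pfhd < PL_D_MAX


def degradation(design):
    """Ratio of the degraded PFHd to the nominal PFHd."""
    return design["halved_mttfd"] / design["nominal"]


for name, values in designs.items():
    print(
        f"{name}: PFHd worse by x{degradation(values):.2f}, "
        f"still PL d: {in_pl_d(values['halved_mttfd'])}"
    )
```

Both designs degrade by a factor of roughly 2.8 and both stay inside the PL d band, which is the point of the comparison: the redundant architecture is no more forgiving of an optimistic MTTFd than the single-channel one.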

So how do standards protect against uncertainty in the reliability numbers?

  • ISO 13849 protects against an over-reliance on reliability by generally limiting the maximum MTTFd you can claim to 100 years (1141 FIT).
  • IEC 61508 uses confidence levels. So rather than base the reliability on a value that is likely to be the average (50% confidence), IEC 61508 looks for a 70% confidence level as standard and 90% or even 99% in some cases.
  • In addition, standards such as SN29500, which are often used as a source of reliability data, don’t quote a confidence level, but I have seen claims that they are at the 99% level. They also mix systematic and random failures, which means they could be pessimistic by another factor of 2 if you are only analyzing random hardware failures (systematic failures should be tackled using a rigorous development process).
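The 100 year cap and the 1141 FIT figure in the first bullet are the same limit in different units. A minimal sketch of the conversion, assuming a constant failure rate and 8760 hours per year (the function names are hypothetical):

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 h; assumption, ignores leap years
FIT_HOURS = 1e9        # 1 FIT = 1 failure per 1e9 device hours


def mttfd_years_to_fit(mttfd_years):
    """Convert an MTTFd in years to a dangerous failure rate in FIT.

    Assumes a constant failure rate, so that rate = 1 / MTTFd.
    """
    return FIT_HOURS / (mttfd_years * HOURS_PER_YEAR)


def fit_to_mttfd_years(fit):
    """Convert a dangerous failure rate in FIT back to an MTTFd in years."""
    return FIT_HOURS / (fit * HOURS_PER_YEAR)


# The ISO 13849 cap of 100 years, truncated to a whole number of FIT.
print(int(mttfd_years_to_fit(100)))
```

The same helper can be used to put the MTTFd values from the two design solutions into FIT when comparing them with data from sources that quote failure rates rather than MTTFd.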

Therefore, the reliability predictions should already be conservative, and I don’t think it is good practice for people to heap safety margin on top of safety margin on top of safety margin because they feel uncomfortable with what is in a standard.
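To get a feel for how much margin a confidence level alone can add, here is a minimal sketch using the textbook one-sided upper bound on a constant failure rate when zero failures have been observed in T device hours, which is -ln(1 - CL) / T. This is only an illustration under that zero-failure assumption, not the method of any particular standard or data source, and the function names are hypothetical. With more observed failures the gap between confidence levels shrinks.

```python
import math


def zero_failure_upper_bound(confidence, device_hours):
    """Upper bound on a constant failure rate (per hour) at the given
    one-sided confidence level, assuming zero failures were observed
    in the given number of device hours."""
    return -math.log(1.0 - confidence) / device_hours


def margin_vs_median(confidence):
    """How many times higher the bound is than the 50% confidence value
    (zero-failure case only; device hours cancel out)."""
    return math.log(1.0 - confidence) / math.log(0.5)


for cl in (0.5, 0.7, 0.9, 0.99):
    print(f"{cl:.0%} confidence: x{margin_vs_median(cl):.2f} the 50% value")
```

Under these assumptions, moving from 50% to 70% confidence already inflates the quoted failure rate by well over one and a half times, and the higher levels inflate it considerably more, before any further margin is added on top.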

There is an excellent discussion on “confidence limits on prediction” in the book “Reliability, Maintainability and Risk”, where the author compares predictions using site-specific data, predictions using industry-specific data, and predictions using generic data.

Perhaps a better argument in favor of two-channel safety is the protection it can give against systematic failure modes, which are not typically modeled at all in reliability calculations. However, even then the evidence is mixed; see, for instance, this blog.

Other relevant blogs in this series include:

  • A blog on reliability predictions for integrated circuits, see here
  • A blog on how to change the confidence level of reliability predictions, see here
  • A discussion on single fault tolerance requirements in the IEC 61496 series, see here
  • A discussion on the maths behind the 1oo2 architecture, see here
  • A blog on Markov modeling which is probably the most accurate way to model these architectures, see here and here
  • A blog with a similar title but which doesn’t concentrate on the reliability confidence aspect, see here

For the full set of over 80 blogs in this series, see the Analog Devices EngineerZone platform here.