
When Two Isn't Better Than One: Understanding Redundant Safety Systems

The PFH for a 1oo2 system – A Deep Dive 

Compliance with the functional safety standard IEC 61508 has three key requirements:

  • Have a sufficiently low dangerous failure rate, expressed as a Probability of Dangerous Failure per Hour (PFH) or a Probability of Dangerous Failure on Demand (PFD)
  • Be free of design errors (systematic capability)
  • Meet minimum Safe Failure Fraction (SFF) and Hardware Fault Tolerance (HFT) requirements for the required Safety Integrity Level (SIL), which prevents over-reliance on reliability claims

When calculating the PFH for a redundant 1oo2 (1 out of 2) system, the equation from IEC 61508-6 is shown below. In this blog, I will analyze this equation and try to extract some insights.


Figure 1: PFH for 1oo2 architecture from IEC 61508-6:2010 B.3.3.2.2

In the above:

β is the fraction of dangerous undetected failures that have a common cause

βD is the fraction of dangerous detected failures that have a common cause. I normally set this equal to β for ease of calculation and equation manipulation

T1 is the mission time, e.g., 20 years. It is the window during which an undetected failure can occur and lie dormant, so that if a demand occurs, bad things can happen. On average, a failure, if it does occur, will occur after T1/2. I often set a year to 10000 hours instead of 8760 for simplicity. T1 can also be the proof test interval, but I will assume we don’t want to proof test.

MRT is the mean repair time (how long it takes to get the channel repaired and back up and running once the failure has been revealed) and might be 1 day, but I generally set it to zero for calculation ease since it is so much less than T1 (20*8760/2+24 is approximately 87600)

MTTR is the mean time to restoration (the time to detect the failure plus the time to implement the repair), e.g., 1 hour, but I normally set it to zero for calculation ease since it is so much less than T1

λDU = dangerous undetected failure rate – these are the bad failures, as the channel is down and you don’t know it until you do a proof test. An actual demand might expose it, but in two-channel systems these failures are generally hidden as the other channel is still functional.

λDD = dangerous detected failure rate – these are almost like safe failures, as the system will take you to a safe state. However, they will hit your availability.


Figure 2: PFH equation repeated for convenience

Looking at this equation, the last term βλDU represents the minimum failure rate we can achieve. The rest of the equation represents the contribution of an accumulation of failures. By accumulation of failures, I mean undetected failures that appear in one channel and have no impact until the other channel also fails, not through a common cause failure, but rather randomly with either the same or a different failure mode.
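The split between the two contributions can be sketched in a few lines of Python. The equation coded here is my reading of the 1oo2 PFH formula in IEC 61508-6:2010 B.3.3.2.2, reconstructed from the symbol definitions above rather than copied from the figure, so please check it against the standard before relying on it. The function name `pfh_1oo2` and the example failure rates are illustrative assumptions, not values from this blog.

```python
def pfh_1oo2(lam_du, lam_dd, beta, beta_d, t1_h, mrt_h=0.0, mttr_h=0.0):
    """Return (accumulation_term, common_cause_term) in failures per hour.

    Assumed form (my reading of IEC 61508-6:2010 B.3.3.2.2):
      PFH = 2*((1-beta_d)*lam_dd + (1-beta)*lam_du)*(1-beta)*lam_du*t_ce
            + beta*lam_du
      t_ce = (lam_du/lam_d)*(T1/2 + MRT) + (lam_dd/lam_d)*MTTR
    """
    lam_d = lam_du + lam_dd
    if lam_d == 0:
        return 0.0, 0.0  # no dangerous failures at all
    # Channel equivalent mean down time
    t_ce = (lam_du / lam_d) * (t1_h / 2 + mrt_h) + (lam_dd / lam_d) * mttr_h
    accumulation = (
        2 * ((1 - beta_d) * lam_dd + (1 - beta) * lam_du)
        * (1 - beta) * lam_du * t_ce
    )
    common_cause = beta * lam_du  # the floor that redundancy cannot remove
    return accumulation, common_cause


# Hypothetical channel: 1000 FIT dangerous failures, 90 % of them detected.
acc, ccf = pfh_1oo2(lam_du=1e-7, lam_dd=9e-7, beta=0.02, beta_d=0.02,
                    t1_h=200_000)
print(f"accumulation = {acc:.3e}/h, common cause = {ccf:.3e}/h, "
      f"total = {acc + ccf:.3e}/h")
```

Returning the two terms separately makes it easy to see which one dominates for a given set of inputs.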

It is worth noting that there is nothing in the equation for the following:

  • Diagnostic contribution – this does need to be considered (see IEC 61508 Part 2, Annex D) and will be more clearly specified in the new IEC 61508 version 3
  • Systematic contribution – generally considered unquantifiable
  • Demand contribution

I hope to cover these in a future blog.

To extract further insights from the equation, it is necessary to make some assumptions and approximations:

  • A year is 10000 hours
  • Proof test interval T1 = Mission time = 20 years
  • β=βD
  • MRT=MTTR=0

And then we can reorder the equation to give a simplified equation:


Figure 3: Equation simplified as above
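As a rough cross-check, here is what the simplification looks like in Python. This is my own derivation from the standard's 1oo2 equation as I understand it, so the layout may differ from the figure: with β=βD and MRT=MTTR=0, the λDD terms cancel and what remains is (1-β)²·λDU²·T1 + β·λDU. The function name is hypothetical.

```python
HOURS_PER_YEAR = 10_000          # rounded year, as assumed in the list above
T1_H = 20 * HOURS_PER_YEAR       # proof test interval = mission time = 20 years


def pfh_1oo2_simplified(lam_du, beta, t1_h=T1_H):
    """Simplified 1oo2 PFH (per hour), assuming beta == beta_D and MRT = MTTR = 0."""
    accumulation = (1 - beta) ** 2 * lam_du ** 2 * t1_h  # both channels fail independently
    common_cause = beta * lam_du                         # both channels fail together
    return accumulation + common_cause


print(f"{pfh_1oo2_simplified(lam_du=1e-7, beta=0.02):.3e} per hour")
```

Note that only λDU, β and T1 survive the simplification, which is why the rest of the discussion can be framed in terms of those three quantities.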

Some obvious questions then are:

  • Does redundancy add much value once the accumulation of undetected failures is added?
  • Would proof testing help?
  • Does it make more sense to concentrate on adding diagnostics or redundancy?

First, let’s rearrange the equation:


Figure 4: Rearranged equation to highlight the common cause and accumulation of failures section

Noting that (1-DC)λD = λDU, let’s look at our 3 questions above:

Does Redundancy Provide Significant Value When Considering an Accumulation of Failures?

Logically, at time 0, failure is less likely except for common causes, which don’t care how many channels you have. Over time, however, one or the other channel will fail, and the system failure rate will approach that of the remaining channel. Since both channels have to fail before the safety function is lost, this takes longer than the failure of a single channel. But let’s concentrate on what the equation states.

With β=0.02 (the assumed value from ISO 13849-1) and the assumptions above (T1 = 20 years, MRT=MTTR=0), the accumulation of independent failures dominates if λDU>1e-7/h. What does this mean in practice? I created the table below, with very high and very low failure rates, to illustrate the issue.


Figure 5: Table of example calculations

With very high dangerous undetected failure rates, this architecture gives you a 100x improvement in PFH for a 10x improvement in channel reliability. At very low failure rates, the improvement tails off to eventually match the improvement in reliability, i.e., behaving as if you only had one channel. The CCF contribution begins to dominate at λDU=1e-7/h and below.
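The trend can be reproduced with a short sweep. This is an illustrative sketch, not the table from Figure 5: it assumes the simplified form (1-β)²·λDU²·T1 + β·λDU with β=2% and T1=200000 hours, and the chosen λDU values are my own. At the highest rate, λDU·T1 is no longer small, so the equation is being used outside the range where its approximations hold; that row is kept only to show the trend.

```python
HOURS_PER_YEAR = 10_000
T1_H = 20 * HOURS_PER_YEAR   # 20-year mission time, no proof test
BETA = 0.02


def pfh_1oo2_simplified(lam_du, beta=BETA, t1_h=T1_H):
    """Simplified 1oo2 PFH (per hour), assuming beta == beta_D and MRT = MTTR = 0."""
    return (1 - beta) ** 2 * lam_du ** 2 * t1_h + beta * lam_du


rates = [1e-5, 1e-6, 1e-7, 1e-8, 1e-9]   # each step is 10x more reliable
rows = [(lam, pfh_1oo2_simplified(lam)) for lam in rates]

print(f"{'lambda_DU (/h)':>15} {'PFH 1oo2 (/h)':>15} {'gain per 10x':>13}")
previous = None
for lam, pfh in rows:
    # How much the PFH improved relative to the previous (10x worse) row
    gain = "" if previous is None else f"{previous / pfh:.1f}x"
    print(f"{lam:>15.0e} {pfh:>15.3e} {gain:>13}")
    previous = pfh
```

The last column starts close to 100x and falls toward 10x as the common cause term takes over.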

If β increases to 10%, then the CCF dominates when λDU = 6.2e-7/h and below.
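Both crossover figures can be checked by setting the two terms of the simplified equation equal. This sketch assumes the simplified form (1-β)²·λDU²·T1 + β·λDU, which gives a crossover at λDU = β / ((1-β)²·T1); the function name is hypothetical.

```python
T1_H = 20 * 10_000   # 20 years at 10000 hours per year


def crossover_lam_du(beta, t1_h=T1_H):
    """lambda_DU (per hour) at which the accumulation and CCF terms are equal."""
    return beta / ((1 - beta) ** 2 * t1_h)


for beta in (0.02, 0.10):
    print(f"beta = {beta:.0%}: crossover at lambda_DU = {crossover_lam_du(beta):.2e}/h")
```

For β=2% this lands close to 1e-7/h, and for β=10% close to 6.2e-7/h, in line with the figures quoted above.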

To answer the question for β = 2%, a second channel delivers its full benefit when the dangerous undetected failure rate is 1e-7/h or better. Worse than that and the accumulation of independent failures has sufficient time (20 years for this example) to become an issue.

Would Proof Testing Help?

The common cause term λDUβ is independent of the proof test interval. So, no matter how often you proof test, you are left with λDUβ. That’s the best you can achieve. Proof testing is of value when the accumulation of failures is dominating, which is when the dangerous undetected failure rate is high.
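A quick way to see this is to hold λDU and β fixed and shorten T1. The sketch below assumes the simplified form of the equation (β=βD, MRT=MTTR=0); the failure rate of 1e-6/h and the list of intervals are illustrative choices of mine.

```python
HOURS_PER_YEAR = 10_000
LAM_DU = 1e-6   # hypothetical, deliberately high so accumulation dominates
BETA = 0.02


def pfh_terms(lam_du, beta, t1_h):
    """Return (accumulation, common_cause) contributions in failures per hour."""
    accumulation = (1 - beta) ** 2 * lam_du ** 2 * t1_h
    common_cause = beta * lam_du
    return accumulation, common_cause


results = {}
for years in (20, 10, 5, 1):
    acc, ccf = pfh_terms(LAM_DU, BETA, years * HOURS_PER_YEAR)
    results[years] = (acc, ccf)
    print(f"T1 = {years:>2} y: accumulation = {acc:.2e}/h, "
          f"CCF floor = {ccf:.2e}/h, total = {acc + ccf:.2e}/h")
```

The accumulation term shrinks in proportion to T1, while the CCF floor stays where it is.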

Does it Make More Sense to Increase DC Instead of HFT?

The crossover point is when λDU=100 FIT (1e-7/h). Increasing SFF will reduce that to 50 FIT, 10 FIT, and so on.
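To put numbers on this, the relationship (1-DC)λD = λDU noted earlier can be combined with the simplified equation. In the sketch below, the channel failure rate of 1000 FIT is a hypothetical value chosen so that DC values of 90%, 95% and 99% give 100, 50 and 10 FIT; β=2% and T1=200000 hours are the same assumptions as before.

```python
FIT = 1e-9           # 1 FIT = 1 failure per 1e9 hours
T1_H = 20 * 10_000   # 20 years at 10000 hours per year
BETA = 0.02
LAM_D = 1000 * FIT   # hypothetical dangerous failure rate of one channel


def lam_du_from_dc(lam_d, dc):
    """Dangerous undetected failure rate left over after diagnostics."""
    return (1 - dc) * lam_d


for dc in (0.90, 0.95, 0.99):
    lam_du = lam_du_from_dc(LAM_D, dc)
    accumulation = (1 - BETA) ** 2 * lam_du ** 2 * T1_H
    common_cause = BETA * lam_du
    dominant = "accumulation" if accumulation > common_cause else "common cause"
    print(f"DC = {dc:.0%}: lambda_DU = {lam_du / FIT:.0f} FIT, "
          f"PFH = {accumulation + common_cause:.2e}/h ({dominant} dominates)")
```

Under these assumptions, each increase in DC moves the channel further below the crossover, where the common cause term is what limits the result.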

Warning 1 – In this blog, I took the equations at face value. For a discussion on the accuracy of these equations, see the new ISO 13849-3 draft and ISO/TR 12489.

Warning 2 – The conclusions made here hold only under the assumptions made.

Assuming you got to the end of this blog, I invite you to check back on the second Tuesday of next month for the next blog in this series. Until then, I hope to post “mini blogs” on the other Tuesdays in the month directly from my LinkedIn account. Please follow me on LinkedIn if interested.

Related Blogs

1.  The 5 Advantages of Hardware Fault Tolerance

2. Is Mandating a Category or HFT Best Practice


  • Some thoughts on your post. 

    • Diagnostic contribution – but do need to be considered, see part 2 annex d, and will be more clearly specified in the new IEC 61508 version 3
    • Systematic contribution – generally considered unquantifiable

    Without (self) diagnostics in the system, the DC will be set to 0 and the beta factor will be at its maximum, e.g. 10%, so in my opinion this does amount to a diagnostic contribution. The beta factor term is the one that ties the two independent channels together, so that the system is not calculated to be better than it is.

    As far as I know, the beta factor is also meant to cover, up to a certain amount, the systematic errors that are introduced even though certain process measures have been taken (Table D.1, Part 6 IEC61508: "Were common cause failures considered in the design review with the results fed back into the design?").