
I Want It All - High Safety and Availability – Part 2

This is part 2 of a blog covering the 1oo2D architecture. The first part of the blog is available here. It described 1oo2D and its implications for things like the diagnostic test interval, and compared its PFH/PFD to that of 1oo2.

This part 2 blog will cover the equations behind 1oo2D as found in IEC 61508-6:2010. I have tried to explain them in a readable manner so that you have a good feeling about the architecture.

For a quick refresh, 1oo2D is often drawn as below. It’s an architecture that gives high safety and high availability without the expense of 2oo3. It starts as 2oo2 and then, when an identifiable fault occurs in either channel, shuts down just that channel to degrade to a 1oo1 architecture rather than shutting everything down. This is especially valuable for the process industries, where the cost of a shutdown far exceeds the cost of the redundancy.

Note – to avoid the questions I got after posting the last blog, by high safety I mean higher safety than a 1oo1 architecture and by high availability, I mean higher availability than a 1oo1 architecture.

Figure 1 Tom's interpretation of 1oo2D

IEC 61508-6:2010 gives the equations for 1oo2D in high demand mode and they are shown below.

Figure 2 - Equation from IEC part 6 for a 1oo2D architecture

The first of the 3 equations looks the most harmless but contains some important assumptions that may or may not be valid for your specific use case. You can’t use the equations unless the assumptions made match your use case.

In calculating λSD (safe detected failure rate), it uses λ/2 in the formula, which assumes that 50% of the failures are safe and 50% are dangerous. I have blogged previously on this assumption, see here. It also assumes that the same DC (diagnostic coverage) applies to safe as well as dangerous failures, which is probably true in general, but that’s just a feeling on my part.
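As a rough illustration of those two assumptions, the sketch below splits a single channel failure rate into its four parts. The function name and the example values (λ = 1e-6 per hour, DC = 90%) are hypothetical and chosen only for illustration; they are not taken from the standard.

```python
def split_failure_rates(lam, dc):
    """Split a channel's total failure rate lam (per hour) into four parts.

    Assumes a 50/50 split between safe and dangerous failures and the
    same diagnostic coverage dc for both, as described in the text above.
    """
    lam_safe = lam / 2
    lam_dangerous = lam / 2
    return {
        "SD": lam_safe * dc,             # safe detected
        "SU": lam_safe * (1 - dc),       # safe undetected
        "DD": lam_dangerous * dc,        # dangerous detected
        "DU": lam_dangerous * (1 - dc),  # dangerous undetected
    }


# Hypothetical example values, for illustration only.
rates = split_failure_rates(lam=1e-6, dc=0.9)
for name, value in rates.items():
    print(f"lambda_{name} = {value:.2e} per hour")
```

If your own FMEDA gives a safe/dangerous split other than 50/50, or different coverage for safe and dangerous failures, the λSD from the first equation will not match your device.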

The middle of the three equations calculates the “channel equivalent mean downtime”. To get a better idea of what this means, let’s assume the mission time/lifetime of the safety system T1 = 20 years, MRT (mean repair time) = MTTR (mean time to restoration) = 8 hours, and an SFF of 90% (so that λDU is 10% of the failures), giving tCE’ of roughly 1 year.

Put another way, if a failure occurs during the 20-year lifetime, on average it will occur 10 years into the lifetime, and since 90% of the time it will be detected, the average undetected failed time is 10% of 10 years = 1 year. So, on average, a channel can be expected to be down for 1 year without anybody knowing; everybody will think they are safe, but they may not be. One year out of twenty being unsafe would be bad, but tCE’ is only one part of the third equation, where the redundancy inherent in 1oo2D is used to improve the outcome.
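The back-of-envelope reasoning above can be checked with a few lines of arithmetic. This is a minimal sketch of the weighting described in the paragraph (undetected failures sit for half the lifetime plus the repair time, detected failures only for the restoration time), not a transcription of the exact expression in Figure 2, which may normalise the terms differently. The function name is hypothetical.

```python
HOURS_PER_YEAR = 8760


def mean_downtime_hours(undetected_fraction, t1_hours, mrt_hours, mttr_hours):
    """Weighted average downtime of one channel, in hours.

    Undetected failures are assumed to stay hidden for T1/2 plus the repair
    time; detected failures are assumed to be restored within MTTR.
    """
    undetected = undetected_fraction * (t1_hours / 2 + mrt_hours)
    detected = (1 - undetected_fraction) * mttr_hours
    return undetected + detected


# Values from the example in the text: T1 = 20 years, MRT = MTTR = 8 hours,
# 10% of failures undetected.
t_ce = mean_downtime_hours(0.10, 20 * HOURS_PER_YEAR, 8, 8)
print(f"{t_ce:.0f} hours, or about {t_ce / HOURS_PER_YEAR:.2f} years")
```

The 8-hour repair and restoration times barely move the result; the undetected fraction and the lifetime T1 dominate.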

The most important and most intimidating equation is the third one, which gives the PFH (average frequency of a dangerous failure per hour). Luckily, the equation can be broken into three parts and we can look at each part in turn.

Figure 3 - Looking at the 1oo2D equation in more detail

Starting at the end, the third part represents common cause failures. If β = 0, the third part goes to zero. Typically, however, you will have a β (the fraction/percentage of failures that cause both redundant parts to fail at the same time) in the range of 1% to 10%.

The second part of the equation represents voting failures. K = 1 represents a perfect voting circuit, in which case the second part of the equation goes to zero. Voting for 1oo2D means the switchover from a 2oo2 architecture to a 1oo1D architecture when failures are discovered in one of the redundant portions. With K = 0.98, 4% of the dangerous detected failure rate of a channel (2% for each of the two channels) is added to the PFH value. This means that if everything else stays the same but you improve the diagnostics, which increases λDD (dangerous detected failure rate), this second term increases, making the PFH worse, which is counterintuitive. It’s a downside of the voting circuit. As explained in part 1, you might, however, be able to implement your voting in software in a PLC.
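To see the counterintuitive effect in numbers, here is a small sketch of the voting term. The form 2 × (1 − K) × λDD is inferred from the 4% figure quoted above for K = 0.98, and the dangerous failure rate of 5e-7 per hour is a hypothetical value chosen only for illustration.

```python
def voting_term(lam_dd, k):
    """Contribution of imperfect voting to the PFH (per hour).

    Form inferred from the text: two channels, each contributing
    (1 - K) of its dangerous detected failure rate.
    """
    return 2 * (1 - k) * lam_dd


LAM_D = 5e-7  # hypothetical dangerous failure rate of one channel, per hour

# Better diagnostics move failures from undetected to detected,
# which raises lambda_DD and therefore raises this term.
for dc in (0.60, 0.90, 0.99):
    lam_dd = LAM_D * dc
    print(f"DC = {dc:.0%}: voting term = {voting_term(lam_dd, k=0.98):.2e} per hour")
```

Only this one term grows with better diagnostics; the terms driven by λDU shrink at the same time, so the overall PFH has to be judged across all three parts.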

The first part of the equation is a lot more difficult, or is it? It represents the probability of an undetected dangerous failure in one channel combined with an independent failure of the second channel.

My effort to show this is as follows. The (1-β) is there to quantify the fraction of failures which are not common cause. To simplify, let’s put β=0 and the equation becomes 2λDUλtCE’. This shows you that for bad things to happen you need a dangerous undetected failure of one channel (which on average exists for tCE’) and any failure of the second channel. This situation is dangerous because, while you know one channel has failed, you believe you can still rely on the other channel, which has in fact failed dangerously and undetected. The factor of 2 is because there are two ways this can happen: channel 1 could fail first and then channel 2, or channel 2 could fail first and then channel 1.
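Plugging numbers into the simplified β = 0 form gives a feel for the size of this term. This sketch covers only the simplified expression 2 × λDU × λ × tCE’ from the paragraph above, not the full first term of the standard. The failure rates are hypothetical, with λDU set to 10% of λ and tCE’ set to one year to stay in line with the earlier example.

```python
HOURS_PER_YEAR = 8760


def independent_term_simplified(lam_du, lam, t_ce_hours):
    """Simplified first term of the PFH with beta = 0 (per hour).

    One channel has a dangerous undetected failure (present for t_ce_hours
    on average) and the other channel then fails in any way. The factor of 2
    covers the two possible orders of failure.
    """
    return 2 * lam_du * lam * t_ce_hours


# Hypothetical example values, for illustration only.
lam = 1e-6          # total failure rate of one channel, per hour
lam_du = 0.1 * lam  # dangerous undetected part, 10% of lam
pfh_part1 = independent_term_simplified(lam_du, lam, 1 * HOURS_PER_YEAR)
print(f"first term = {pfh_part1:.3e} per hour")
```

Because two small failure rates are multiplied together, this term is usually far smaller than the rate of a single channel, which is where the benefit of the redundancy shows up.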

Equations aren’t the most interesting part of functional safety, but it's good to understand the quantifiable aspects of safety. You need to understand the equations rather than using them blindly. Hopefully, this blog will help get you started on your journey.

There will hopefully be a part 3 blog covering the availability calculations for 1oo2D but first I think I need a blog on what is meant by availability. See you next month for more.

To learn more see

For the full set of blogs in this series see here