When time permits I like to write these blogs well in advance and in batches of at least three or four. In this batch I plan to discuss CCF and related topics because I recently had to give myself a refresher on the topic as part of discussing IEC 61508-2:2010 Annex E (which deals with on-chip redundancy). The three topics I have decided to cover in this batch are CCF, Markov modelling and a discussion of why I think mandating architectures is bad.
If the hardware reliability is not sufficient, using two channels in parallel is a common way to improve it. Some standards such as ISO 13849 require (not exactly, but close enough) two channels. The limiting factor for the reliability of two-channel systems is CCF (common cause failures). In this blog I will summarize my thoughts on CCF and how to quantify the contribution of common cause failures to the dangerous failure rate.
Looking across the various standards there is a general consensus of sorts on what CCF means, but there are also differences.
Above is the definition from IEC 61508-4:2010 and below is the definition from ISO 13849-1:2015
The ISO 13849 definition states that the failures must result from a single event, but the IEC 61508 definition allows multiple events. IEC 61508 also says the failures must be concurrent, but what does that mean? The ISO 13849 definition appears to exclude cascading failures. Neither definition says anything about common cause failures being limited to just dangerous failures. In general the definitions could do with some tightening up. While the dictionary definition of concurrent might say “at the same time”, in practice this may mean within the fault tolerant time or within some small fraction of the interval between demands.
With regard to a single event, I think the quantification must include cascading failures or else the arithmetic won’t add up. Below is a Markov model to explain why. The path from the state where channel A has failed to the state where both channels A and B have failed is modeled with a failure rate of λ, but this is not valid if the failure rate of B changes depending on whether A has already failed. Perhaps, therefore, a dependent failure analysis captures the intent better than a common cause analysis.
Figure 1 - showing CCF on a Markov model for a system with identical redundancy
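The point about the second transition can be sketched numerically. What follows is a minimal illustration of my own, not a model taken from any standard: two identical channels, no repair and no diagnostics, with the "A failed" and "B failed" states lumped together. The `stress` multiplier is a hypothetical parameter standing in for a cascading effect on the surviving channel, and the plain Euler integration is only there to keep the example free of dependencies.

```python
def p_both_failed(lam, beta, stress=1.0, hours=8760.0, steps=20_000):
    """Probability that both channels have failed after `hours`.

    lam    : failure rate of each (identical) channel, per hour
    beta   : common cause fraction of lam (0..1)
    stress : hypothetical multiplier on the survivor's failure rate once
             the other channel has failed (1.0 means no cascading effect)
    """
    dt = hours / steps
    independent = (1.0 - beta) * lam   # independent failure rate of one channel
    ccf = beta * lam                   # rate straight from "OK" to "both failed"
    second = stress * lam              # rate from "one failed" to "both failed"

    # p_one lumps "A failed" and "B failed" because the channels are identical.
    p_ok, p_one, p_both = 1.0, 0.0, 0.0
    for _ in range(steps):
        ok_to_one = 2.0 * independent * p_ok * dt
        ok_to_both = ccf * p_ok * dt
        one_to_both = second * p_one * dt
        p_ok -= ok_to_one + ok_to_both
        p_one += ok_to_one - one_to_both
        p_both += ok_to_both + one_to_both
    return p_both


# Hypothetical numbers: 1e-6 failures per hour per channel, beta = 2 %.
print(p_both_failed(1e-6, 0.02))               # second failure at rate lam
print(p_both_failed(1e-6, 0.02, stress=10.0))  # survivor degraded by the first failure
```

If the survivor really does fail faster once its partner has gone, the second figure is the honest one, and a model that keeps the second transition at λ understates the result unless the cascading part is counted in β.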
The important thing to bear in mind is that we are trying to describe a situation where you have decided to implement a two channel system to improve the reliability and the limiting factor is common cause failures. Based on safety common sense (SCS) I would be inclined to take a conservative approach and include any failures which would cause both channels to fail within the process safety time regardless of whether they are cascading failures or not.
Note – I just invented the SCS acronym.
Note that neither definition says whether the failures are due to systematic or random failure modes. It is implied that by following a good design process you will have minimized the systematic failure modes, and that the common cause failure rate, including any remaining systematic issues, is somehow correlated to the random failure rate of the components.
There is a lot of engineering judgement built into the various schemes for quantifying the common cause failure rate. ISO 13849, IEC 62061, IEC 61508-2:2010 Annex E (semiconductors) and IEC 61508-6 Annex D all use tables. In most cases it is a “beauty contest”: you get points for features you have implemented either during the design or in operation, and depending on your total you are allocated a β-factor (I will come back to this later) to allow quantification of the PFH (probability of dangerous failure per hour). ISO 13849 is more of a pass/fail test: if you score 65 points or more you can assume a β-factor of 2%. IEC 62061 gives four levels of β-factor from 1% to 10% depending on your score, and IEC 61508 also gives four values, ranging from 0.5% to 10%. In IEC 61508 and IEC 62061 you get a β-factor of 10% even if you do nothing. In all these systems diversity gets you lots of bonus points but is not essential (diversity can sometimes add complexity and so can be a negative).
The IEC 61508-2:2010 Annex E methodology for semiconductors is unusual in that it mandates a long list of minimum features and then has a points system with both positive and negative points. You start by assuming a β-factor of 33%, and if you get below 25% you are allowed to claim an HFT (hardware fault tolerance) of 1. This can be important as some standards require an HFT of 1 (see an upcoming blog for my thoughts on such requirements).
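As a rough sketch of how two of these scoring schemes work mechanically, the snippet below shows an ISO 13849 style pass/fail check and the Annex E style of starting at 33% and adjusting up and down. The measure names and point values are invented purely for illustration; the real ones must come from the tables in the standards.

```python
ISO13849_MIN_SCORE = 65      # pass mark (assumed here to be inclusive)
ISO13849_BETA = 0.02         # beta-factor you may assume once you pass
ANNEX_E_START = 0.33         # starting beta for on-chip redundancy
ANNEX_E_HFT1_LIMIT = 0.25    # must end up below this to claim HFT = 1


def iso13849_ccf_ok(points):
    """Pass/fail: add up the points for the measures actually implemented."""
    return sum(points.values()) >= ISO13849_MIN_SCORE


def annex_e_beta_ic(increases, decreases):
    """Start at 33 % and apply the positive and negative adjustments."""
    return ANNEX_E_START + sum(increases.values()) - sum(decreases.values())


# Hypothetical measures and scores, NOT the values from the standards.
measures = {"measure_a": 15, "measure_b": 20, "measure_c": 25, "measure_d": 5}
print(iso13849_ccf_ok(measures))   # the four scores add up to the pass mark

beta_ic = annex_e_beta_ic(
    increases={"hypothetical_penalty": 0.05},
    decreases={"hypothetical_credit_1": 0.06, "hypothetical_credit_2": 0.09},
)
print(beta_ic < ANNEX_E_HFT1_LIMIT)  # can HFT = 1 be claimed?
```

Keeping the scores in a dictionary keyed by measure makes it easy to show an assessor exactly which features were credited.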
I actually used the table in IEC 61508-6:2010 to calculate a β-factor for a CAT 3 architecture according to ISO 13849 and, reassuringly, with my set of assumptions I got a β-factor of 2% as claimed by ISO 13849. It all helps to give confidence in a system which is largely based on engineering judgement.
So far I have just used the term β-factor, but IEC 61508 alone uses β, βD, βIC and βint. In the course of discussing the quantification of the β-factor below I will attempt to explain what each of these means. An important point to bear in mind is that even though the definition of CCF includes both safe and dangerous failures, functional safety is more concerned with dangerous failures. The use of the SFF (safe failure fraction) metric is one of the few times we safety guys concern ourselves with safe failures.
For the purposes of the math, many of the standards assume identical channels despite there being a lot of points on offer for diversity.
Figure 2 - illustration of the β-factor
The above graphic attempts to show the β-factor for two diverse channels. On the left you see the two channels with no overlap, so the β-factor is 0. On the extreme right you see that all failures of the blue channel are also failures of the yellow channel, so the β-factor is 1. In the middle you see a more typical case where the β-factor is between 0 and 1. The graphic also shows that the actual common cause contribution to the PFH is given by β·min(λDU1, λDU2), as the CCF rate can be no more than the λDU of the channel with the lower failure rate.
Modeled as a reliability block diagram (RBD), the failure rate looks like the figure below.
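The expression for diverse channels is simple enough to write as a one-line function. This is just a sketch; the function name and the example rates are my own and are not from any standard.

```python
def ccf_rate(beta, lambda_du_1, lambda_du_2):
    """Dangerous undetected CCF rate for two (possibly diverse) channels.

    The common part can never exceed the failure rate of the better
    channel, hence the min().
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    return beta * min(lambda_du_1, lambda_du_2)


# Hypothetical numbers: 2e-7 per hour and 5e-8 per hour, beta = 5 %.
print(ccf_rate(0.05, 2e-7, 5e-8))   # about 2.5e-9 per hour
```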
Figure 3 - A reliability block diagram and PFH for a 1oo2 architecture from IEC 61508-6:2010
If the β-factor were not modeled, putting two systems in parallel, each with a failure rate of once per 1000 years, would give a system with a failure rate of once per million years. If the β-factor is modeled, however, the failure rate only improves to somewhere between once per 10 thousand and once per 100 thousand years, a much more realistic improvement.
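The arithmetic behind that range is worth a quick check. The sketch below looks only at the common cause term β·λ and ignores the independent double failures, which is my simplification; the function name is made up for the example.

```python
def ccf_floor_years(channel_rate_per_year, beta):
    """Mean years between failures if only the CCF term (beta * lambda) counted."""
    return 1.0 / (beta * channel_rate_per_year)


channel_rate = 1.0 / 1000.0   # each channel fails once per 1000 years

for beta in (0.10, 0.01):
    years = ccf_floor_years(channel_rate, beta)
    print(f"beta = {beta:.0%}: roughly once per {years:,.0f} years")
```

With a β-factor of 10% the common cause term alone limits the pair to about once per 10 thousand years, and with 1% to about once per 100 thousand years, which is where the range above comes from.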
Common cause failures can also be modeled using fault tree analysis. In fact modeling it using an FTA is a great way to show all the systematic contributions to the β-factor. Having all the modeling methods available in your tool box and picking the right one for the job is a skill that comes with experience.
Figure 4 - Various ways to model CCF – Markov, FTA, RBD
I’m going to cover Markov models in a future blog, and they also have a part to play in quantifying CCF. In fact, from the reliability block diagram above it is hard to see why the PFH expression given for a 1oo2 architecture in IEC 61508 is so complicated, but it is a lot easier to see in the Markov model (at least at an intuitive level).
Figure 5 - Markov model from IEC 61800-5-2:2007 Annex B
In this Markov model you can see that, while the CCF shown in the middle of the model takes you directly from the safe state to the dangerous state, you can also get there via states S2 and S3.
It’s hard to give general guidance on when to use reliability block diagrams versus FTA or Markov modeling, but you will feel better if you have all three in your arsenal.
I promised to explain some of the variants of the β-factor used in the standard.
IEC 61508-6 separates the β-factor for detected and undetected failures. βD is the β-factor for dangerous detected failures, and plain β is the β-factor for the much more important dangerous undetected failures. Many of the calculations in IEC 61508-6 assume that βD = 0.5β, but there is no indication of where this comes from. If it said βD = β it would align with the 50% safe, 50% dangerous approximation, which is generally viewed as conservative, and for a long time this is what I actually thought it meant.
While in theory a common cause failure in a system with more than two channels would impact all channels, IEC 61508 modifies the assumed β-factor to be less conservative. IEC 61508-6 uses βint to mean the β-factor for a basic 1oo2 system. You can then use βint to calculate a β-factor for an arbitrary MooN system.
βIC is the β-factor for an IC according to IEC 61508-2:2010 Annex E. There is no mention of whether βIC is meant to stand for β or βD as given above, so conservatively I assume that it means β (the fraction of dangerous undetected failures). Being conservative here is generally not so bad, as ICs are generally very reliable and so β·λDU will still not be very high.
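The scaling from βint to a MooN β is just a table lookup. The sketch below uses the multipliers as I recall them from Table D.5 of IEC 61508-6:2010 for systems of up to four channels; treat them as an assumption and check them against the standard before relying on them. The function and table names are my own.

```python
# Multipliers on beta_int, as recalled from IEC 61508-6:2010 Table D.5.
# ASSUMPTION: verify these against the standard before any real use.
MOON_MULTIPLIER = {
    (1, 2): 1.0,
    (1, 3): 0.5, (2, 3): 1.5,
    (1, 4): 0.3, (2, 4): 0.6, (3, 4): 1.75,
}


def beta_moon(beta_int, m, n, table=MOON_MULTIPLIER):
    """Scale the 1oo2 beta-factor (beta_int) to an MooN architecture."""
    try:
        return beta_int * table[(m, n)]
    except KeyError:
        raise ValueError(f"no multiplier available for {m}oo{n}") from None


# Hypothetical beta_int of 2 % from the scoring tables.
for m, n in [(1, 2), (1, 3), (2, 3)]:
    print(f"{m}oo{n}: beta = {beta_moon(0.02, m, n):.3f}")
```

Passing the table in as a parameter means you can swap in the values from whichever edition of the standard you are working to.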
Note – soft errors are not considered when calculating a β factor.
Figure 6 - a dependent failure analysis view of common cause failures
Another way to look at common cause failures is shown above. In this view you have common cause failure initiators but a common cause failure then only occurs if there is some coupling mechanism between the channels.
From the above, ways to reduce the dangerous common cause failure rate include:
- Reduce the common cause initiators
- Reduce the common cause coupling
- Reduce the failure rate of at least one of the redundant components
You could literally write a book on common cause failures and possibly someone has. Hopefully you have found the above summary useful even if I have used a few simplifications and ignored a few nuances.
The next blog will concentrate on Markov modelling, which is a very useful technique for modelling common cause failures. I was tempted to do the Markov modelling blog first, but it is something of a chicken-and-egg situation.