Did you know that even the definition of safe and dangerous failures varies from standard to standard? Did you know that a safe failure according to ISO 26262 may not be safe according to IEC 61508?

In this blog, I will discuss the failure definitions, give examples of how to apply them and highlight the problems that arise when doing so.

Firstly, some history. The first version of IEC 61508 was released in 1998 and revised in 2010, and hopefully revision 3 will come out around 2024. In going from IEC 61508:1998 to IEC 61508:2010, the definitions for safe and dangerous failures were changed, and no effect and no part failures were added. Applying the new definitions can be difficult, especially at the component and IC level.

Figure 1 - Change in safe and dangerous failure definitions from IEC 61508:1998 to IEC 61508:2010

With the new definitions, a failure is only safe if it causes the system to trip or makes it more likely to trip. Tripping means that the system goes to its safe state or, more precisely, “a” safe state. The automotive ISO 26262 standard is more closely aligned to the definitions from IEC 61508:1998 than to the 2010 version. I understand that part b) of each definition was added to allow for failures in a redundant portion (HFT>0) of a multi-channel safety system. I will come back later to show how part b) causes problems.

In addition to the above, there are since 2010 other failure types:

  • No effect failures – failure of an element that plays a part in implementing the safety function but has no direct effect on the safety function
  • No part failures – failure of a component that plays no part in implementing the safety function

For continuous mode safety functions something which maintains the safe state is also considered a safe failure. However, it is not clear to me why such a failure doesn’t meet the requirement to be considered as a no effect failure. There is probably some subtlety that I haven’t thought of!

Figure 2 Definition of safe state from IEC 61508-2:2010

No effect failures were added in the 2010 version to prevent the SFF calculation being influenced by counting circuitry that is not relevant to the safety function. The most interesting and debatable phrase in the definition of no effect failures is “no direct effect”, given that the element nevertheless “plays a part in implementing the safety function”. For instance, does diagnostic circuitry play a part in implementing a safety function while, because the safety function would operate fine without the diagnostics, having no direct effect on that safety function? The standard doesn’t say, so you must rely on experience to make the determination.

The no part failure definition is also interesting in that it is the only one of the four failure types which refers to a component as opposed to an element. Is this difference significant or just an error?

Below I display the failures as a Venn diagram, and you could say a no part failure is simply any failure which doesn’t fall into any one of the other three categories. It is also obvious that a failure cannot be in more than one set. To decide if a failure of a component is safe, dangerous, no effect or no part, you need to know the details of the safety function and especially what it is trying to achieve. Therefore, at the IC (integrated circuit) level, for instance, you must have an assumed use case to do any FMEDA (failure modes, effects and diagnostic analysis).

Figure 3 - A Venn diagram of failure types
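One way to make the "exactly one set" point concrete is to write the four categories as a small decision procedure. The sketch below is only my reading of the definitions discussed above: the function name, the three yes/no inputs and the choice to test "dangerous" before "safe" are all assumptions for illustration, and the inputs are judgements an analyst has to make for an assumed use case rather than anything the standard computes for you.

```python
from enum import Enum


class FailureType(Enum):
    SAFE = "safe"
    DANGEROUS = "dangerous"
    NO_EFFECT = "no effect"
    NO_PART = "no part"


def classify(plays_part: bool, hinders_function: bool, promotes_trip: bool) -> FailureType:
    """Assign a failure to exactly one of the four categories.

    plays_part       -- the failed item plays a part in implementing the safety function
    hinders_function -- the failure prevents the safety function from operating, or
                        decreases the probability that it operates correctly when required
    promotes_trip    -- the failure causes a trip, or makes a trip more likely
    """
    if not plays_part:
        return FailureType.NO_PART
    # Assumption: when both could be argued, take the conservative answer first.
    if hinders_function:
        return FailureType.DANGEROUS
    if promotes_trip:
        return FailureType.SAFE
    return FailureType.NO_EFFECT


# Hypothetical judgements, for illustration only.
print(classify(plays_part=True, hinders_function=False, promotes_trip=True).value)
print(classify(plays_part=True, hinders_function=False, promotes_trip=False).value)
print(classify(plays_part=False, hinders_function=False, promotes_trip=False).value)
```

The hard part is of course not the procedure but deciding the three inputs, which is where the examples later in this blog run into trouble.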

To put the issues with the failure definitions in perspective we should first consider why the definitions exist and how they are used. The definitions are important to:

  • Allow calculation of the SFF per IEC 61508-2:2010 7.4.4.2 which is a necessary requirement for route 1H
  • Allow calculation of PFH (average frequency of a dangerous failure per hour) or PFD (probability of dangerous failure on demand)
  • Decide whether you need a design change or need to issue a new erratum because you have had a dangerous systematic failure

SFF (safe failure fraction) and HFT (hardware fault tolerance) represent additional hardware constraints placed on an element even if you already meet the PFH/PFD requirements. In effect it means you cannot rely too much on hardware reliability alone. The requirements for SFF and HFT are given in the table below.

Figure 4 Table 3 of IEC 61508-2:2010

Note table 3 applies to elements and sub-systems only.

For those not aware of what SFF means, it is similar to the single point fault metric from ISO 26262 and more generous than DC (diagnostic coverage) from ISO 13849. It is the ratio of the safe plus dangerous detected failure rates to the total of the safe and dangerous failure rates.
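As a minimal sketch of that ratio, the snippet below computes SFF from three failure rates. The function name and the FIT values are hypothetical and chosen only to make the arithmetic easy to follow. No effect and no part failures are deliberately not parameters, and the second calculation shows why: counting 20 FIT of irrelevant circuitry as "safe" would inflate the result.

```python
def safe_failure_fraction(lambda_s: float, lambda_dd: float, lambda_du: float) -> float:
    """SFF = (safe + dangerous detected) / (safe + dangerous detected + dangerous undetected).

    All rates must be in the same unit (for example FIT).
    """
    total = lambda_s + lambda_dd + lambda_du
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return (lambda_s + lambda_dd) / total


# Hypothetical element: 40 FIT safe, 50 FIT dangerous detected, 10 FIT dangerous undetected.
sff = safe_failure_fraction(lambda_s=40.0, lambda_dd=50.0, lambda_du=10.0)

# Same element, but with 20 FIT of circuitry that is not relevant to the
# safety function wrongly counted as safe failures.
sff_padded = safe_failure_fraction(lambda_s=60.0, lambda_dd=50.0, lambda_du=10.0)

print(f"SFF = {sff:.1%}, padded SFF = {sff_padded:.1%}")
```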

An HFT of X means that X+1 is the minimum number of faults which could cause a loss of the safety function. So HFT = 0 means a single fault can cause a system to fail dangerously. Note that, in contrast to ISO 13849, you don’t consider diagnostics when calculating the HFT; see item a) of IEC 61508-2:2010 7.4.4.1.1.
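To show how SFF and HFT trade off against each other, here is a small lookup sketch. The function name is hypothetical, and the entries are my transcription of the type B table (Table 3 of IEC 61508-2:2010), so please check them against the standard itself before relying on them.

```python
from typing import Optional


def max_sil_type_b(sff: float, hft: int) -> Optional[int]:
    """Maximum SIL claimable for a type B element, or None if not allowed.

    sff is a fraction between 0 and 1; hft must be 0, 1 or 2
    (the only columns in the table).
    """
    if not 0.0 <= sff <= 1.0:
        raise ValueError("sff must be between 0 and 1")
    if hft not in (0, 1, 2):
        raise ValueError("hft must be 0, 1 or 2")
    if sff < 0.60:
        row = (None, 1, 2)
    elif sff < 0.90:
        row = (1, 2, 3)
    elif sff < 0.99:
        row = (2, 3, 4)
    else:
        row = (3, 4, 4)
    return row[hft]


# Hypothetical element with SFF = 95%: adding redundancy raises the claimable SIL.
for hft in (0, 1, 2):
    print(f"SFF 95%, HFT {hft}: max SIL {max_sil_type_b(0.95, hft)}")
```

Reading along a row shows the trade: each extra level of fault tolerance buys roughly one SIL for the same SFF.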

Given that one of the reasons for the safe, dangerous and no effect definitions is so that SFF can be traded against HFT, it seems logical that only single point failures should be considered and that diagnostics should not be, since they are excluded when determining HFT. Therefore, my first assertion is that all failures of a diagnostic are no effect failures of the safety function and don’t contribute to the SFF calculation.

Let’s see some issues with trying to apply the definitions at various levels.

Example 1 – let’s take a toxic gas sensor and say that if the gas concentration is > 10% someone can die. Let’s also suppose the toxic gas sensor has a stated safety accuracy of +/-2%, that the gas is measured with an ADC (analog-to-digital converter), and that one of the least significant bits gets stuck at 0 so that the meter underestimates the gas concentration by 1%. I think it is fair to say that underestimating the toxic gas concentration “decreases the probability that the safety function operates correctly when required”. This makes it a dangerous failure. You could counter-argue that since it is within the +/-2% safety accuracy it is a safe failure, but if you did that you would be forgetting that it must make the system trip or be more likely to trip to be considered safe. Therefore, perhaps only a +1% deviation is safe but a -1% deviation is something else. You could also argue that +/-1% deviations are both no effect because you are still within the rated safety accuracy.
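The stuck bit in this example can be sketched in a few lines. Everything numeric here apart from the 10% trip level is an assumption made for illustration: an ADC whose least significant bit is worth 0.25% gas concentration, so that bit 2 is worth 1%, and a trip on any reading above 10%.

```python
TRIP_THRESHOLD_PCT = 10.0  # from the example: above 10% someone can die
LSB_PCT = 0.25             # assumed: one ADC count = 0.25% concentration
STUCK_BIT = 2              # assumed: bit 2 has a weight of 4 counts = 1%


def adc_code(concentration_pct: float) -> int:
    """Ideal conversion from concentration to ADC counts."""
    return round(concentration_pct / LSB_PCT)


def stuck_at_zero(code: int, bit: int) -> int:
    """Model a data bit that is permanently stuck at 0."""
    return code & ~(1 << bit)


def reading_pct(code: int) -> float:
    return code * LSB_PCT


def trips(measured_pct: float) -> bool:
    return measured_pct > TRIP_THRESHOLD_PCT


def missed_trip(true_pct: float) -> bool:
    """True if a healthy sensor would trip but the faulty one does not."""
    faulty = reading_pct(stuck_at_zero(adc_code(true_pct), STUCK_BIT))
    return trips(true_pct) and not trips(faulty)


for true_pct in (9.0, 10.5, 11.0, 12.0):
    faulty = reading_pct(stuck_at_zero(adc_code(true_pct), STUCK_BIT))
    print(f"true {true_pct:5.2f}%  faulty reading {faulty:5.2f}%  missed trip: {missed_trip(true_pct)}")
```

With these assumed numbers the fault never reads high: the error is either 0% or -1%, and at a true concentration of 11% the faulty reading of 10% no longer trips even though the deviation is inside the +/-2% safety accuracy.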

Example 2 – let’s take an item and its diagnostics, something like a CAT 2 architecture from ISO 13849 (single functional channel with external monitoring). If the diagnostics fail, then future failures of the main channel will be undetected, which “decreases the probability that the safety function operates correctly when required”. This reasoning makes it a dangerous failure. However, you could also argue that the failure of a diagnostic has “no direct effect” on the safety function and such failures of the diagnostics are therefore no effect. This example also highlights an anomaly whereby, for the purposes of calculating SFF, a failure could be no effect but still lead to an increase in the PFH/PFD when a Markov model is used to model the failure of the diagnostics and the failure of the channel.
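The anomaly can be illustrated with a deliberately simplified Markov sketch. The states, the rates and the function name are all assumptions of mine: there is no repair and no proof testing, a detected channel failure is assumed to lead to a trip, and once the diagnostics have failed every dangerous channel failure goes undetected. It is not a compliant PFH calculation, only a way of seeing the direction of the effect.

```python
def average_dangerous_frequency(lambda_d: float, dc: float, lambda_diag: float,
                                hours: float, dt: float = 1.0) -> float:
    """Average frequency (per hour) of reaching the dangerous state over the mission.

    States: OK, DIAG_FAILED, TRIPPED (not tracked) and DANGEROUS.
    lambda_d    -- dangerous failure rate of the channel (per hour)
    dc          -- diagnostic coverage while the diagnostics are healthy (0..1)
    lambda_diag -- failure rate of the diagnostics (per hour)
    """
    p_ok, p_diag_failed, p_dangerous = 1.0, 0.0, 0.0
    for _ in range(int(hours / dt)):
        ok_to_diag_failed = p_ok * lambda_diag * dt
        ok_to_tripped = p_ok * dc * lambda_d * dt            # detected, system trips
        ok_to_dangerous = p_ok * (1.0 - dc) * lambda_d * dt  # undetected
        diag_failed_to_dangerous = p_diag_failed * lambda_d * dt
        p_ok -= ok_to_diag_failed + ok_to_tripped + ok_to_dangerous
        p_diag_failed += ok_to_diag_failed - diag_failed_to_dangerous
        p_dangerous += ok_to_dangerous + diag_failed_to_dangerous
    return p_dangerous / hours


# Hypothetical numbers: 1000 FIT dangerous channel rate, 90% DC, 10 year mission.
MISSION_HOURS = 87_600
perfect_diag = average_dangerous_frequency(1e-6, 0.9, 0.0, MISSION_HOURS)
failing_diag = average_dangerous_frequency(1e-6, 0.9, 5e-7, MISSION_HOURS)
print(f"diagnostics never fail: {perfect_diag:.2e} per hour")
print(f"diagnostics can fail:   {failing_diag:.2e} per hour")
```

Even if the diagnostic failures are booked as no effect for SFF, the second figure comes out higher than the first, which is exactly the anomaly described above.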

Example 3 – let’s take an integrated circuit implementing part of a safety function. Let’s say that the safety function will fail dangerously if the ground pin fails open. A reasonable response might be to add a second ground pin so that both would have to fail open for the safety function to fail dangerously. But now you could argue that I have doubled the dangerous failure rate since either one failing open “decreases the probability that the safety function operates correctly when required”. Opening of a redundant ground pin is very hard to detect so such failures would then be dangerous undetected. This means your SFF is worse despite taking steps to make the application safer. You could claim the ground failing open is a no effect failure since the safety function continues to operate correctly and therefore has “no direct effect”. We face similar issues with redundant decoupling capacitors. If you are happy to calculate your PFH/PFD down to the level of ground pins, you could model the two ground pins as being redundant, and the failure rate due to an open ground pin would go from λ_ground_open to β·λ_ground_open, where β represents common cause failures and a typical value might be 10%.
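The two readings of the ground pin example give very different numbers, as the arithmetic sketch below shows. The 1 FIT open-pin rate is a made-up value, the function names are hypothetical, and the β of 10% is the typical value mentioned above.

```python
def dangerous_rate_literal(lambda_ground_open: float, n_pins: int) -> float:
    """Literal reading of part b): every open ground pin counts as dangerous."""
    return n_pins * lambda_ground_open


def dangerous_rate_redundant(lambda_ground_open: float, beta: float) -> float:
    """Redundant pins modelled with a beta factor: only common cause failures count."""
    return beta * lambda_ground_open


LAMBDA_GROUND_OPEN_FIT = 1.0  # assumed open-pin failure rate, in FIT
BETA = 0.10                   # typical common cause factor from the text

single_pin = dangerous_rate_literal(LAMBDA_GROUND_OPEN_FIT, n_pins=1)
two_pins_literal = dangerous_rate_literal(LAMBDA_GROUND_OPEN_FIT, n_pins=2)
two_pins_redundant = dangerous_rate_redundant(LAMBDA_GROUND_OPEN_FIT, BETA)

print(f"one ground pin:                 {single_pin:.2f} FIT dangerous")
print(f"two pins, literal part b):      {two_pins_literal:.2f} FIT dangerous")
print(f"two pins, modelled as redundant: {two_pins_redundant:.2f} FIT dangerous")
```

So the same design improvement either doubles the dangerous contribution or cuts it by a factor of ten, depending purely on which interpretation you adopt.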

Example 4 – what about a TVS diode added to improve immunity to EMI (electromagnetic interference) by clamping the maximum voltage on an IC pin? You could argue such components have “no direct effect” on the safety function because the safety function would operate correctly in most cases with the component removed. Failing open would just remove the component from the circuit, leaving it more susceptible to EMI, but failing shorted might reduce your power supply voltage to zero and cause the system to trip. Therefore, it might be reasonable to state both failure modes of the component are dangerous. I think the “no direct effect” bit is overridden by the fact that once the components are in the circuit, they do have a direct effect. However, I note the following advice in the latest 2021 draft of ISO 13849-1.

Figure 5 - Extract from clause 6.1.10.1 of ISO 13849-1:2021 draft

I’m fairly certain I submitted a comment on this for a previous draft, but I see it is still there and it suggests that the failures of such components must be no effect.

I like the advice below from our automotive colleagues, bearing in mind that automotive has no concept of “no effect” failures and therefore typically classifies failures that would be “no effect” according to IEC 61508 as safe failures.

Figure 6 - an extract from ISO 26262-10:2018

What I think they are trying to say is illustrated in the figure below. Some deviations are errors as opposed to failures.

Figure 7 - difference between an error and a failure (apologies to the creator of this but I can’t remember its source)

Another problem with the IEC 61508 definitions is that failure itself is defined as shown below. However, safe and dangerous failures include failures which merely increase the probability of failing rather than “termination of the ability”. Does this mean that something can be a dangerous failure but not a failure?

Figure 8 - definition of failure from IEC 61508-4:2010

In the book Reliability Assessment of Safety and Production Systems we find the idea of “non-critical unsafe failures: unsafe failures which just decrease the probability of success of a safety action” and “critical unsafe failures: failures which completely inhibit this safety action”. This seems to be an effort to rationalize some of the issues raised above.

Figure 9 - Figure 36.10 from the book Reliability Assessment of Safety and Production Systems

My thoughts on the matter

  • Only single point failures are relevant when calculating SFF, but multipoint failures contribute to PFH/PFD
  • I don’t think part b) of the definitions is actually required to allow for MooN (redundant) architectures as HFT is a concept applied at the element level to trade off against SFF.
  • Failures of diagnostic blocks are no effect failures since, while they “play a part in implementing the safety function”, they have “no direct effect” on the safety function
  • Part b) of the safe and dangerous definitions should be removed or their intent made clearer

Some of these issues are related to the philosophical question as to whether you should implement the intent of those who wrote the standard or what is written in the standard. In an ideal world we would follow the intent, but there are lots of problems with this: who decides what the authors intended? Perhaps the exact wording is a careful compromise, or it could just not have been thought out thoroughly. Another problem is that compliance with a standard means compliance with all applicable requirements of that standard, but if there are requirements that you in your wisdom consider stupid and wrong, where do you stop? Safety is meant to be done conservatively, which could mean you should also take the least favourable decision in every case. But that might make safety unaffordable and less likely to be used, resulting in a less safe world.

Anonymous