I recently read an article posing the question “Is empathy amoral?” (there’s a link at the bottom of this blog so you won’t get distracted now). I found the idea hard to accept until I read the article. Saying SFF might be bad is a bit like saying that empathy might be bad. How can this be? It started gradually but lately it has accelerated: I think I might be losing my faith in SFF. For those who don’t know, SFF stands for safe failure fraction and is a measure of diagnostic coverage used in the functional safety standard IEC 61508. In this blog I will explore some inconsistencies surrounding SFF.
Inconsistency 1 – Suppose you have an item which is well designed, with expensive, highly reliable components, and it has a dangerous failure rate of 50 FIT. In terms of PFH that makes it good enough for SIL 3. But for SIL 3 you also need to satisfy the SFF metric, so you will need to add diagnostics to get the SFF up to 99%, or add redundancy with each of the redundant items having an SFF of 90%. Alternatively, you could buy cheap, unreliable components with a dangerous failure rate of 5000 FIT and add sufficient diagnostics (DC = 99%) to get the dangerous failure rate down to the same 50 FIT, and you are good for SIL 3. Adding diagnostics or redundancy adds complexity to a system which is already good enough in terms of reliability. Is there any real evidence that the benefit of the diagnostics outweighs the risks added by the increased complexity? Is it really good to rely on diagnostics to make an inherently less reliable system into a more reliable one?
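As a sanity check on the arithmetic above, here is a minimal sketch assuming a simple 1oo1 element where the PFH is approximated by the dangerous undetected failure rate, and taking the SIL 3 high demand/continuous band as 1e-8 to 1e-7 per hour. It simply shows that the inherently reliable design and the unreliable-plus-diagnostics design land on the same residual number.

```python
# Minimal sketch: residual dangerous failure rate for the two designs in
# Inconsistency 1, assuming a simple 1oo1 element where PFH ~= lambda_DU.
# FIT = failures per 1e9 hours.

FIT = 1e-9  # 1 FIT expressed in failures per hour

def lambda_du(dangerous_fit: float, dc: float) -> float:
    """Dangerous undetected failure rate (per hour) after diagnostics."""
    return dangerous_fit * FIT * (1.0 - dc)

# Design A: inherently reliable, no diagnostics (DC = 0)
pfh_a = lambda_du(50.0, 0.0)

# Design B: cheap components (5000 FIT dangerous) with 99% diagnostic coverage
pfh_b = lambda_du(5000.0, 0.99)

print(f"Design A PFH ~ {pfh_a:.1e} /h")   # 5.0e-08 /h
print(f"Design B PFH ~ {pfh_b:.1e} /h")   # 5.0e-08 /h
# Both sit in the SIL 3 band (1e-8 to 1e-7 per hour), yet only design B
# satisfies the SFF requirement without adding redundancy.
```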
I recently ran a poll on LinkedIn asking whether people would prefer an inherently reliable system or one which required diagnostics to achieve the same reliability. After 58 votes the result had split evenly, 50/50. As the organizer of the poll I could see who voted, and the voters included 10 members of the responsible IEC working group; of those, 7 voted for the more inherently reliable system with an SFF of 0.
Figure 1 - results of a recent LinkedIn poll related to this topic
Some of those who chose the unreliable system with diagnostics seemed to assume that the real failure rate, including systematic (design) failures, would be better for the system with diagnostics even if the reliability predictions say they are the same. Perhaps that is an arguable case. One conclusion from the vote is that I am not alone in my doubts about the value of SFF or DC as metrics. I am told that back around 2008 it was a big debating point within the IEC 61508 committee. One of the outcomes of that debate was route 2H, which relies on reliability numbers from the field plus redundancy and has only minimal diagnostic requirements. Unfortunately route 2H is not suitable for new electronic modules, and its redundancy requirements are onerous and unwieldy for things like robot safety systems where space, weight and cost are important.
Inconsistency 2 – Suppose you have an element containing a uC (microcontroller) and nothing else. Suppose that uC has 1 Mbit of RAM and we use a figure of 1000 FIT per megabit to calculate the soft error rate of that RAM; the element will then have a failure rate of at least 1000 FIT. If we add ECC as a diagnostic on the RAM and use spatial separation of logically contiguous bits to prevent multi-bit errors in a single word, we can easily claim a DC (diagnostic coverage – i.e., the fraction of dangerous failures detected) of 99% and an SFF for that block of closer to 99.9%. That might mean that for that element you could ignore all other failures and the element would still meet its SFF metric. It would be bad practice, but you could still claim compliance to the standard.
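To make this concrete, here is a rough sketch of how the element-level SFF works out. The safe/dangerous split of the soft errors and the 50 FIT of undiagnosed “other logic” are assumptions of mine purely for illustration; the SFF formula itself (1 − λ_DU/λ_total) is the standard one.

```python
# Rough illustration of Inconsistency 2: a RAM block with ECC dominates the
# element failure rate, so the element-level SFF stays high even though the
# rest of the element has no diagnostics at all. The "other logic" rate and
# the safe/dangerous split are illustrative assumptions.

def sff(lam_safe: float, lam_dd: float, lam_du: float) -> float:
    """Safe failure fraction = 1 - lambda_DU / lambda_total (rates in FIT)."""
    total = lam_safe + lam_dd + lam_du
    return 1.0 - lam_du / total

# RAM soft errors: 1000 FIT, assume half end up safe, half dangerous,
# with ECC detecting 99% of the dangerous ones.
ram_safe = 500.0
ram_dd = 500.0 * 0.99
ram_du = 500.0 * 0.01

# Everything else in the uC: assume 50 FIT, all dangerous, no diagnostics.
other_du = 50.0

print(f"RAM block SFF: {sff(ram_safe, ram_dd, ram_du):.1%}")              # ~99.5%
print(f"Element SFF  : {sff(ram_safe, ram_dd, ram_du + other_du):.1%}")   # ~94.8%
# The element comfortably clears a 90% SFF requirement even though 50 FIT
# of dangerous failures go completely undetected.
```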
Inconsistency 3 – Items can fail due to systematic or random failures. The intention is that if you take all the required measures for a given SIL, the systematic errors will be eliminated. Nevertheless, tables A.15 to A.17 of IEC 61508-2 recommend measures to control systematic failures, yet there is no metric similar to SFF for them. Looking back at inconsistency 2, you could view both soft errors and EMI as being due to systematic failure modes with a target of elimination and therefore not subject to the hardware fault tolerance requirements. Perhaps you don’t need a separate metric for systematic failure modes, on the assumption that a lot of the systematic failures will be detected by the diagnostics added to meet the hardware metrics. However, this approach has weaknesses where you use identical redundancy to achieve SIL 3. Each of the redundant items then only needs an SFF of 90%, but the systematic capability is still SIL 3, and redundant items could fail at the same time due to systematic failure modes, defeating the redundancy.
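To see why identical redundancy offers limited protection here, it can help to look at the common cause (beta factor) style of model used in IEC 61508-6. The sketch below is a deliberate simplification: it keeps only a common cause term and one illustrative independent-failure term, and the beta value, per-channel failure rate and exposure time are assumptions of mine, not figures from the standard.

```python
# Simplified common cause view of a 1oo2 arrangement (beta factor style).
# Approximation: the system dangerous failure rate is dominated by the
# common cause term beta * lambda_DU; the independent term (both channels
# failing within the exposure interval) is second order. All numbers below
# are illustrative assumptions.

FIT = 1e-9

lam_du = 500.0 * FIT   # assumed per-channel dangerous undetected rate
beta = 0.05            # assumed common cause factor (5%)
t_exposure = 8760.0    # hours a first undetected failure could sit latent

independent = 2.0 * (1.0 - beta) ** 2 * lam_du ** 2 * t_exposure
common_cause = beta * lam_du

print(f"Independent 1oo2 term ~ {independent:.1e} /h")   # ~4.0e-09 /h
print(f"Common cause term     ~ {common_cause:.1e} /h")  # 2.5e-08 /h
# The common cause term dominates, so anything (systematic or otherwise)
# that can hit both identical channels at once sets the floor on what the
# redundancy can achieve.
```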
I am not alone with my concerns. A paper written while IEC 61508 revision 2 was in progress is shown below along with one of its conclusions.
Figure 2 - an article extract
In the book “Reliability assessment of safety and production systems”, in section 36.2.6, entitled “Safety failure fraction: The false good idea”, the beneficial effects of SFF are described as irrational and as “a practical criterion avoiding to think”. No mincing of words there.
To the best of my knowledge, DO-254, the avionics standard, has no requirement for diagnostics and only has the avionics equivalent of PFH/PFD. However, I’m open to correction on this.
However, the counter argument is that most other domains have some sort of diagnostic metric. For instance, ISO 13849 has the DC metric. I would argue that DC (diagnostic coverage) from ISO 13849 is an even worse metric than SFF because it gives you no credit at all for safe failures and concentrates only on dangerous failures, yet it uses the same 60%, 90% and 99% values as IEC 61508. The rationale given in ISO 13849 for the DC and category (similar to HFT) requirements is that they don’t want to put too much weight on raw reliability. I’m not convinced that this argument holds up, because whether you use a dangerous failure rate of 50 FIT with DC = 0 or 5000 FIT with a DC of 99%, you are still subject to inaccuracies in the reliability prediction.
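The difference between the two metrics is easiest to see with a small example. The failure rate split below is an assumption purely for illustration; the formulas are the usual ones (DC = λ_DD/λ_D, SFF = (λ_S + λ_DD)/λ_total).

```python
# Illustration of why DC (ISO 13849) is a harsher metric than SFF (IEC 61508):
# SFF gives credit for safe failures, DC does not. The failure rate split
# below is an illustrative assumption, not data for any real component.

def dc(lam_dd: float, lam_du: float) -> float:
    """Diagnostic coverage: detected dangerous / total dangerous."""
    return lam_dd / (lam_dd + lam_du)

def sff(lam_safe: float, lam_dd: float, lam_du: float) -> float:
    """Safe failure fraction: (safe + detected dangerous) / total."""
    return (lam_safe + lam_dd) / (lam_safe + lam_dd + lam_du)

# A component where most failures happen to be safe (rates in FIT)
lam_safe, lam_dd, lam_du = 900.0, 60.0, 40.0

print(f"DC  = {dc(lam_dd, lam_du):.0%}")             # 60%
print(f"SFF = {sff(lam_safe, lam_dd, lam_du):.0%}")  # 96%
# The same component only scrapes the lowest DC band in ISO 13849 but would
# comfortably clear a 90% SFF requirement in IEC 61508.
```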
Our friends in automotive have single point fault and latent fault metrics with similar values to those from ISO 13849 and IEC 61508. I have never heard any rationale for the automotive requirements other than that ISO 26262 is the automotive interpretation of IEC 61508, so it copied the IEC 61508 requirement.
The ultimate in diagnostic coverage comes from the machinery guys. The IEC 61496 series, covering laser scanners, 3D TOF cameras, light curtains etc., requires no dangerous failure modes at all, which for a single channel safety system amounts to a DC of 100%. For a redundant system the DC can be less.
Figure 3 - Single fault tolerance requirement from IEC 61496 series
In fact, the IEC 61496 guys effectively double down on their love of diagnostics with a 100% latent fault metric. I interpret this as covering situations such as a first failure in a diagnostic/monitoring block which is itself not detected. Such a failure does not cause a failure to danger until a second failure occurs in the functional channel and the diagnostic isn’t there to detect it and trip the system.
Figure 4 - latent fault metric from IEC 61496 series
The IEC 61800-5-3 (functional safety of encoders) guys define it more explicitly as “ideal fault detection”.
Figure 5 - ideal fault detection from IEC 61800-5-3
It’s all the more strange because a lot of machine safety functions are “only” SIL 2/PL d, where IEC 61508 would allow an SFF of just 90%. For SIL 3/PL e you generally must use a redundant system anyway, because getting a diagnostic metric of 99% is difficult.
To me, requiring a DC of 100% might be suitable when unreliable mechanical components are the only means to get the required reliability, but it is unsuited to highly reliable electronic solutions. However, most machine safety guys are familiar with EN 954 or “control reliable”, which relied heavily on architecture as the first line of defence and used fascinating things called safety relays which could achieve 100% DC.
While I do agree that reliability predictions can vary depending on the source, on average I believe the ones based on sources such as IEC 62380 and SN29500 are conservative. Even where they are wrong for some components, I think that so long as you don’t engage in reliability shopping (choosing a reliability source to get the number you need) it will all work out in the end. I had a good source which analysed various safety components, compared their reliability predictions to the actual reliability seen in the field and found that the predictions were indeed conservative, but I couldn’t remember the source in time for this blog. Either way, concerns about the data are better addressed by having confidence value requirements, which typically run from 70% to 99%.
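For what it’s worth, the usual way to attach a confidence value to a failure rate derived from field data is a chi-squared bound. The sketch below is the standard textbook calculation for a time-terminated observation; the failure count and cumulative hours are made-up numbers purely for illustration.

```python
# Upper one-sided confidence bound on a failure rate from field data
# (time-terminated observation), using the standard chi-squared relation:
#   lambda_upper = chi2.ppf(conf, 2r + 2) / (2 * T)
# The failure count and cumulative hours below are made up for illustration.
from scipy.stats import chi2

def upper_failure_rate(failures: int, hours: float, conf: float) -> float:
    """Upper confidence bound on the failure rate in failures per hour."""
    return chi2.ppf(conf, 2 * failures + 2) / (2.0 * hours)

failures = 2      # observed dangerous failures across the fleet
hours = 2.0e8     # cumulative operating hours across the fleet

for conf in (0.70, 0.90, 0.99):
    lam = upper_failure_rate(failures, hours, conf)
    print(f"{conf:.0%} confidence: lambda <= {lam / 1e-9:.0f} FIT")
# The higher the required confidence, the more pessimistic the rate you
# must design against, which is the effect the 70% to 99% requirements aim for.
```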
I did a related blog back in April 2021 entitled “Diagnostics are they worth the effort”. This present blog is a refinement of my ideas since April. As I said in the first line, I think I might be losing my faith in diagnostics. However, having worked on standards, I do appreciate that it is hard to write simple rules that cover every eventuality. In practice the standards need to be interpreted in light of the state of the art and what the standard authors intended with a given requirement. So even if it is sometimes flawed, perhaps SFF is a good compromise.
I think my next blog will be on a related topic: the definitions behind SFF of safe, dangerous, no effect and no part failures. However, a month is a long time in blogging and who knows what topic will catch my fancy before then.
To learn more:
I promised you a link to an article on the morality of empathy and here it is. I plan to respond only to comments on SFF and not to those on empathy. The link is included only to show that even things with an excellent motive may not always be unequivocally good.