Standards such as the robot safety standard ISO 10218-1:2011 mandate hardware fault tolerance (HFT). While the title of this blog contains a question mark, I for one have strong opinions as to whether mandating such hardware fault tolerance is good practice. I will present the evidence below, which I admit may be contaminated with confirmation bias, but judge for yourself. Don’t get me wrong: if you are using COTS (commercial off-the-shelf) products or similar, then HFT is probably required. But if you are designing a safety system based on semiconductors with a rigorous development process, high diagnostic coverage and high reliability, then I think the arguments insisting on HFT are not strong. While safety is the top priority, making safety unnecessarily expensive does not make the world a safer place. If a safety system is too expensive or physically too large, it will not get used as much as a more reasonably priced one that meets the requirements (see my previous blog on "State of the Art").
As an example of the type of requirement I am talking about, please see below:
Figure 1 - requirement for HFT in ISO 10218-1:2011
The January 2019 draft of ISO 13849-1 also contained minimum HFT requirements if using products developed to IEC 61508. Under this draft, to get PL e you must have an HFT of 1. In practice I imagine not many systems meet SIL 3 with HFT=0, so perhaps this is not a big issue, but that is not the point.
Most of the standards requiring HFT are in the machinery sector, and I believe this dates back to the old EN 954 standard, which relied heavily on architecture to give protection against failures. Newer non-sector-specific standards such as IEC 61508:2010 allow an explicit trade-off between HFT and diagnostic coverage (strictly, safe failure fraction, SFF).
Figure 2 - trade-off between HFT and SFF in IEC 61508-2:2010
For the safety integrity of hardware, IEC 61508-2:2010 offers two routes. Route 1H uses the table above. Route 2H is not based on the above table but rather on reliability data from field feedback.
SIL | Minimum HFT requirement from IEC 61508-2:2010 sub-clause 7.4.4.3.1 |
SIL 4 | HFT 2 |
SIL 3 | HFT 1 |
SIL 2 | HFT 1 (high demand), HFT 0 (low demand) |
SIL 1 | HFT 0 |
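For anyone who wants to check a design against these minimums, the table above can be captured as a small lookup. This is only a sketch of the table as reproduced here: the names `MIN_HFT`, `min_hft` and `meets_min_hft` are my own inventions, not anything defined by IEC 61508, and only the "high" and "low" demand modes listed in the table are covered.

```python
# Minimum HFT per SIL and demand mode, transcribed from the table above
# (IEC 61508-2:2010 sub-clause 7.4.4.3.1). All names here are illustrative.
MIN_HFT = {
    (4, "high"): 2, (4, "low"): 2,
    (3, "high"): 1, (3, "low"): 1,
    (2, "high"): 1, (2, "low"): 0,
    (1, "high"): 0, (1, "low"): 0,
}


def min_hft(sil: int, demand_mode: str) -> int:
    """Return the minimum HFT for a SIL (1 to 4) and demand mode ("high" or "low")."""
    try:
        return MIN_HFT[(sil, demand_mode)]
    except KeyError:
        raise ValueError(
            f"unsupported SIL/demand mode: {sil!r}, {demand_mode!r}"
        ) from None


def meets_min_hft(sil: int, demand_mode: str, hft: int) -> bool:
    """True if the claimed HFT is at least the tabulated minimum."""
    return hft >= min_hft(sil, demand_mode)


if __name__ == "__main__":
    # A single-channel (HFT=0) design checked against each SIL in high demand mode.
    for sil in (1, 2, 3, 4):
        print(f"SIL {sil}, high demand, HFT=0 ok: {meets_min_hft(sil, 'high', 0)}")
```

Note that the lookup takes no account of diagnostic coverage or reliability: a single channel fails the SIL 2 high demand check however good its diagnostics are.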
Perhaps the fact that using route 2H requires a minimum HFT but route 1H doesn’t is another clue as to why machinery standards tend to rely on redundancy.
Of course, hardware fault tolerance does nothing for systematic failure modes, which many people believe dominate, especially when reliability is high.
Let’s go back to basics. What is the purpose of a standard? One of the purposes is to facilitate trade by ensuring that all countries agree a common set of requirements. International trade is based on rules set by the World Trade Organization. The ISO/IEC Directives, Part 1, Consolidated ISO Supplement 2017, Annex SM states that to facilitate trade “globally relevant standards should”:
- Not stifle innovation and technological development
- Be performance based as opposed to design prescriptive
I would suggest that not allowing highly reliable electronics with high diagnostic coverage to implement high PL and SIL safety functions, while mandating less reliable systems with fewer diagnostics, is stifling innovation and technological development and not increasing the amount of safety in the world.
Other guidance includes the ISO/IEC Directives, Part 2, where sub-clause 4.2 states “Whenever possible, requirements shall be expressed in terms of performance rather than design or descriptive characteristics. This approach leaves maximum freedom to technical development.” Then when you read ISO 13849-1:2015 you find the statement “for the purposes of this part of ISO 13849, the ability of safety-related parameters to perform a safety function is expressed through the determination of the performance level”, and table 7 of ISO 13849-1:2015 allows up to PL d with a non-redundant category 2 architecture. ISO 13849-1:2015 also states “In standards in accordance with IEC 61508 the ability of safety related control systems to perform a safety function is given through a SIL.” No mention of HFT or CAT, just SIL and PL.
Similarly, ISO/TR 23849 sub-clause 5.3 states: “The level of confidence specified as a PL and/or a SIL is relevant for a specific safety function.” There is no mention of HFT or CAT in the above.
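To illustrate what "performance based" means here, the sketch below maps a PFH value to a PL and to a high demand SIL. It assumes the commonly quoted PFH bands from ISO 13849-1 and IEC 61508-1, which should be checked against the standards before being relied on. The function and constant names are hypothetical, and this covers only the numerical part of the picture, not any architectural constraints.

```python
# PFH bands in dangerous failures per hour: (label, lower inclusive, upper exclusive).
# ASSUMPTION: commonly quoted band limits; verify against the published tables.
PL_BANDS = [
    ("e", 1e-8, 1e-7),
    ("d", 1e-7, 1e-6),
    ("c", 1e-6, 3e-6),
    ("b", 3e-6, 1e-5),
    ("a", 1e-5, 1e-4),
]
SIL_BANDS = [  # high demand / continuous mode
    (4, 1e-9, 1e-8),
    (3, 1e-8, 1e-7),
    (2, 1e-7, 1e-6),
    (1, 1e-6, 1e-5),
]


def classify(pfh, bands):
    """Return the label of the band containing pfh, or None if it is outside all bands."""
    for label, low, high in bands:
        if low <= pfh < high:
            return label
    return None


def pl_from_pfh(pfh):
    """Performance level (letter) for a PFH value, or None."""
    return classify(pfh, PL_BANDS)


def sil_from_pfh(pfh):
    """High demand SIL (integer) for a PFH value, or None."""
    return classify(pfh, SIL_BANDS)


if __name__ == "__main__":
    for pfh in (5e-8, 5e-7, 2e-6):
        print(f"PFH={pfh:.0e}/h -> PL {pl_from_pfh(pfh)}, SIL {sil_from_pfh(pfh)}")
```

The only input to either function is the PFH. Neither takes an HFT or a category as an argument, which is the sense in which PL and SIL are performance measures rather than design prescriptions.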
ISO 13849 does not require redundancy, and even using the simplified methods from figure 5 you can still achieve PL d with a CAT 2 (single channel with test) architecture. Sub-clause 4.5.1 goes further, stating “with some technologies, risk reduction can be achieved by selecting reliable components and by fault exclusions, but with other technologies, risk reduction could require a redundant or monitored system.” Annex G of BGIA report 2/2008e further explains that the five designated architectures were chosen to allow a “comprehensible diagram to be obtained” and that using the designated architectures “dispenses with the need for development of a dedicated mathematical model”, which allows you to use figure 5 or Annex K instead of your own Markov model.
Then there are common cause failures (CCF). Even if you have two channels, you will still have common cause failures. Typical values for β in IEC 61508 are 1%, 2% and 10%. A β of 10% equates to a reduction in the PFH by a factor of 10, which corresponds to one SIL. IEC 61508-2:2010 allows a β of 25% and you can still claim HFT=1. That means up to 25% of the dangerous failures will take down both channels. ISO 13849-1 Annex F contains a scoring method for measures against CCF, and if you score at least 65 you get to claim a β of 2%. The dominant term in your reliability calculations will then look something like β(1-DC)λD, with credit only allowed up to a DC of 90% for a CAT 3 architecture. A single channel with DC=98% would give a similar effect.
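The arithmetic behind that comparison can be sketched as follows. This is a deliberately crude model that looks only at the dominant term, β(1-DC)λD for two channels and (1-DC)λD for a single channel, and ignores everything else a real PFH calculation would include. The function name `residual_fraction` is made up for this illustration.

```python
def residual_fraction(dc, beta=None):
    """Fraction of the dangerous failure rate (lambda_D) left in the dominant term.

    Single channel (beta is None): (1 - DC).
    Two channels with common cause factor beta: beta * (1 - DC).
    Simplified illustration only, not a full PFH calculation.
    """
    if not 0.0 <= dc <= 1.0:
        raise ValueError("dc must be between 0 and 1")
    undetected = 1.0 - dc
    if beta is None:
        return undetected
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    return beta * undetected


if __name__ == "__main__":
    # Single channel with DC = 98%.
    single = residual_fraction(0.98)
    print(f"single channel, DC=98%: {single:.4f} x lambda_D")

    # Two channels with DC capped at 90%, for several beta values.
    for beta in (0.02, 0.10, 0.25):
        dual = residual_fraction(0.90, beta=beta)
        print(f"two channels, DC=90%, beta={beta:.0%}: {dual:.4f} x lambda_D")
```

With β = 25% and DC = 90% the two-channel term works out to 0.025 λD, against 0.02 λD for the single channel with DC = 98%. With a smaller β the two-channel figure improves in proportion, so how close the two come depends heavily on the β you are entitled to claim.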
I feel this blog was a bit of a rant, so my apologies. To be somewhat balanced, I will close with an extract from IEC Guide 104, entitled “The preparation of safety publications and the use of basic safety publications and group safety publications.”
Figure 3 - Extract from IEC guide 104
But just because you have the right to do so doesn’t mean it is right to do so.