Conservatism in Safety

This blog was prompted by some data I saw last year showing that while the number of robots in use is increasing rapidly, the number of people hurt or killed by robots is falling.

One reaction to this news might be to congratulate everyone involved, the robot standards committees, the robot designers, the integrators, and the users of robots, for a job well done. But is that the right view? Could it be argued that robots are too safe, that we are too conservative? If our conservatism makes safety, and the EUC (equipment under control) such as robots, too expensive, then what about all the people still doing the dirty, dangerous, and dull jobs (the 3 Ds) that are too expensive to automate? If being less conservative made it possible to automate those 3D jobs, couldn't it reduce the overall level of harm in society, even if an individual robot were potentially more dangerous?

Of course, the acceptable level of risk is a societal matter (this is on my to-do blog list), generally enforced by laws and guided by standards. Within standards, the fact that we have four SILs and five PLs for industrial safety shows that there is an acceptable level of risk reduction; you don't need to "throw the kitchen sink" at every risk reduction effort. If the goal were risk reduction at any cost, wouldn't everything be SIL 4, PL e? Similarly, some level of safety is required by law, and since I'm not a legal person I will avoid commenting on that. What I can say about the law, however, is that the concept of strict liability is onerous and can make a manufacturer responsible even when they have not been negligent.

Below I will give examples of conservatism in safety and perhaps let you draw your own conclusions.

Example 1 – BER/BEP for functional safety of networking

Most safety functions require a network connection. If that network is a fieldbus then typically the black channel approach (see here) is used. But what failure rate must you assume for the equipment in the body of the network (the routers, gateways, etc.)? For black channel designs the standard IEC 61784-3 is often used, and it mandates a BEP (bit error probability) of 0.01, i.e., that 1 in 100 bits gets corrupted (see IEC 61784-3 9.5.3). This seems very conservative, so where does the value come from? The value covers both bits corrupted by EMI (electromagnetic interference) and the failure rate of the components used to build the network.

One explanation for the value (see the note in table 2 of IEC 61784-3:2021) is that a BEP of 1e-4 "in the presence of continuous electromagnetic interference would lead to a stop of communications in case of cyclic data exchange". In other words, 1e-4 is the worst-case value that would not be noticed as a lack of valid messages being received. Then, on the basis that safety should also consider burst interference on a single message, you must use a BEP 100x higher to be conservative.
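To see why a sustained BEP of 1e-4 would indeed be noticed in cyclic data exchange, a small sketch helps. The frame length of 1000 bits is my own illustrative assumption, not a value from IEC 61784-3:

```python
# Probability that a cyclic frame contains at least one corrupted bit,
# assuming independent bit errors. Frame length is an assumption for
# illustration only.
def p_frame_corrupted(bep: float, frame_bits: int) -> float:
    """Probability at least one bit in a frame is corrupted."""
    return 1.0 - (1.0 - bep) ** frame_bits

FRAME_BITS = 1000  # assumed frame length

for bep in (1e-4, 1e-2):
    p = p_frame_corrupted(bep, FRAME_BITS)
    print(f"BEP {bep:g}: {p:.1%} of frames contain at least one error")
```

With the assumed frame length, even a BEP of 1e-4 corrupts roughly one frame in ten, which a cyclic protocol would quickly flag as a communications failure; at 1e-2 almost no frame survives intact.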

Another argument is made in IEC 61784-3:2021 5.8.4, based on an assumed dangerous failure rate of 100 FIT (justified from ISO 13849-1:2015 table 7) for an average black channel device, and then applying a safety margin of 10000 (yes, a safety margin of 10000), which gets you to a BEP of 1e-3. I think the standard then assumes perhaps 10 of those average devices are used to form the network, giving a BEP of 1e-2 (busy day, I didn't have time to double check the 10x factor).
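The order-of-magnitude arithmetic behind that argument, as I read it, can be sketched as below. Note the mapping from a FIT rate to a per-bit error probability is loose; this only reproduces the multiplication chain:

```python
# Sketch of the IEC 61784-3:2021 5.8.4 style argument, order of
# magnitude only. The 10-device count is an assumption from the text.
FIT = 1e-9                     # 1 FIT = 1 failure per 1e9 device-hours
device_rate = 100 * FIT        # assumed dangerous failure rate: 1e-7 /h
safety_margin = 10_000         # the (yes, really) 10000x margin
devices_in_network = 10        # assumed number of black channel devices

per_device = device_rate * safety_margin   # -> 1e-3
network = per_device * devices_in_network  # -> 1e-2, the mandated BEP

print(f"per device: {per_device:g}, whole network: {network:g}")
```

Whatever you think of treating a FIT rate as a probability, the chain 1e-7 x 10000 x 10 does land exactly on the 0.01 BEP the standard mandates.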

Most networks have defenses to detect corruption, but you are generally not allowed to claim any credit for those defenses since they weren't developed to the rigor of a functional safety standard.

The above is certainly conservative but can you be too safe?

Note – previously the failure rate was expressed as a BER (bit error rate), but BEP is more appropriate when you don't know how much data is transferred in each unit of time. BER and BEP are often used interchangeably even though rates and probabilities are very different things.

Example 2 – Mixing systematic and random hardware failures in SN29500

Now for my second example of conservatism.

Standards such as IEC 61508 treat systematic and random failures differently. Systematic failures are handled by following a rigorous design process to prevent the introduction of errors. This includes using only suitable components, using those components within their specifications, making sure all the documentation is available to the designers, and protecting against overvoltage events (EOS) and general EMI (electromagnetic interference).

Random hardware failures are dealt with using techniques such as diagnostics (SFF or DC) and redundancy (HFT or CAT).

However, when doing our reliability predictions, we typically use sources of failure rates that mix random and systematic failures. For me, this builds in up to a 100% margin of error. Generally speaking, if you design to a given SIL or PL, the measures taken are meant to reduce the systematic failure rate to be commensurate with the random failure rate. Yet SN29500 allows no reduction for the use of such a rigorous development process, which should eliminate most if not all of the systematic failures.
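The "100% margin" point can be made concrete with a hypothetical split. The 50/50 division between random and systematic failures below is my assumption for illustration, not a figure from SN29500:

```python
# If a handbook rate mixes random and systematic failures roughly
# 50/50, and a rigorous process eliminates the systematic ones, the
# prediction overstates the achievable rate by 2x (a 100% margin).
handbook_rate = 200.0        # FIT, mixed rate from a source like SN29500
systematic_fraction = 0.5    # assumed split, not from the standard

random_only = handbook_rate * (1 - systematic_fraction)
overstatement = handbook_rate / random_only

print(f"true random-only rate ~{random_only:g} FIT, "
      f"prediction is {overstatement:g}x conservative")
```

The real split is unknown, which is part of the problem: without it, there is no defensible way to claim credit for the systematic measures your process mandates anyway.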

There is also no allowance for differences between suppliers. The predictions are based on a mix of suppliers, yet some suppliers offer higher quality than others (see here for more).

Example 3 – Use of confidence levels

Standards such as IEC 61508 mandate various confidence levels for your data. A confidence level of 50% should mean your prediction is conservative as often as it is optimistic. However, reading IEC 61508, and depending on the situation, the required confidence level can be 70%, 90%, 95%, or even 99.9%.

A widely used source of reliability predictions is SN29500. However, nobody seems to know the confidence level of the data in SN29500. Some sources suggest it is at the 99% confidence level, which is approximately a 10x safety margin over just wanting to be right on average, as shown in the table below.

Figure 1 - a reliability prediction done at various confidence levels
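One way to get a feel for how the confidence level scales a prediction is the simple special case of zero observed failures over T cumulative device-hours, where the one-sided upper bound on the failure rate follows from the exponential model. The 1e8 device-hours below is an assumed figure; with failures observed, a chi-squared based bound would be used instead:

```python
# Upper bound on failure rate at a given one-sided confidence level,
# for the special case of zero failures in `hours` device-hours
# (exponential model). Illustrative only; T is an assumption.
import math

T = 1e8  # assumed cumulative device-hours with zero failures

def upper_bound_rate(confidence: float, hours: float) -> float:
    """One-sided upper bound on the failure rate (per hour)."""
    return -math.log(1.0 - confidence) / hours

for cl in (0.5, 0.7, 0.9, 0.95, 0.99, 0.999):
    lam = upper_bound_rate(cl, T)
    print(f"{cl:6.1%}: {lam * 1e9:6.2f} FIT")

ratio = upper_bound_rate(0.999, T) / upper_bound_rate(0.5, T)
print(f"99.9% bound vs 50% bound: {ratio:.1f}x")
```

In this zero-failure case, moving from a 50% to a 99.9% confidence level multiplies the claimed failure rate by roughly 10x, consistent with the margin discussed above.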

For anybody who doubts that a 99.9% confidence level is sometimes required, see IEC 61508-2:2010 table B.5 on field experience. Low effectiveness is recommended for SIL 1 and 2, and high effectiveness for SIL 4 with a 99.9% confidence level on your reliability data. Even for SIL 1 and 2 it is 95%, and while SIL 3 is not covered, it would seem consistent to expect the reliability data to be at a 99% confidence level.

Figure 2 - An extract from IEC 61508-2:2010 table B.6

I have been thinking about writing a blog like this for some time but couldn't decide how to get my concerns across. It seems odd for a safety guy to be asking whether we are too safe. I'm still not sure I have expressed what I really wanted to say on this topic, but hopefully it's a start, and I will continue to think about it and perhaps do a follow-on blog later.

For the full set of blogs in this series please see here.

Of particular relevance to this topic are:

                Reliability predictions for integrated circuits – see here

                How to change confidence levels – see here

                The cost of implementing safety – see here

                The cost of not implementing safety – see here