How Wide is Your Tolerance Zone?

Clean up on aisle three, clean up on aisle three. Turn the corner for aisle three and you are faced with a robot pacing (sort of) back and forth to prevent shoppers from stepping in or on the spilled item. As you approach, the robot stops well before it runs into you. We know that this is due to a sensor, but exactly how is the robot's stopping range calculated?

This blog is inspired not by supermarket trips but by reading a draft of IEC 61496-4-3 Annex BB. IEC 61496-4-3 is a standard for stereoscopic sensors for human presence detection. The standard discusses what repeatability is required for measurements. The topic is also described in related standards IEC 61496-3 covering 3D TOF and IEC 62988 that discusses similar sensors not included in the scope of the IEC 61496 series.

Figure 1: picture from IEC 61496-3 Annex BB

Let’s start with some background on robot safety. In front of a robot there will be a detection zone to recognize when someone approaches the robot and then stop the robot before the person can reach the hazard. Looking at the picture above, you see that a tolerance zone must be added to the detection zone to ensure that the probability of detecting someone, even in the presence of noise, is sufficiently high given the variability found in the distance measurements. This means that if your safety calculations say you should detect someone at 1 meter from a robot, so that you are guaranteed to have the robot stopped before they reach the hazard, you might need to set the trip point at 1.05, 1.1 or 1.105 meters depending on the repeatability of your distance measurement. But how do we determine the width of that tolerance zone?

Let’s look at the math behind repeatability measurements. If you measure any analog quantity such as distance, voltage, current, power or temperature 1,000 times, you probably won’t get exactly the same value each time unless your sensor resolution is really poor compared to the noise. More likely you will get a set of measurements that can be characterized using a normal distribution. This means that the measurements are described by a mean value and a measure of the spread given by the standard deviation, as shown by the bell-shaped curve in the above graph. If the measurement is truly normal, then 68.3% of the measurements will be within 1 standard deviation of the mean, 95.4% within 2 standard deviations and 99.7% within 3 standard deviations.
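As a quick check of those percentages, here is a minimal Python sketch. It assumes an ideal normal distribution, and the function name fraction_within is my own, chosen for illustration.

```python
import math

def fraction_within(k_sigma: float) -> float:
    """Fraction of an ideal normal distribution within +/- k_sigma of the mean."""
    return math.erf(k_sigma / math.sqrt(2))

# Reproduce the coverage figures quoted for 1, 2 and 3 standard deviations.
for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.1%}")
```

Real sensor noise may have heavier tails than a normal distribution, so these figures are only as good as the normality assumption.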

But how does this relate to SIL?

For a SIL 1 function the probability of dangerous failure per hour (PFH) needs to be < 1e-5/h, for SIL 2 < 1e-6/h, for SIL 3 < 1e-7/h and for SIL 4 < 1e-8/h. Let’s say, for example, this is a SIL 2 function so that the maximum allowed PFH is 1e-6/h. Being right 68%, 95% or 99.7% of the time won’t be enough unless the safety function is demanded (the safety system is tripped) only rarely.

Most people think that PFH only considers random hardware failures, but that isn’t what the definition of PFH says. The definition doesn’t limit itself to random hardware failures and can include noise.

Figure 2: Definition of PFH from IEC 61508-4:2010

Therefore, the definition can include measurement noise.

Let’s suppose our safety function consists of a sensor, a logic block and an actuator. A widely accepted allocation of the PFH is 35% to the sensor, so the maximum PFH allocated to our sensor for a SIL 2 safety function is 3.5e-7/h. Further, let’s allocate 50% of this to random hardware failures and 50% to noise; then the maximum allowed failure rate due to noise is 1.75e-7/h.

Now it gets tricky. Suppose the 3D TOF sensor is protecting a robot in a cage. The safety function might only be called upon to act (demanded) once per day, so the demand rate is 1/24 per hour. However, it could also be a collaborative robot, and the estimated demand rate could be 17.5 times per hour (this demand rate was chosen to make the maths neat). Let’s work with the collaborative robot example, so the maximum allowed probability of failure on demand becomes (1.75e-7/h) / (17.5/h) = 1e-8.
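The budget arithmetic so far can be sketched in a few lines of Python. The 35% sensor share, 50% noise share and the two demand rates are the assumptions used in this blog. The function name is hypothetical.

```python
def allowed_failure_probability_per_demand(
    pfh_limit_per_hour: float,
    sensor_share: float,
    noise_share: float,
    demands_per_hour: float,
) -> float:
    """Noise-related failure probability allowed on each demand."""
    # Portion of the PFH budget given to sensor noise, per hour.
    noise_pfh = pfh_limit_per_hour * sensor_share * noise_share
    # Spread that hourly budget over the expected number of demands.
    return noise_pfh / demands_per_hour

SIL2_PFH_LIMIT = 1e-6  # per hour, upper limit assumed for SIL 2

# Collaborative robot: 17.5 demands per hour (chosen to keep the maths neat).
collaborative = allowed_failure_probability_per_demand(SIL2_PFH_LIMIT, 0.35, 0.5, 17.5)
# Caged robot: one demand per day.
caged = allowed_failure_probability_per_demand(SIL2_PFH_LIMIT, 0.35, 0.5, 1 / 24)

print(f"collaborative robot: {collaborative:.2e} per demand")
print(f"caged robot:         {caged:.2e} per demand")
```

The lower the demand rate, the more failure probability you can tolerate on each individual demand, which is why the caged robot gets a much looser figure.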

Let’s suppose we perform 1,000 measurements of a fixed object at 1m from the 3D TOF sensor and crunch the numbers, getting a standard deviation of 20mm. What guard-band do we need to ensure our probability of failure on demand is < 1e-8? The Excel function norm.inv gives us the answer: norm.inv(1e-8,0,20e-3) returns -0.112, the lower tail of the distribution, so the guard-band is its magnitude, 112mm. Therefore, setting our trip point at 1.112m keeps the failure rate due to noise within the allocated budget, provided the noise really is normally distributed.
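For readers without Excel, the same calculation can be sketched with Python's standard library, where statistics.NormalDist().inv_cdf plays the role of norm.inv. The 20mm sigma, 1m safety distance and 1e-8 target are the example values from this blog. The function name guard_band is my own.

```python
from statistics import NormalDist

def guard_band(sigma_m: float, max_failure_probability: float) -> float:
    """Guard-band in metres so that a one-sided normal tail stays below the target."""
    # inv_cdf returns the (negative) lower-tail quantile; negate it to get a width.
    z = -NormalDist().inv_cdf(max_failure_probability)
    return z * sigma_m

band = guard_band(sigma_m=0.020, max_failure_probability=1e-8)
trip_point = 1.0 + band  # 1 m required detection distance plus the guard-band

print(f"guard-band: {band * 1000:.0f} mm, trip point: {trip_point:.3f} m")
```

This is a sketch under the assumption that the noise is normal all the way out to the tails, which 1,000 measurements cannot confirm on their own.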

Figure 3: an Excel spreadsheet to do the calculations

The Excel spreadsheet above performs these calculations: the user enters the data in the yellow cells and the trip point is given in the green cell. Similar logic could be used to determine guard-bands for temperature, voltage or other such measurements.

For the assumed standard deviation of 20mm, the 112mm guard-band corresponds to 5.6 sigma.

ISO 13849 takes a simplified approach to safety. For example, its risk assessment just determines a PL, and the worst-case PFHd for that PL then becomes the allowed maximum. A quantitative approach would instead give a target value of PFHd somewhere within the PL band (1e-7/h to 1e-6/h for PL d). Therefore, if the risk assessment determines a need for a PL d safety function, the PFHd limit would become 1e-6/h.

IEC 61496-3:2018 Annex BB suggests a tolerance zone of 5 sigma for 3 demands/h but doesn’t give any hints as to why that is good enough. It is probably fine for PL c and below, but for PL d it uses up nearly the entire PFHd budget and for PL e it would be inadequate.
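To see where that comes from, here is a minimal sketch that turns a 5 sigma tolerance zone and 3 demands/h into a failure rate and compares it with the upper PFHd limits of the ISO 13849-1 performance levels. It assumes a one-sided normal tail and, for simplicity, charges the whole PFHd budget to noise (no share reserved for hardware failures or the rest of the safety function).

```python
import math

def one_sided_tail(k_sigma: float) -> float:
    """Probability that a normal measurement falls more than k_sigma to one side."""
    return 0.5 * math.erfc(k_sigma / math.sqrt(2))

DEMANDS_PER_HOUR = 3
failure_rate = one_sided_tail(5) * DEMANDS_PER_HOUR  # failures per hour due to noise

# Upper PFHd limits per hour for each performance level.
PFHD_LIMITS = {"PL c": 3e-6, "PL d": 1e-6, "PL e": 1e-7}

print(f"noise failure rate at 5 sigma: {failure_rate:.2e} /h")
for level, limit in PFHD_LIMITS.items():
    print(f"{level}: uses {failure_rate / limit:.0%} of the PFHd limit")
```

With these assumptions the noise alone consumes most of the PL d limit and several times the PL e limit, which matches the concern above.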

For complex systems I really think they should be designed to IEC 61508, and both ISO 13849 and IEC 62061 state as much. I think people are still viewing those two as design standards rather than application standards, but perhaps that is a blog for another day.

Other implications of the above:

• If you have multiple sources of variation contributing to the overall system repeatability, a linear sum of error sources is not required. An RSS (root sum of squares) sum should be fine provided the error sources are uncorrelated, because the variances of uncorrelated sources add. See the central limit theorem for why the combined error also tends towards a normal distribution.
• Rather than quoting min/max specs on a datasheet, it might be better to quote the sigma of the distribution.
• If you only have min/max specifications, then the sigma could be estimated as (maximum specification value – minimum specification value)/6 if no other information is given. In some cases, however, this might be very conservative.
• If the sensor needs to be regularly calibrated, then the repeatability measurement should really include a calibration per measurement. And why limit the measurement to just one device? Shouldn’t it really be 1,000 devices?
• For something like a 3D TOF sensor, the sigma of the measurement may vary with the distance to the test object. Therefore, multiple sigma values might be needed.
• It could be argued that even IEC 61508 has issues, in that the confidence level required for reliability numbers is only 70%. Therefore, there is a likelihood that the PFH values are being violated by the hardware reliability, and applying such rigor to the noise measurements is wasted. (I need to think about this a bit more.)
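The first and third bullets can be illustrated with a short sketch. The three error sources and the ±30mm datasheet limits below are made-up numbers for illustration only, and the function names are my own.

```python
import math

def rss(sigmas):
    """Root-sum-of-squares combination, valid for uncorrelated error sources."""
    return math.sqrt(sum(s * s for s in sigmas))

def sigma_from_limits(min_spec: float, max_spec: float) -> float:
    """Estimate sigma by treating the min/max limits as roughly +/- 3 sigma."""
    return (max_spec - min_spec) / 6

# Hypothetical, uncorrelated error sources in mm.
sources_mm = [20.0, 10.0, 5.0]

combined_rss = rss(sources_mm)
combined_linear = sum(sources_mm)

print(f"RSS sum:    {combined_rss:.1f} mm")
print(f"linear sum: {combined_linear:.1f} mm")
print(f"sigma from +/-30 mm limits: {sigma_from_limits(-30.0, 30.0):.1f} mm")
```

The RSS result is dominated by the largest source and is noticeably smaller than the linear sum, which is why the choice matters when the guard-band is a multiple of sigma.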

You sometimes need to be careful as to which part of a sensor is important. For instance, in a 4-20mA DAC (digital to analog converter) with an ADC (analog to digital converter) as a diagnostic, is it the repeatability of the DAC that matters or that of the ADC? I will let you figure that out for yourself.

One thing, however, is still bothering me. The reason we decided we needed SIL 2 is because we did a risk assessment. But isn’t the demand rate already factored into that, and so is it correct to include it again here? Doesn’t the demand rate affect the hazardous event rate rather than the PFH? I will have to do a part 2 of this blog to explore. How long the part 2 blog takes depends on how long it takes me to convince myself I have understood the topic. All comments and suggestions welcome.