
Availability vs Safety

The blogs in this series mostly concentrate on safety with excursions into related areas such as industrial cyber security. In this blog, I will talk about concerns regarding availability.

Availability sounds like a simple concept, but once you dig into it, it is more complicated than you might think and, for me, it is not well covered in the safety standards or indeed in other industrial literature.

Availability becomes important when the cost of shutting down a system is too high. For instance, in machine safety, if we detect a failure in the safety system, we shut down the machine. Typically, the cost of shutting down a machine is not too high: never a great idea, but acceptable in the name of safety. However, in an oil refinery, the cost of a shutdown can be very high, so while we need the safety, we also need a backup plan so that production can continue if the safety system misbehaves, i.e. we need availability.

To illustrate the difference between safety and availability, here are a couple of examples. If I go around to your house and cut the wire to the ignition switch on your car, your car will now be very safe (if you can’t drive, you can’t crash) but its availability will be zero (I guess you could still sleep in it).

Conversely, suppose you have a robot application protected by a light curtain. If you bypass the light curtain, the availability of the robot will improve because it no longer shuts down when the operator breaks the light curtain, but the safety will tend towards zero.

These two silly examples highlight that you can implement a system with independent levels of safety and availability.

Next, let’s answer some basic questions.

Why do systems lead to someone getting hurt?

  • The equipment under control does something dangerous AND the safety system has failed in such a way that it can’t react now that a dangerous situation has arisen – i.e. a safety demand in the presence of a failure of the safety system.

Why do we lose availability?

  • The safety system fails dangerous detected and the equipment under control is shut down.
  • The safety system fails safe and the equipment under control is shut down.
  • The equipment under control itself fails to function.

The third cause is not a function of the safety system and so can be ignored for this blog. When I talk about availability here, I am concerned with the first two causes. For those two, the assumption is that the safety system will be repaired and the equipment will become operational again after some repair time.

This leads to the following definition of availability found in IEC 61069-5.

Figure 1 definition of availability from IEC 61069-5
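
The figure itself does not survive in this text version. Going by the discussion that follows, the definition appears to take the familiar form below. Treat it as a reconstruction from the surrounding text, not a quotation from the standard.

```latex
\text{availability} =
  \frac{\text{mean time to failure}}
       {\text{mean time to failure} + \text{mean time to restoration}}
```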

However, I note this equation uses “mean time to failure”. It doesn’t restrict the failures to dangerous detected or safe detected. Without those restrictions it also includes dangerous undetected failures. Such failures don’t shut down the equipment under control and so have no impact on availability (although you could argue that, since we mean safety availability, the equipment under control is operating but not operating safely).

So, for me, I would replace “mean time to failure” with “mean time to spurious trip”, where a spurious trip is a trip caused not by a demand on the safety system but by a failure in the safety system. You could argue that a dangerous undetected failure means the safety system is unavailable, but for me that situation is already included in the PFH/PFD value.

So, safety availability is covered by PFH/PFD, and production availability by mean time to spurious trip / (mean time to spurious trip + mean time to restoration). You can then look for architectures which emphasise one or the other or both.
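
As a minimal sketch of the production availability ratio described above, the Python below evaluates it for made-up numbers. The function name, parameter names and the example figures (50,000 hours between spurious trips, 8 hours to restore) are assumptions for illustration only and are not taken from any standard or product.

```python
def production_availability(mean_time_to_spurious_trip_h: float,
                            mean_time_to_restoration_h: float) -> float:
    """Fraction of time production is not stopped by a spurious trip.

    Implements: MTTST / (MTTST + MTTR), with both times in hours.
    """
    if mean_time_to_spurious_trip_h <= 0:
        raise ValueError("mean time to spurious trip must be positive")
    if mean_time_to_restoration_h < 0:
        raise ValueError("mean time to restoration cannot be negative")
    return mean_time_to_spurious_trip_h / (
        mean_time_to_spurious_trip_h + mean_time_to_restoration_h
    )


# Hypothetical figures, chosen only to show the calculation.
availability = production_availability(50_000.0, 8.0)
print(f"production availability = {availability:.6f}")
```

Note that dangerous undetected failures do not appear anywhere in this calculation; in the approach above they are accounted for on the safety side through PFH/PFD.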

While the level of safety achieved might be expressed with a PFH (probability of dangerous failure per hour) or PFD (probability of failure on demand), availability is usually expressed as a percentage or fraction such as 99.999% or 0.99999 (both showing 5 nines availability). Given there are approx. 10k hours in a year (8,760 to be exact), 5 nines availability means a downtime of 10,000 × (1 − 0.99999) = 0.1 hours/year (6 minutes/year), or about 5.3 minutes/year using the exact figure.
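
The downtime arithmetic is easy to reproduce. The short Python sketch below converts a number of nines into minutes of downtime per year; the function names are my own, and it defaults to 8,760 hours per year (365 × 24) where the rough estimate in the text rounds to 10k.

```python
HOURS_PER_YEAR = 8760.0  # 365 days x 24 hours; the text rounds this to 10k


def availability_from_nines(nines: int) -> float:
    """E.g. 5 nines -> 0.99999."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(availability: float,
                              hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Expected downtime per year, in minutes, for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return hours_per_year * (1.0 - availability) * 60.0


for nines in range(2, 6):
    a = availability_from_nines(nines)
    print(f"{nines} nines ({a:.5f}): "
          f"{downtime_minutes_per_year(a):.1f} minutes/year")
```

Passing `hours_per_year=10_000` reproduces the 6 minutes/year estimate for 5 nines.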

Now let’s look at some architectures which can be used to achieve our safety and availability goals.

Redundant architectures are often described as MooN, where N is the total number of units present and M is the number of those N units required for a system trip to occur, e.g. in a 1oo2 system there are two items and any one of them on its own is sufficient to trip the system.

If we describe a simple non-redundant safety system as 1oo1, a high safety system is something that has a lower dangerous failure rate than 1oo1, and a high availability system is something that can keep production operational for more of the time than 1oo1. The holy grail is something which offers both high safety and high availability: an architecture which is both safer than a 1oo1 architecture and has higher availability.
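
To make the MooN notation concrete, here is a minimal Python sketch of the voting rule: the system trips when at least M of the N channels demand a trip. The function name `moon_trip` and the use of a list of booleans to represent channels are assumptions for illustration; a real voter would also deal with diagnostics and channel faults.

```python
from typing import Sequence


def moon_trip(channel_demands: Sequence[bool], m: int) -> bool:
    """Return True if an MooN voter trips.

    channel_demands holds one boolean per channel (N = its length);
    True means that channel is demanding a trip.
    """
    n = len(channel_demands)
    if not 1 <= m <= n:
        raise ValueError("M must be between 1 and N")
    # Count the channels demanding a trip and compare against M.
    return sum(channel_demands) >= m


# One of two channels demanding a trip:
print("1oo2 trips:", moon_trip([True, False], m=1))        # any one is enough
print("2oo2 trips:", moon_trip([True, False], m=2))        # needs both
print("2oo3 trips:", moon_trip([True, True, False], m=2))  # needs two of three
```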

The table below shows a comparison of typical architectures and shows how some achieve high safety (higher than 1oo1) and some high availability (higher than 1oo1).

Figure 2 - a comparison of safety architectures compared to a simple 1oo1 architecture.

As an example of how to read the table, for 1oo2 you will get higher safety, since either of the two redundant items can demand the safety function, but your availability will in fact be worse because a safe or dangerous detected failure in either of the two will cause a spurious trip of the safety system.

For 2oo2 you don’t get an increase in safety. In fact, you get slightly less safety than 1oo1, since both redundant items need to demand the safety function before it activates. However, you will get higher availability, since a spurious failure in either channel won’t lead to a trip and subsequent shutdown of the equipment under control.

From the table you can see that architectures such as 1oo2D (see previous blogs here and here) and 2oo3 achieve both high safety and high availability i.e. better in both regards compared to 1oo1.
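
The trade-offs described above can be sketched with a deliberately simplified probability model. The Python below assumes identical, independent channels with no common cause failures, no diagnostics and no repair, so it cannot represent 1oo2D, and the 1% per-channel probabilities are invented purely for illustration. Under those assumptions, an MooN system fails to trip on demand when at least N−M+1 channels have failed dangerously, and trips spuriously when at least M channels spuriously demand a trip.

```python
from math import comb


def at_least_k_of_n(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent channels are affected."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def moon_probabilities(m: int, n: int, p_dangerous: float, p_spurious: float):
    """Return (P fail on demand, P spurious trip) for an MooN voter."""
    # Fewer than m healthy channels remain once n - m + 1 have failed dangerously.
    p_fail_on_demand = at_least_k_of_n(n - m + 1, n, p_dangerous)
    # A spurious trip needs m channels to demand a trip by mistake.
    p_spurious_trip = at_least_k_of_n(m, n, p_spurious)
    return p_fail_on_demand, p_spurious_trip


# Hypothetical per-channel probabilities, for illustration only.
P_DANGEROUS = 0.01
P_SPURIOUS = 0.01

results = {}
for name, m, n in [("1oo1", 1, 1), ("1oo2", 1, 2), ("2oo2", 2, 2), ("2oo3", 2, 3)]:
    results[name] = moon_probabilities(m, n, P_DANGEROUS, P_SPURIOUS)
    pfd, pst = results[name]
    print(f"{name}: P(fail on demand) = {pfd:.6f}, P(spurious trip) = {pst:.6f}")
```

With these inputs the ordering matches the discussion above: 1oo2 beats 1oo1 on safety but is worse on spurious trips, 2oo2 is the reverse, and 2oo3 improves on 1oo1 in both respects. Real calculations would need failure rates, proof test intervals, diagnostic coverage and common cause factors.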

There is an excellent Exida Video available here which puts numbers on the availability and safety of some of the architectures in the table above.

I hope you have enjoyed this blog.

To learn more – see an excellent presentation here.

For previous blogs in this series see here.

For the full suite of ADI blogs see here. Register in the community and turn on blog notifications to get the latest blog updates.

For the full range of ADI products see here