Checking safety system in a factory setting

Availability vs Safety

The blogs in this series mostly concentrate on safety with excursions into related areas such as industrial cyber security. In this blog, I will talk about concerns regarding availability.

Availability sounds like a simple concept, but once you start digging into it, it's more complicated than you might think and, in my view, not well covered in the safety standards or indeed the wider industrial literature.

Availability becomes important when the cost of shutting down a system is too high. In machine safety, for instance, if we detect a failure in the safety system, we shut down the machine. Typically, the cost of shutting down a machine is not too high; never a great idea, but acceptable in the name of safety. However, in an oil refinery the cost of a shutdown can be extremely high, so while we need the safety, we also need a backup plan so that production can continue if the safety system misbehaves, i.e., we need availability.

To illustrate the difference between safety and availability, here are a couple of examples. If I come around to your house and cut the wire to the ignition switch on your car, your car will now be very safe (you can't drive, so you can't crash), but its availability will be zero (I guess you could still sleep in it).

Conversely, suppose you have a robot application protected by a light curtain. If you bypass the light curtain, the availability of the robot will improve because it stops shutting down when the operator breaks the light curtain, but the safety will tend towards zero.

These two silly examples highlight that you can implement a system with independent levels of safety and availability.

Next, let’s answer some basic questions.

Why do systems lead to someone getting hurt?

  • The equipment under control does something dangerous AND the safety system has failed in such a way that it can’t react now that a dangerous situation has arisen – i.e. a safety demand in the presence of a failure of the safety system.

Why do we lose availability?

  1. The safety system fails dangerous detected and the equipment under control is shut down.
  2. The safety system fails safe and the equipment under control is shut down.
  3. The equipment under control itself fails to function.

Number 3) is not a function of the safety system and so can be ignored for this blog. When I talk about availability in this blog, I am concerned with 1) and 2). For 1) and 2) the assumption is that the safety system will be repaired, and the equipment will become operational again after some repair time.

This leads to the following definition of availability found in IEC 61069-5.

Figure 1 - definition of availability from IEC 61069-5

However, I note this equation uses "mean time to failure". It doesn't restrict the failures to dangerous detected or safe detected. Without those restrictions it also includes dangerous undetected failures; such failures don't shut down the equipment under control and so have no impact on availability (although you could argue that since we mean safety availability, even though the equipment under control is operating, it is not operating safely).

So, for me, I would replace "mean time to failure" with "mean time to spurious trip", where a spurious trip is a trip caused not by a demand on the safety system but rather by a failure in the safety system. You could argue that a dangerous undetected failure means the safety system is unavailable, but for me that situation is already captured in the PFH/PFD value.

So, safety availability is covered by PFH/PFD, and production availability by mean time to spurious trip / (mean time to spurious trip + mean time to restoration). You can then look for architectures which emphasize one or the other or both.
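The production-availability side of that split can be sketched numerically. A minimal Python sketch (my illustration; the function name and the numbers are made up for the example):

```python
def production_availability(mtts_hours: float, mttr_hours: float) -> float:
    """Fraction of time the equipment under control is running,
    counting only shutdowns caused by the safety system itself:
    availability = MTTS / (MTTS + MTTR)."""
    return mtts_hours / (mtts_hours + mttr_hours)

# Illustrative numbers: one spurious trip every 5 years on average,
# 8 hours to restore production after each trip.
mtts = 5 * 8760   # mean time to spurious trip, hours
mttr = 8          # mean time to restoration, hours
print(f"Production availability: {production_availability(mtts, mttr):.5f}")
```

Note that the PFH/PFD (safety availability) does not appear here at all; under the argument above, dangerous undetected failures don't stop production and so don't enter this calculation.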

While the level of safety achieved might be expressed with a PFH (probability of dangerous failure per hour) or PFD (probability of failure on demand) availability is expressed as 99.999% or 0.99999 (both showing 5 nines availability). Given there are approx. 10k hours in a year, 5 nines availability means a downtime of 10k (1-0.99999) =0.1 hours/year (6 minutes/year).
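The downtime arithmetic above is easy to check; the 10,000 hours/year figure is the blog's own round-number approximation (a non-leap year is actually 8,760 hours):

```python
availability = 0.99999           # five nines, as in the text

# Using the blog's round-number approximation of 10,000 hours/year:
downtime_hours = 10_000 * (1 - availability)
print(f"{downtime_hours:.1f} hours/year")         # 0.1 hours/year
print(f"{downtime_hours * 60:.0f} minutes/year")  # 6 minutes/year

# With the exact 8,760 hours in a non-leap year:
print(f"{8760 * (1 - availability) * 60:.1f} minutes/year")  # 5.3 minutes/year
```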

Now let’s look at some architectures which can be used to achieve our safety and availability goals.

Redundant architectures are often described as MooN, where N is the total number of units present and M is the number of those N units required for a system trip to occur. For example, in a 1oo2 system there are two items and any one of them on its own is sufficient to trip the system.
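The MooN rule as defined above (M of N channels demanding a trip causes a trip) fits in a few lines of Python; `moon_trip` is a name of my own invention, not anything from a standard:

```python
def moon_trip(m: int, channel_votes: list[bool]) -> bool:
    """MooN voting: the system trips when at least M of the N
    channels demand a trip."""
    n = len(channel_votes)
    assert 1 <= m <= n, "need 1 <= M <= N"
    return sum(channel_votes) >= m

# 1oo2: either channel on its own is enough to trip the system.
print(moon_trip(1, [True, False]))   # True
# 2oo2: both channels must agree before the system trips.
print(moon_trip(2, [True, False]))   # False
```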

If we describe a simple non-redundant safety system as 1oo1, then a high safety system is one with a lower dangerous failure rate than 1oo1, and a high availability system is one that can keep production operational for more of the time than 1oo1. The holy grail is an architecture which offers both: safer and with higher availability than 1oo1.

The table below shows a comparison of typical architectures and shows how some achieve high safety (higher than 1oo1) and some high availability (higher than 1oo1).

Figure 2 - a comparison of safety architectures against a simple 1oo1 architecture.

As an example of how to read the table: for 1oo2 you get higher safety, since either of the two redundant items can demand the safety function, but your availability will actually be worse, because a safe or dangerous detected failure in either of the two will cause a spurious trip of the safety system.

For 2oo2 you don't get an increase in safety; in fact, you get slightly less safety than 1oo1, since both redundant items need to demand the safety function before it's active. However, you do get higher availability, since a spurious failure in either channel won't lead to a trip and subsequent shutdown of the equipment under control.

From the table you can see that architectures such as 1oo2D (see previous blogs here and here) and 2oo3 achieve both high safety and high availability i.e. better in both regards compared to 1oo1.
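To see concretely why 2oo3 delivers on both counts, here is a small sketch of my own (the function name `vote_2oo3` is illustrative): a single spuriously failed channel cannot trip production on its own, yet a real demand seen by the two healthy channels still trips the system.

```python
def vote_2oo3(votes: list[bool]) -> bool:
    """2oo3: trip when at least two of the three channels demand it."""
    return sum(votes) >= 2

# Availability: one channel fails spuriously high (demands a trip with
# no real demand); the other two outvote it, so no spurious shutdown.
print(vote_2oo3([True, False, False]))   # False: production keeps running

# Safety: a real demand is seen by the two healthy channels even though
# the third channel has failed and sees nothing; the system still trips.
print(vote_2oo3([True, True, False]))    # True: safety function executes
```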

There is an excellent Exida Video available here which puts numbers on the availability and safety of some of the architectures in the table above.

I hope you have enjoyed this blog.

To learn more – see an excellent presentation here.

For previous blogs in this series, see here.

For the full suite of ADI EngineerZone blogs, see here. Register in the community and turn on blog notifications for the latest updates.

For the full range of ADI products, see here.

  • Hi John, distributed systems have something similar to the MooN structure of functional safety. Here's the answer from ChatGPT:

    The CAP Theorem and the 2 out of 3 structure in functional safety address different concepts, but there are some analogous principles worth exploring.

    ### CAP Theorem

    The **CAP Theorem** applies to distributed systems and states that a distributed system can achieve at most two of the following three properties simultaneously:

    1. **Consistency**: All nodes see the same data at the same time.
    2. **Availability**: Every request receives a response, even if some nodes fail.
    3. **Partition Tolerance**: The system continues to operate despite network partitions or communication failures.

    ### 2 out of 3 Structure in Functional Safety

    In the context of functional safety, particularly in standards like ISO 26262, the **2 out of 3 structure** refers to a fault-tolerant architecture where:

    1. **Two out of Three Redundancy**: The system uses three redundant components (e.g., sensors, processors) to achieve high reliability. The system can tolerate the failure of one component while still maintaining functionality.

    2. **Voting Mechanism**: The system employs a voting mechanism to ensure that the output is based on the majority, thus allowing it to operate correctly even if one of the three components fails.

    ### Analogies and Differences

    **Similarities:**

    1. **Trade-Offs and Redundancy**:
    - Both concepts deal with trade-offs and redundancy. The CAP Theorem deals with trade-offs between consistency, availability, and partition tolerance in distributed systems. The 2 out of 3 structure deals with redundancy and fault tolerance to ensure reliability in safety-critical systems.

    2. **System Resilience**:
    - Both aim to enhance system resilience. The CAP Theorem helps design distributed systems that can handle network partitions and failures while balancing consistency and availability. The 2 out of 3 structure ensures that the system remains operational despite individual component failures.

    **Differences:**

    1. **Scope**:
    - The CAP Theorem specifically applies to distributed systems and their handling of network partitions. It’s a theoretical model used to understand trade-offs in system design.
    - The 2 out of 3 structure is a practical engineering approach to fault tolerance in safety-critical systems, aiming to maintain functionality and reliability despite individual component failures.

    2. **Focus**:
    - The CAP Theorem focuses on system properties related to data consistency and availability in the presence of network issues.
    - The 2 out of 3 structure focuses on hardware and component redundancy to ensure that the system meets safety requirements and maintains operational integrity despite component failures.

    ### Application to Functional Safety

    While the CAP Theorem itself is not directly applied to functional safety, the underlying principles of trade-offs and redundancy are relevant. Functional safety systems often require high reliability and fault tolerance, and understanding the trade-offs between different system properties (such as availability and fault tolerance) can help in designing robust safety-critical systems. In both cases, the goal is to balance and manage various aspects of system performance to achieve desired outcomes, whether it's in distributed systems or safety-critical applications.

  • Distributed systems people like to think about availability as one of three attributes -- Consistency, Availability, and Partition Tolerance (CAP) -- which led to work on how realistic it is to expect all three attributes to be true at the same time: en.wikipedia.org/.../CAP_theorem

    I wonder if there is a connection with safety/availability.