Functional Safety & The Cloud - Part 2

Some time ago I did a functional safety and the cloud part 1 blog, and I promised a part 2 on the guidance available within the standards for cloud-based safety, if such a thing is even possible. It turns out there is quite a bit of guidance available, but it is distributed across several standards.

This blog was to be based on a paper I had submitted to an IoT conference being held in Limerick during April. Unfortunately, the paper was declined but the good news is I am now free to convert it into a blog. This isn’t the full paper but most of it is included if somewhat reordered and the language made more “bloggy”.  I debated removing some of the introductory material on safety standards and the three key requirements for functional safety but in the end, I decided to leave them in. It makes for a lengthy blog but a worthy read.

In the paper I didn’t try to distinguish between a local cloud and a remote cloud. In general, I also assumed there was no edge processing and that the logic block of the safety function was to be implemented in the cloud. Also, please don’t ask why I didn’t include TSN or some other technology. I didn’t have time to cover everything, and if you do need to know about TSN I have colleagues I can refer you to; just write a comment below.

Functional safety is that part of safety which deals with the confidence that an electrical/electronic based system will carry out its safety related task when asked to do so. 

The main non-sector specific functional safety standard is IEC 61508. IEC 61508 is a basic safety standard, that is, a standard that is not sector specific and can be tailored to other sectors. The first released version of IEC 61508 came in 1998 and revision two arrived in 2010, with revision three coming along in 2021 or thereabouts. With a time between generations of approximately 11 years, I guess it is not surprising that safety would trail the technology, and perhaps that is how it should be for safety. In that case, once the technology is proven in less conservative functions it can then be ported to safety applications, where it can either enhance the achieved level of safety or facilitate new applications.

From IEC 61508 many sector specific standards have been derived including some shown below but there are additional standards shown in green that while not derived from IEC 61508 uphold the same principles.

The foundation of safety according to IEC 61508 is the safety function. A safety function is a function to be implemented by a system that is intended to take a system to a safe state to prevent a specific hazard event. Examples of safety functions include

  • Stop a robot if somebody comes too close
  • Stop filling a tank if it is in danger of overflowing
  • Deploy an airbag in the event of a crash
  • Vent a tank if the pressure in it is too high

Safety functions have the following properties

  • A safe state
  • A maximum time in which to reach the safe state
  • A safety integrity level (SIL)

Safety standards generally have three key requirements to meet for a given safety integrity level. In all instances safety must not just be achieved, it must also be possible to make a safety case for what has been achieved. Independent assessment of any claims is commonly required. This may include assessment by bodies such as one of the TÜVs or exida.

The first requirement is to be reliable. While reliability is not sufficient to achieve safety, and it is possible to build a safety system out of unreliable components, having good reliability is a great start. If components are failing unexpectedly, prematurely or too frequently, the chances of the safety system completing its safety related task are slim.

Functional safety according to IEC 61508 has four Safety Integrity Levels (SILs), which give a rough order of magnitude increase in safety as you progress from one level to the next. For each there is a maximum allowed probability of dangerous failure per hour (PFH) as shown below.

  • SIL 1 < 1e-5/h
  • SIL 2 < 1e-6/h
  • SIL 3 < 1e-7/h
  • SIL 4 < 1e-8/h
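
As a rough illustration, the limits above can be turned into a small lookup. This is only a sketch of the hardware reliability metric; the function name is my own, and a real SIL claim also needs the fault tolerance and systematic requirements discussed in the rest of this blog.

```python
# Upper PFH limits (dangerous failures per hour) taken from the list above.
SIL_PFH_LIMITS = {1: 1e-5, 2: 1e-6, 3: 1e-7, 4: 1e-8}


def highest_sil_for_pfh(pfh_per_hour):
    """Return the highest SIL whose PFH limit the value stays below, or 0 for none."""
    achieved = 0
    for sil in sorted(SIL_PFH_LIMITS):
        if pfh_per_hour < SIL_PFH_LIMITS[sil]:
            achieved = sil
    return achieved


# A safety function with a PFH of 5e-8/h meets the SIL 3 limit but not SIL 4.
print(highest_sil_for_pfh(5e-8))
```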

Most safety functions are implemented using three basic subsystems (sensor, logic and actuator), and a common budget for the maximum allowed failure rate is shown below.

Fig. 2. Error budget allocation for a typical safety system especially as found in the process industries

The combined sensor, logic and actuator must be able to take the system to a safe state before harm occurs.

A second requirement comes from the fact that no matter how reliable the components, there is still a certain level of failure. Hopefully low, but perhaps not low enough. Therefore IEC 61508 imposes hardware fault tolerance requirements in the form of redundancy and a minimum level of diagnostic coverage. The standard even allows a trade-off between the two, so that a SIL 3 safety system can be implemented with two channels each with 90% SFF (Safe Failure Fraction, a measure closely related to diagnostic coverage) or with a single channel with 99% SFF.
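
The trade-off can be sketched as a table lookup. The values below are my reading of the architectural constraints for complex (Type B) elements in IEC 61508-2:2010 Table 3, and the function name is my own; treat it as an illustration and check the standard itself before relying on any entry.

```python
# Maximum claimable SIL for a complex (Type B) element, indexed by the lower
# bound of the SFF band and by hardware fault tolerance (HFT) 0, 1, 2.
# 0 means "not allowed". Values are an assumption to be checked against
# IEC 61508-2:2010 Table 3.
TYPE_B_MAX_SIL = [
    (0.99, (3, 4, 4)),
    (0.90, (2, 3, 4)),
    (0.60, (1, 2, 3)),
    (0.00, (0, 1, 2)),
]


def max_sil(sff, hft):
    """Return the highest SIL allowed for a given SFF (0..1) and HFT (0, 1, 2...)."""
    if not 0.0 <= sff <= 1.0:
        raise ValueError("sff must be between 0 and 1")
    if hft < 0:
        raise ValueError("hft must not be negative")
    column = min(hft, 2)  # the table stops at HFT = 2
    for lower_bound, sils in TYPE_B_MAX_SIL:
        if sff >= lower_bound:
            return sils[column]


# Two channels (HFT = 1) at 90% SFF and one channel (HFT = 0) at 99% SFF
# both land on SIL 3.
print(max_sil(0.90, 1), max_sil(0.99, 0))
```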

The third key requirement relates to design errors. Design errors are known as systematic errors in functional safety standards. They are different from random hardware errors in that if a certain set of circumstances arise a systematic error will cause a failure with 100% probability. Systematic errors can only be fixed by a design change. All software failures are systematic failures. To prevent and catch such errors if they do occur, IEC 61508 advocates a range of techniques including design reviews, the use of coding standards, consideration of environmental conditions such as temperature and EMC as well as minimum levels of competence.

Cloud computing is based on having a large number of configurable computers available on a network that can be made available at short notice to users and applications. This paper assumes this bank of computers is available over the internet, but the computers could also be available in a local private cloud and sometimes the bank might not be so large.  It is assumed that the cloud is supported by high speed networks. Cloud based systems allow data to be aggregated and mined for hidden information. This previously unavailable data can give insights relating to productivity, wear out and even cyber security related events. Some processing may still be done at the edge (down at the sensors) to reduce the amount of data to be exported to the cloud but that is not under discussion here.

Figure 3 - cloud based safety system concept

Advantages of cloud-based processing include:

  • Data fusion in the cloud
  • The processing power available in the cloud
  • Scalability in the cloud
  • The more benign operating environment available in the cloud
  • The availability of power as distinct from having to operate from batteries or energy harvesting

The last one has a direct functional safety relevance to one of the three key requirements of functional safety.

As stated earlier, most safety systems include a sensor to measure something, a logic block to make a decision on the sensed values and an actuator to take the system to a safe state. The most likely scenario for safety in the cloud is that the sensor and actuators remain as edge nodes and the logic block is in the cloud, where more processing power, more storage and the ability to combine the data from the edge nodes are available. This contrasts with present day safety systems, where simplicity is king and local safety systems rule.

To give a concrete example let us imagine a robot application according to the ISO 10218 series. Suppose a 3D TOF sensor is mounted on or near the robot. Instead of making a local decision on the data it is transmitted to the cloud where a powerful processor analyzes the images to decide if there are any objects in safeguarded space. The processor then classifies those objects to see if they are human or a support pillar and if human in which direction they are travelling before sending a stop command down to the robot if necessary. Perhaps using the cloud, the data from multiple sensors on other robots around the factory could somehow be aggregated to achieve a more advanced safety decision; including knowing that there are only three operators on the floor and they are all accounted for elsewhere. It could even know which operators have been trained specifically in robot safety and are therefore at somewhat less risk than the others but this might be troublesome under data privacy laws. In theory instead of using the distances from ISO 13855 it could even know the characteristics of the individual operators and so reduce safety distances further.

Robot based safety in the cloud is a bit of a stretch, but how about other areas such as process control, power transmission and distribution, the rail industry and traffic lights, which are already largely distributed systems? Others that don’t appear distributed today, such as automotive and medical, may become distributed in the future. With the pace at which technology is moving, including the advent of 5G, it is hard to predict when any of these may have a requirement for safety in the cloud, but these days technology seems to advance faster than expected.

Some of these systems could have a maximum processing time of 100 ms, with a shorter time being more advantageous since you could allow a human to come closer and still be confident of stopping the robot before contact was made. The robot application has a much shorter process safety time than, say, something used to check for over-fill of an oil tank in the process industries, but the tougher case is useful here to expose the issues.
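
To see why the processing time matters, here is a minimal sketch based on the general ISO 13855 relation S = K x T + C, where K is the approach speed, T the overall response time and C an intrusion allowance. The 1600 mm/s walking speed is the value ISO 13855 uses for whole-body approach; the timing breakdown and the zero value for C are made-up numbers for illustration only.

```python
# Walking approach speed K used by ISO 13855 for whole-body approach, in mm/s.
APPROACH_SPEED_MM_PER_S = 1600.0


def minimum_distance_mm(response_times_s, intrusion_allowance_mm=0.0):
    """S = K * T + C, with T the sum of all contributions to the response time."""
    total_time_s = sum(response_times_s.values())
    return APPROACH_SPEED_MM_PER_S * total_time_s + intrusion_allowance_mm


# Hypothetical timing breakdown for a cloud-based stop function, in seconds.
timing_s = {
    "sensor": 0.03,
    "uplink": 0.01,
    "cloud_processing": 0.10,
    "downlink": 0.01,
    "robot_stop": 0.30,
}

distance_mm = minimum_distance_mm(timing_s)
cloud_share_mm = APPROACH_SPEED_MM_PER_S * timing_s["cloud_processing"]
print(f"separation: {distance_mm:.0f} mm, of which cloud processing: {cloud_share_mm:.0f} mm")
```

With these assumed numbers, 100 ms of cloud processing alone accounts for roughly 160 mm of the separation distance, which is why every millisecond saved lets the human come closer.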

Finally, we get to look at what guidance is presently available in the standards. While there is nothing specific to new technologies such as the cloud in the present safety standards some of the available information can be interpreted to suit the problems at hand. This section presents some of the guidance which is relevant.

Let’s start with guidance on networks. IEC 61508 advocates a black channel approach to the safety of networks and references IEC 61784-3 and the IEC 62280 series. In the black channel approach standard components are used in the network, including all bridges and gateways, and safety is assured using an SCL (safety communications layer) at each end.

Figure 4 - Threats and defences from IEC 62280

The SCL needs to implement at least one of the indicated defenses against each of the threats. IEC 62280 goes on to define various categories of transmission systems, with the effort put into the defenses depending on the category. Anything involving the cloud would imply a category 3 transmission system, as it is out in public, and therefore all the measures would need to be aggressively implemented.

The difference between the black channel and the cloud is that in a normal black channel you are trying to prove that the data that left the transmitting node arrives unchanged, in the correct order and in a timely manner at the receiving node. With the cloud model you have the networking and you have the standard safety components, but the whole point of the cloud is that it should modify the data in some fashion. This is where the black channel analogy breaks down. However, if you view the up-link and the down-link as separate conduits you could still apply the black channel approach to each.
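
A minimal sketch of what a safety communications layer on one of those conduits might look like is given below. The frame layout, function names and the use of CRC-32 are my own assumptions for illustration; real safety protocols specify their own CRC polynomials, and a category 3 network would also need cryptographic protection, which is not shown.

```python
import struct
import zlib

# Hypothetical frame layout: sequence number (uint32), send time in seconds
# (float64), payload, then a CRC-32 over everything before it.
_HEADER = struct.Struct(">Id")
_CRC = struct.Struct(">I")


def wrap(seq, send_time_s, payload):
    """Add sequence number, time stamp and CRC around the payload."""
    body = _HEADER.pack(seq, send_time_s) + payload
    return body + _CRC.pack(zlib.crc32(body))


def unwrap(frame, expected_seq, now_s, max_age_s):
    """Return the payload, or raise ValueError if any defense trips."""
    if len(frame) < _HEADER.size + _CRC.size:
        raise ValueError("frame too short")
    body, crc_bytes = frame[:-_CRC.size], frame[-_CRC.size:]
    if zlib.crc32(body) != _CRC.unpack(crc_bytes)[0]:
        raise ValueError("corruption detected")
    seq, send_time_s = _HEADER.unpack(body[:_HEADER.size])
    if seq != expected_seq:
        raise ValueError("repetition, deletion, insertion or re-sequencing detected")
    if now_s - send_time_s > max_age_s:
        raise ValueError("delay detected")
    return body[_HEADER.size:]


frame = wrap(7, 100.00, b"STOP")
print(unwrap(frame, expected_seq=7, now_s=100.02, max_age_s=0.05))
```

The receiver treats any raised error as a reason to go to the safe state; the network in between needs no safety properties of its own.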

Next let’s see what cyber security guidance is available. If you are not secure, then you cannot be safe. The IEC 61508 functional safety standard references IEC 62443 for all cyber security concerns. This standard relies on network segmentation based on a zones and conduits model and it was covered in an earlier blog, so I will not dwell on it here other than to say systems talking to a remote cloud server would probably need to be at SL (security level) 3.

With regard to hardware reliability requirements, each SIL comes with a maximum PFH (or PFD). For SIL 3 the maximum allowed PFH is 1e-7/h. A typical error budget for this is 35% to the sensor, 15% to the logic and 50% to the actuator. In this case the cloud is the logic. Within the logic share, 1% of the total budget is allocated to the networks between the sensor and the cloud and between the cloud and the actuator. This 1% represents a failure rate of 1e-9/h for SIL 3. Typically, the actual network dangerous failure rate per hour is a function of the BER (bit error rate), the number of bits per packet, the number of packets per hour and the Hamming distance of the CRC used to detect errors in the message. I believe a previous blog has covered this topic.
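
The budget arithmetic can be written out as follows. The percentages are the ones quoted above; the message rate and the simple "budget divided by messages per hour" bound are my own simplifications, and a full analysis along the lines of IEC 61784-3 would also account for things like the number of receivers.

```python
SIL3_PFH_LIMIT = 1e-7  # dangerous failures per hour

# Shares of the total budget quoted in the text.
BUDGET_SHARES = {"sensor": 0.35, "logic": 0.15, "actuator": 0.50}
NETWORK_SHARE = 0.01  # 1% of the total, taken out of the logic share


def budget_per_hour(limit=SIL3_PFH_LIMIT):
    """Split the PFH limit into per-subsystem budgets."""
    budget = {name: share * limit for name, share in BUDGET_SHARES.items()}
    budget["network"] = NETWORK_SHARE * limit
    return budget


def max_residual_error_probability(messages_per_hour, network_budget_per_hour):
    """Largest undetected-error probability per message that still fits the budget."""
    return network_budget_per_hour / messages_per_hour


budget = budget_per_hour()
# Hypothetical traffic: 100 safety messages per second.
per_message = max_residual_error_probability(100 * 3600, budget["network"])
print(budget)
print(f"allowed undetected error probability per message: {per_message:.1e}")
```

Even at this modest message rate the allowed probability of an undetected corrupted message is tiny, which is why the choice of CRC matters so much.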

The 15% for the logic equates to a maximum dangerous failure rate per hour of 15 FIT (15e-9/h) for a SIL 3 safety function. It is assumed that the cloud companies use very reliable servers and have access to good failure rate information. Therefore, it should be possible to calculate a failure rate λ for the servers and use the generally accepted conservative assumption that the failures will be 50% safe and 50% dangerous to deduce λS and λD respectively. Thereafter online diagnostics, perhaps even a watchdog timer in the sensor or actuator block, could be used to derive a lower dangerous undetected failure rate λDU from λD based on the diagnostic coverage DC. If the calculated dangerous failure rate per hour is still too high, some sort of redundancy could be applied using a 1oo2 or 2oo3 architecture. I note that the Falcon series of rockets from SpaceX use standard servers in a redundant configuration to achieve high levels of reliability, and a similar approach could be used if the reliability of individual servers was not sufficient.
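
Here is a rough worked example of that chain of reasoning. The server failure rate, diagnostic coverage and beta factor are invented numbers, and the 1oo2 figure keeps only the common-cause term (beta times λDU), ignoring independent double failures, so it is optimistic compared with the full equations in IEC 61508-6.

```python
FIT = 1e-9  # one failure per 1e9 hours

LOGIC_BUDGET = 15 * FIT  # SIL 3 logic share from the text


def dangerous_undetected_rate(total_rate, diagnostic_coverage, dangerous_fraction=0.5):
    """lambda_DU = lambda * dangerous_fraction * (1 - DC)."""
    lambda_d = total_rate * dangerous_fraction
    return lambda_d * (1.0 - diagnostic_coverage)


def common_cause_rate_1oo2(lambda_du, beta):
    """Common-cause contribution of a 1oo2 pair; independent failures are ignored."""
    return beta * lambda_du


# Hypothetical server: 1000 FIT total, 90% diagnostic coverage.
single = dangerous_undetected_rate(1000 * FIT, diagnostic_coverage=0.90)
# Hypothetical 10% common-cause (beta) factor between two such servers.
redundant = common_cause_rate_1oo2(single, beta=0.10)

print(f"single server: {single / FIT:.0f} FIT, fits budget: {single < LOGIC_BUDGET}")
print(f"1oo2 pair:     {redundant / FIT:.0f} FIT, fits budget: {redundant < LOGIC_BUDGET}")
```

With these assumed figures a single server misses the 15 FIT budget while the redundant pair meets it, which is the kind of argument the paragraph above has in mind.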

Achieving a SIL requires certain levels of reliability and fault tolerance but also requires measures to protect against systematic (design) errors. Systematic errors can affect both hardware and software. One good protection against systematic errors is the use of diversity, and two diverse SIL 2 systems can be used to claim a SIL 3 level of systematic capability. For logic in the cloud the main concern would be the software, and there are many techniques described in the literature and standards for achieving high integrity software and for keeping high integrity software independent from lower integrity software running on the same CPU. While there are papers out there indicating that the Linux and Windows operating systems are safety certifiable, it may be best to run the software bare metal and implement a safety certified hypervisor or RTOS to get the required independence. It could in theory even be possible to use an FPGA in the cloud, and this is now offered by some cloud providers. In that case IEC 61508-2:2010 Annex F Table F.1 and Table F.2 would be relevant for the FPGA vendor and the person writing the application HDL for the FPGA respectively. Other solutions could include running high performance software up in the cloud and running a much simpler sanity checker at the edge which checks the conclusions of the cloud software against a safety limit. Strictly speaking this isn’t safety in the cloud, as the main guarantor of safety is then the simpler local safety system.
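
The sanity checker idea could look something like the sketch below. Everything here, from the command strings to the parameter names and limits, is hypothetical; the point is only that the edge logic is simple enough to analyze and always falls back to the safe state.

```python
def edge_decision(cloud_command, cloud_age_s, local_min_distance_mm,
                  *, stop_distance_mm, max_age_s):
    """Simple edge checker: only allow "RUN" when every local check agrees."""
    # No answer from the cloud, or an answer that is too old: go to the safe state.
    if cloud_command is None or cloud_age_s > max_age_s:
        return "STOP"
    # A crude local measurement overrides whatever the cloud concluded.
    if local_min_distance_mm < stop_distance_mm:
        return "STOP"
    # Anything other than an explicit "RUN" is treated as a stop request.
    if cloud_command != "RUN":
        return "STOP"
    return "RUN"


# The cloud says run, but the local sensor sees something inside the limit.
print(edge_decision("RUN", 0.02, 400.0, stop_distance_mm=500.0, max_age_s=0.1))
```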

A good reason to run software in the cloud might be to get access to AI/machine learning. Something which often surprises people is that there is a restriction in IEC 61508 on the use of AI in a safety function. Further, there appears to be very little appetite to change this situation. The restriction is found in IEC 61508-3:2010 Table A.2. The rationale for the restriction is not only a lack of knowledge on the topic but also the non-deterministic nature of AI. It is very hard to explain why an AI based system took a specific course of action, and there is little confidence that it would react the same way again with the same inputs. Both are anathema to safety engineers. Within automotive many companies see AI as the means to achieve autonomous driving, but with the use of AI and the large size of the code base, proving it is safe appears to rely on testing as opposed to the use of a rigorous development process based on a functional safety standard. It has often been acknowledged that using testing to prove software is correct is not possible, and papers are now appearing to show the amount of testing that would be required to prove with 90% confidence that an AI based safety system is at least as safe as a human driver. I believe the figure was 800 billion miles of real-world driving.

As I said in the introduction, there is very little out there on the use of new technologies such as the cloud to implement safety systems. However, given the possible benefits, this position must eventually change. Perhaps areas such as IIoT, IoT, smart factories, smart cities and smart power networks are moving so fast and are such target rich environments that there has been no time to consider safety, and it is left to traditional means for its implementation. Given that the safety standards only change on average every 10 years, the guidance in the standards is always going to be behind the curve. This for instance is the case in autonomous driving, where most of the big companies seem to be trying to proceed without ISO 26262. Whether they will succeed in convincing the public and the regulators of their safety remains to be seen. For industrial applications it may still be possible to apply IEC 61508 by interpreting the standard for the new area. Since IEC 61508 is a basic safety standard it is designed to work in this way, and to claim compliance you need to comply only with the applicable requirements. A goal-based standard, as opposed to a prescriptive one, gives more flexibility in how safety is achieved.

This week’s video is https://www.youtube.com/watch?v=eYpFKVQDle0 which shows some of the areas where Analog Devices are focusing their Industrial IoT efforts.

Some useful reading on topics raised above includes

  • Preliminary assessment of Linux for safety related systems
  • Software for dependable systems: Sufficient evidence?