Functional safety has requirements to cope with both random failures and systematic failures. Software has only systematic failures: it doesn’t fail at random, and if the same circumstances arise a software fault will usually cause the system to fail in the same way every time. One way to reach higher levels of safety is to implement a two-channel system with diverse software in each channel. Redundant channels with identical software would have the software as a single point of failure. If the two channels have diverse software, the argument goes, they are unlikely to both fail in the same way at the same time, which allows for a higher SIL claim. Sounds great, but is there a catch? Let’s look a bit deeper, or at least as deep as you can go in a blog.
First let’s look at what guidance is given in IEC 61508, then look at what guidance is available in the literature and some design patterns based on diversity.
In IEC 61508-3:2010 this is covered by the sub-clauses below.
Figure 1 - a relevant excerpt from IEC 61508-3
IEC 61508-2:2010 sub-clause 7.4.3 says that the SC (systematic capability) can be increased by at most one level. So, for instance, if both pieces of software had a SIL claim of SIL 1 then the combination would be SIL 2 at most. I imagine that the restriction to an increase of at most one level is in place because the person combining the two items doesn’t know the details of the individual developments, and there could be some hidden common cause failures such as the tools used. If developing both pieces of software from scratch you may be able to do better.
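To make the "at most one level" rule concrete, here is a minimal sketch. It covers only the simple case from the example above, where both elements have the same systematic capability, and it assumes that sufficient independence between them has already been argued elsewhere. The function name and parameters are my own, not anything taken from the standard.

```python
def combined_systematic_capability(sc_each: int, sufficiently_independent: bool) -> int:
    """Systematic capability claimable when two elements of equal SC are combined.

    Assumption: both elements have the same SC (1..4) and the independence
    argument is made outside this function. The increase is capped at one level.
    """
    if sc_each not in (1, 2, 3, 4):
        raise ValueError("systematic capability must be between 1 and 4")
    if not sufficiently_independent:
        # No credit without independence: the combination stays at SC N.
        return sc_each
    # At most one level higher, and never above SC 4.
    return min(sc_each + 1, 4)


print(combined_systematic_capability(1, True))   # two SC 1 elements, independent
print(combined_systematic_capability(1, False))  # same elements, no independence
```

The point of the sketch is that the credit hinges entirely on the independence flag, which in a real project is the hard part to justify.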
The tables in IEC 61508-3 Annex A give some guidance. Table A.2 offers four alternative versions of a diverse architecture, and Table A.10 asks for a CCF (common cause failure) analysis of the software, a measure that is recommended at SIL 2 and highly recommended for the higher SILs.
Figure 2 - an excerpt from IEC 61508-3:2010
But how hard is it to develop diverse software? Philip Koopman has a nice comment on the topic in section 26.3.3 of his excellent book, “Better Embedded System Software”. He states that it is really difficult to implement truly diverse software but easy to get some level of diversity. He also states that it is difficult to quantify the diversity that is achieved, which isn’t surprising given that hardware CCF analysis, which has far more guidance in the standard, is still more engineering judgement than science. Koopman further warns that “many people (including us), think that if you have limited time and resources, you’re better off making one really good version of software than attempting to make two independent versions which are not nearly as good on their own. There will likely be too many bugs that are the same in both versions.”
I looked to see if I had any research to support this view. The most interesting study on the topic that I have seen is the one shown below, in which the authors gave a specification to 27 students, asked them to write software to implement it, and then checked how many of the diverse pieces of software failed in the same way. It does back up the view that it is really hard to write diverse software.
Figure 3 - interesting experimental paper on the value of diverse software
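The kind of check such an experiment makes can be sketched in a few lines: compare how often two versions fail on the same input with how often they would do so if their failures were truly independent. The data below is entirely hypothetical and is not taken from the paper; the function name is also my own.

```python
from typing import Sequence, Tuple


def coincident_failure_rates(
    fails_a: Sequence[bool], fails_b: Sequence[bool]
) -> Tuple[float, float]:
    """Return (observed, expected_if_independent) coincident failure rates."""
    if len(fails_a) != len(fails_b) or len(fails_a) == 0:
        raise ValueError("need two equally long, non-empty result lists")
    n = len(fails_a)
    p_a = sum(fails_a) / n  # failure rate of version A on its own
    p_b = sum(fails_b) / n  # failure rate of version B on its own
    observed = sum(1 for a, b in zip(fails_a, fails_b) if a and b) / n
    return observed, p_a * p_b


# Hypothetical results for 10 test inputs (True = the version failed on that input).
version_a_fails = [False, False, True, False, False, False, False, True, False, False]
version_b_fails = [False, False, True, False, False, False, False, True, False, True]

observed, expected = coincident_failure_rates(version_a_fails, version_b_fails)
print(f"observed {observed:.2f}, expected if independent {expected:.2f}")
```

If the observed rate is well above the independent estimate, as in this made-up data, the two versions tend to trip over the same inputs, which is exactly what undermines a diversity claim.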
Then there are the HSE figures, which show very few bugs being introduced in the coding phase (design and implementation). This suggests that unless you have diversity in the specification you don’t get a lot of benefit.
Figure 4 - why systems fail from the HSE
The team developing the fly by wire software for the Boeing 777 appear to have taken this on board with three different pieces of software developed to three different specifications, using three different development teams who are not supposed to talk to one another, running on three different (diverse) computers controlling the plane. A voter was then used to select the course of action when one of the outputs did not agree with the others. See the paper “Design consideration in Boeing 777 fly-by-wire computers” for more information.
The shuttle used a somewhat similar architecture using five computers, four identical and one diverse. The software on the diverse microcomputer was also diverse. More on this can be found in the paper, computer architecture for the shuttle.
One design pattern for functional safety software based on diversity is N-version programming, which uses multiple, different versions of the code, all developed to the same set of requirements, with voting on their outputs.
Figure 5 - drawing of N-Version programming pattern (Design patterns for safety-critical embedded systems)
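As a rough illustration of the N-version pattern, here is a minimal sketch with three deliberately different implementations of the same trivial requirement (squaring an integer) and a majority voter. All names are hypothetical, and a real voter would also need to handle timing, tolerances on analogue values, and its own failure modes.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def version_a(x: int) -> int:
    return x * x


def version_b(x: int) -> int:
    return x ** 2


def version_c(x: int) -> int:
    # Diverse algorithm: the sum of the first |x| odd numbers equals x squared.
    return sum(2 * i + 1 for i in range(abs(x)))


def faulty_version(x: int) -> int:
    # Stand-in for a version with a bug, used to show a fault being outvoted.
    return x * x + 1


def majority_vote(outputs: Sequence[int]) -> Optional[int]:
    """Return the value produced by a strict majority, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def run_n_version(versions: Sequence[Callable[[int], int]], x: int) -> Optional[int]:
    """Run every version on the same input and vote on the results."""
    return majority_vote([version(x) for version in versions])


print(run_n_version([version_a, version_b, version_c], 5))       # all agree
print(run_n_version([version_a, faulty_version, version_c], 3))  # fault outvoted
```

A return value of None means no majority exists, and the caller would have to drive the system to its safe state. Note that the voter only helps when the versions fail differently; identical bugs in two versions would win the vote.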
If we view the N-version pattern as a reliability block diagram, the voter is the obvious weak point. It sits in series with the redundant versions and is a source of CCF, so unless the voter is ultra-reliable the benefit to be gained from a high value of N will be limited.
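The reliability block diagram view can be put into numbers with a small sketch. It assumes the versions fail independently of one another, which is optimistic for software as discussed earlier, and it treats the voter as a single block in series with the majority-voted versions. The function name and the example figures are hypothetical.

```python
from math import comb


def majority_voted_reliability(n: int, r_version: float, r_voter: float) -> float:
    """Reliability of n versions with majority voting, in series with one voter.

    Assumption: versions fail independently and all share reliability r_version.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    needed = n // 2 + 1  # smallest number of correct versions that forms a majority
    r_majority = sum(
        comb(n, i) * r_version**i * (1.0 - r_version) ** (n - i)
        for i in range(needed, n + 1)
    )
    # The series voter multiplies the whole result, so it caps the system reliability.
    return r_voter * r_majority


for n in (1, 3, 5, 7):
    print(n, round(majority_voted_reliability(n, r_version=0.9, r_voter=0.99), 5))
```

However large N becomes, the result can never exceed the reliability of the voter itself, and any common cause failure between the versions would make the real figure worse than this independent estimate.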
Let's compare the diverse software approach to some alternatives. A dual-core lockstep microcontroller does not implement software diversity; it is a hardware safety mechanism, as both cores run the same software. Software lockstep, also known as software RMT (redundant multithreading), can by contrast implement software diversity, but it will take longer to detect discrepancies than a clock cycle by cycle lockstep approach. Software lockstep can run on different processors, or even on redundant threads of a single processor, with the outputs compared at selected watch points.
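The watch point idea can be sketched as follows. Two channels compute a running total using different algorithms and publish an intermediate result at each watch point; a comparator checks each pair and stops at the first disagreement. This is only an illustration with hypothetical names, run sequentially in one process, whereas a real implementation would put the channels on separate threads or processors.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, List


class ChannelMismatch(Exception):
    """Raised when the two channels disagree at a watch point."""


def channel_a(samples: Iterable[int]) -> Iterator[int]:
    # Running total kept incrementally.
    total = 0
    for sample in samples:
        total += sample
        yield total  # watch point


def channel_b(samples: Iterable[int]) -> Iterator[int]:
    # Diverse algorithm: recompute the total from all samples seen so far.
    seen: List[int] = []
    for sample in samples:
        seen.append(sample)
        yield sum(seen)  # watch point


def run_in_software_lockstep(chan_a: Iterator[int], chan_b: Iterator[int]) -> List[int]:
    """Compare both channels at every watch point and return the agreed values."""
    missing = object()  # marks a channel that produced fewer watch points
    agreed: List[int] = []
    pairs = zip_longest(chan_a, chan_b, fillvalue=missing)
    for step, (a, b) in enumerate(pairs):
        if a is missing or b is missing or a != b:
            raise ChannelMismatch(f"channels disagree at watch point {step}")
        agreed.append(a)
    return agreed


samples = [1, 2, 3]
print(run_in_software_lockstep(channel_a(samples), channel_b(samples)))
```

The detection latency here is one watch point interval rather than one clock cycle, which is the trade-off described above. On a mismatch the caller would be expected to move to the safe state.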
Even if you implement diverse software, what about the tools used to produce the software? These could also be a source of common cause failures. However, if this is considered in a CCF analysis and different tools are chosen, or the tools are chosen to meet the SIL requirement for the overall safety function, or you use tools suitable for the SIL of the combined elements, you are probably good to go. Revision 3 of IEC 61508 in 202X will give more guidance on software offline tool requirements, taking the SIL of the overall safety function into account.