Diagnostics: Are They Worth the Effort?

I was trying to decide on the topic for this month’s safety matters blog, so I went for a run, and by the end of it I had decided to write about diagnostics. The other contenders were two different book reviews and something on reliability predictions for integrated circuits.

While I will concentrate on the assumption of a high demand SIL 2 safety function with a required SFF (safe failure fraction) of 90%, I will also reference other standards such as ISO 26262 ASIL B and ISO 13849 DC = medium, which have similar requirements.

IEC 61508 has two ways to evaluate the diagnostic coverage.

DC = diagnostic coverage is the simplest and is the ratio of dangerous detected failures to the total number of dangerous failures, expressed as a percentage, i.e. DC = λDD/λD = λDD/(λDD + λDU).

SFF = safe failure fraction, a different metric which also includes the failures which take the system to the safe state => SFF = (λS + λDD)/(λS + λDD + λDU) = (λS + λDD)/(λS + λD)

In the above formulas “λ” stands for the failure rate, e.g. 10 FIT, “DD” = dangerous detected failures, “DU” = dangerous undetected failures, “D” = all dangerous failures (detected and undetected) and “S” stands for safe failures.

If you assume 50% of failures are safe and 50% dangerous then you can show that SFF = 50%+0.5*DC. This means that if your diagnostic coverage is 90% then a reasonable estimate for SFF is 95%. Alternatively, if you need an SFF of 90% then diagnostics giving an average coverage of 90% should suffice.
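As a quick sanity check on that arithmetic, here is a minimal Python sketch of both metrics. The 100 FIT element and its 50/45/5 FIT split are made-up numbers for illustration:

```python
# Sketch: check that SFF = 50% + 0.5*DC when the safe and dangerous
# failure rates are equal. The 100 FIT element and its 50/45/5 split
# are made-up numbers for illustration.

def dc(lam_dd, lam_du):
    """Diagnostic coverage: dangerous detected / all dangerous."""
    return lam_dd / (lam_dd + lam_du)

def sff(lam_s, lam_dd, lam_du):
    """Safe failure fraction: (safe + dangerous detected) / total."""
    return (lam_s + lam_dd) / (lam_s + lam_dd + lam_du)

lam_s, lam_dd, lam_du = 50.0, 45.0, 5.0   # FIT
print(dc(lam_dd, lam_du))          # 0.9  -> DC = 90%
print(sff(lam_s, lam_dd, lam_du))  # 0.95 -> SFF = 50% + 0.5*90%
```

Note that whenever some failures are safe, SFF comes out higher than DC, which is why the two metrics should not be used interchangeably.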

The first question is whether diagnostics are even required. Most people would answer yes. However, this is not strictly true. For a SIL 2 safety function the main requirement is a PFH (probability of dangerous failure per hour) of <1e-6/h, or 1000 FIT. If I do an analysis and show my dangerous failure rate is only 900 FIT, am I good to go?
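For illustration, converting that 900 FIT figure to a PFH and comparing it against the SIL 2 ceiling is a one-liner. This is a sketch only; as discussed next, passing the PFH target alone does not satisfy the architectural constraints:

```python
# Sketch: is a 900 FIT dangerous failure rate inside the SIL 2
# high-demand PFH target of 1e-6/h?

FIT = 1e-9  # 1 FIT = one failure per 1e9 device-hours

def pfh_from_fit(dangerous_fit):
    """Convert a dangerous failure rate in FIT to failures per hour."""
    return dangerous_fit * FIT

print(pfh_from_fit(900) < 1e-6)  # True: 9e-7/h is below the ceiling,
                                 # but architectural constraints remain
```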

One method to avoid the need for diagnostics is to follow route 2H, where the H stands for hardware. Route 2H is an alternative to route 1H, which uses SFF. However, even for route 2H you require a minimum DC of 60% for complex components and minimum levels of HFT (hardware fault tolerance). So, in practice, route 2H requires diagnostics unless you use very simple components.

Figure 1 - Extract from IEC 61508-2:2010 7.4.4 showing two alternative paths through the standard

At the IC and module levels where I generally work, route 1H, based on SFF (safe failure fraction) and HFT (hardware fault tolerance), is by far the most common option. The diagnostic requirements are set out in table 2 (simple components) and table 3 (complex components).

Figure 2 - HFT vs SFF trade-off for type B components developed following route 1H

As an example of how to use the tables, table 3 gives 3 options to meet the hardware requirements for SIL 2.

Option 1 - No redundancy and an SFF of at least 90%

Option 2 - an HFT of 1 and 60% SFF

Option 3 - no diagnostics required for SIL 2 if you have an HFT of 2, i.e. a triple redundant system

Obviously, triple redundancy comes with quite a penalty in cost and complexity, so it is not a great option, but I am sure there are cases where it suits to use standard elements/sub-systems with no diagnostics.

In theory, with route 1H, no diagnostics are required if you can show that enough of the failures take you to the safe state (trip the system). So, if 90% of the failures are safe and 10% dangerous (λS = 0.9λ, λD = 0.1λ) with λDD = 0, your SFF = 90% and you are done for SIL 2 with no diagnostics. I, however, have never worked on a system where this is true. IEC 61508 is a basic safety standard and so must cope with many different application domains, so perhaps there are domains where this might be common.

The automotive metric (SPFM in ISO 26262) is similar to SFF, but ISO 13849 uses DC (diagnostic coverage) as its metric, and therefore no matter if 90% of your failures are safe, you would still need to do something to detect 90% of the remaining dangerous failures. For me this exacerbates the problem with the SFF metric: if a system has an inherent reliability sufficient for the SIL, why would you complicate matters by adding diagnostics? The ISO 13849 metric effectively says that even if the system is already sufficiently safe (low enough PFHd), and even if most of the failures take you to a safe state (stopping the machine is a typical safe state for machinery), you still need to add diagnostics!

Moving along. Let’s suppose I have an element consisting of 10 components, all with an equal failure rate. Suppose I achieve an SFF of 100% for 9 of the components and 0% for the 10th component; then the SFF for my element is still 90%, and according to the standard I am done, since you will note from the title of table 3 above that it applies at the element or sub-system level and not at the component level. While compliant, it doesn’t look great.

What would a system with no diagnostics look like? Below I have shown it as a Markov model with two states. State 1 is labelled “OK” and is the state when the system is operating correctly or in the safe state and state “KO” is when the system has failed dangerously.

Figure 3 - Markov model of a single channel system with no diagnostics

Let’s suppose the above system has an MTTFd (mean time to dangerous failure) of 100 years. The PFHd is then 1.14e-6 (1/(100*8760)) which is in the SIL 1 range according to IEC 61508.
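The PFHd figure quoted above comes straight from the MTTFd, as this small sketch shows:

```python
# Sketch: PFHd of a single channel with no diagnostics, using the
# simple 1/MTTFd approximation from the text.

HOURS_PER_YEAR = 8760

def pfhd_no_diagnostics(mttfd_years):
    """Dangerous failure rate per hour for a given MTTFd in years."""
    return 1.0 / (mttfd_years * HOURS_PER_YEAR)

pfhd = pfhd_no_diagnostics(100)
print(round(pfhd * 1e6, 2))  # 1.14 -> 1.14e-6/h, in the SIL 1 band
```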

Figure 4 - Markov model of a single channel system with diagnostics

The Markov model above shows the addition of a diagnostics channel. The path from the “OK” to the “KO” state now has a probability of λDU instead of λD where the suffix “DU” stands for dangerous undetected. If the diagnostic coverage is 90% then that path has only a 10% probability compared to the case with no diagnostics which is a very obvious improvement.
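That factor-of-ten reduction is easy to see numerically. A minimal sketch, reusing the 100-year MTTFd from above:

```python
# Sketch: the OK -> KO transition rate drops from lambda_d to
# lambda_du = (1 - DC) * lambda_d once diagnostics are added.

HOURS_PER_YEAR = 8760

def lambda_du(lambda_d, coverage):
    """Dangerous undetected rate left over after diagnostics."""
    return lambda_d * (1.0 - coverage)

lam_d = 1.0 / (100 * HOURS_PER_YEAR)  # MTTFd = 100 years, as above
ratio = lambda_du(lam_d, 0.90) / lam_d
print(round(ratio, 3))  # 0.1 -> only 10% of the old path remains
```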

There is still a path into the “KO”, failed state, with probability λD, but you need to be first in the “Dfail”, diagnostics failed state, to take that path. This simple Markov model ignores common cause failures between the functional and test circuits.

So, let’s quantify the probabilities to see the bonus for diagnostics.

[Table: PFHd as a function of diagnostic coverage, for a safety function with an MTTFd of 100 years]
Note – I read these values from ISO 13849-1:2015 Annex K for the CAT 2 architecture, using the MTTFd = 100 years row.

From the table you can see that adding diagnostics can improve your SIL claim limit by 1 in terms of the PFHd (SIL 2 = 1e-7 to 1e-6, SIL 3 = 1e-8 to 1e-7). This is part of the reason why our friends in automotive have a latent fault metric to ensure that the diagnostics are still working. If they are good enough to raise your SIL claim limit (CL) by 1, then there should be some guarantee that they are working.

Ideas for diagnostics can be found in many sources. IEC 61508-2:2010 tables A.2 to A.14 recommend “techniques and measures for diagnostic tests and recommended maximum levels of diagnostic coverage that can be achieved using them.”. In these tables low means a DC of 60%, medium means 90% and high means 99%. So, for instance, if you have a watchdog timer which times out after X seconds if not kicked by the uC, then the maximum diagnostic coverage that can be claimed for that diagnostic is 60%, because while it has an independent clock (time-base) of its own, it only monitors on the high side. If it’s a windowed watchdog timer (see my recent blog here for suggestions on a windowed watchdog timer) you can claim up to 90% DC.
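To make the high-side vs windowed distinction concrete, here is a toy Python model of a windowed watchdog check. The class name and window bounds are my own illustrative assumptions, not anything from IEC 61508:

```python
# Toy model of a windowed watchdog: a plain watchdog only catches a
# kick that arrives too late (the high side); a windowed watchdog also
# trips on a kick that arrives too early. Window bounds are invented
# for illustration.

class WindowedWatchdog:
    def __init__(self, t_min, t_max):
        self.t_min = t_min        # earliest allowed kick interval (s)
        self.t_max = t_max        # latest allowed kick interval (s)
        self.last_kick = 0.0
        self.tripped = False

    def kick(self, now):
        delta = now - self.last_kick
        if delta < self.t_min or delta > self.t_max:
            self.tripped = True   # too early or too late: assert safe state
        self.last_kick = now

wd = WindowedWatchdog(t_min=0.8, t_max=1.2)
wd.kick(1.0)         # 1.0 s since start: inside the window
print(wd.tripped)    # False
wd.kick(1.3)         # only 0.3 s since the last kick: too early
print(wd.tripped)    # True
```

A plain, high-side-only watchdog would have accepted that second, too-early kick, which is one reason the claimable DC for it is lower.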

Figure 5 - suggested clock diagnostics from IEC 61508-2:2010

IEC 61508-2:2010 Annex C then gives the rules on evaluating the SFF as an average for an element but an actual example of how to do it using an FMEA is found in IEC 61508-6:2010 Annex C.

Figure 6 - Example FMEA for a PCB from IEC 61508-6:2010

This example assumes the failure modes of open circuit, short circuit, drift, and function for each of the components, and for each failure mode of each component ranks it as safe, dangerous or no effect (a 0 in the column). The failure modes each get 25% of the failure rate for that component, and while there is no column with the assumed FIT for the components, you can calculate it from the sum of the “λS”, “λDD” and “λDU” columns. This equal distribution of the FIT to the various failure modes is used where better information is not available. The DCcomp columns give separate claims for the diagnostic coverage of the safe and dangerous failures of each component, based on the diagnostics given in table C.2 of the example.
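The averaging that Annex C describes can be sketched in a few lines of Python. The component names, FIT splits and per-mode coverages below are invented for illustration, not taken from the standard’s example:

```python
# Sketch of the Annex C style averaging: each component's FIT is split
# over its failure modes, each mode is classed safe ("S") or dangerous
# ("D"), and per-mode diagnostic coverage converts dangerous failures
# into dangerous detected. All names and numbers below are invented.

modes = [
    # (failure mode, FIT share, class, DC for this mode)
    ("R1 open",  2.5, "S", 0.0),
    ("R1 short", 2.5, "D", 0.9),
    ("U1 drift", 5.0, "D", 0.6),
    ("U1 func",  5.0, "S", 0.0),
]

lam_s  = sum(fit for _, fit, cls, _ in modes if cls == "S")
lam_dd = sum(fit * c for _, fit, cls, c in modes if cls == "D")
lam_du = sum(fit * (1 - c) for _, fit, cls, c in modes if cls == "D")

sff = (lam_s + lam_dd) / (lam_s + lam_dd + lam_du)
print(round(sff, 3))  # 0.85 for this made-up element
```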

All in all, this is not a bad example, with liberal use of the 50% safe / 50% dangerous approximation for the failure modes, which I think I will make the subject of a future blog. If you buy a certified component such as the ADFS5758 DAC, the safety manual will give you the data you need and you can leave the columns up to “λS” blank. I have tried to describe this below using a different FME(D)A format where the IC has only 4 on-chip blocks and the FIT of the IC is allocated to the on-chip blocks based on the percentage area of each block.

Figure 7 using data from a safety manual to populate a system level FME(D)A

The above “toy” FMEA doesn’t break out the failure modes for each of the on-chip blocks. Typically, this will make it hard to say whether a given diagnostic will detect the failures or not, so while it looks like a saving in analysis time, it probably will not be. Often, the more granular the failure modes, the easier it is to say whether a given failure mode is safe or dangerous. In theory you could analyze all the way down to the transistor level and use a set of MOSFET failure modes such as those below. However, those detailed failure modes typically bubble up to a finite and common set of top-level failure modes, so generally a very low level of analysis is not warranted. Typically, it is sufficient to analyze to the level of the hierarchy at which the diagnostic operates. What a detailed analysis might give you is a better failure mode distribution, but is that effort well spent? This is a question between you, your engineering judgement, and your independent assessor.

Figure 8 - Possible set of failure modes for a MOSFET based on ISO 13849-2:2012 table D.18

Similar tables to those in Annex A of IEC 61508 are found in ISO 26262-5:2018, and the diagnostic coverage figures from ISO 26262 are reasonably consistent with those in IEC 61508. Table E.1 of ISO 26262-5:2018 shows a different FMEDA format but with similar calculations. Perhaps a future blog will explain how to convert the ISO 26262 metrics into IEC 61508 metrics and vice versa. The official IEC guide on how to do an FMEA is IEC 60812:2018.

Depending on the claimed diagnostic coverage and the target SIL you may need to do fault injection/insertion to confirm the diagnostic coverage and I hate to say it but that might be for a future safety matters blog.

A different aspect of diagnostics is the required diagnostic test interval. For a high demand safety function with an HFT (hardware fault tolerance) of 0, you typically need a diagnostic test rate of 100x the demand rate, which is quite onerous. Calculations in ISO 13849 show about a 10% degradation in the calculated PFHd if this is reduced to 25x. A big advantage of a two-channel system is that the required diagnostic test rate is dramatically reduced, and perhaps a diagnostic test interval of once/month or even once/year might be sufficient, depending on the SIL.
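As a toy illustration of the 100x rule of thumb (the once-per-hour demand rate is an assumed example, not a value from the standards):

```python
# Sketch: the 100x rule of thumb turned into a test interval.
# The once-per-hour demand rate is an assumed example.

def max_test_interval_hours(demand_interval_hours, ratio=100):
    """Test often enough to run `ratio` diagnostic tests per demand."""
    return demand_interval_hours / ratio

interval_h = max_test_interval_hours(1.0)  # demand roughly once per hour
print(interval_h * 3600)  # 36.0 -> a diagnostic test every 36 seconds
```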

Topics I didn’t cover in today’s blog, but which I have listed briefly below, include:

  1. Whether the test channel implementing the diagnostics should just alert on detection of a failure or have its own means to assert a safe state (an SMOD – secondary means of disconnection)
  2. Whether a failure of a diagnostic is a dangerous failure of the safety function
  3. How some types of diagnostics, when they fail, can simply fail to detect a failure of the main functional channel, while others, such as ECC, can themselves cause a failure of the main channel, e.g. ECC “correcting” a non-existent error
  4. Accuracy requirements of diagnostics – need to guarantee whatever safety accuracy you claim
  5.  Importance of diagnostics for systematic failure modes including failure modes of software
  6. Whether the diagnostics themselves need to be developed to a specific systematic integrity
  7. The importance of diagnostics even for non-safety applications
  8. Prognostics vs diagnostics
  9. What if you have overlapping diagnostics
  10. Do you need diagnostics for just random hardware failures or to control systematic failures including software failures

Some of the above might make good blogs in their own right.

Not much of the content in this blog matches the title. The question as to whether diagnostics are worth it or not is a moot point for functional safety: the standards say you must have them, so in that sense they are worth it. It does seem wrong that if your PFHd is already low enough without diagnostics, you still must add them to meet the requirements of the standard. This seems to be an issue in all the functional safety standards. Until next month’s blog, bye.

If you like Markov models I did a blog dedicated to Markov models some time ago, see here.