Failure of a system can result from deliberate or non-deliberate means. The deliberate means are handled by our cyber security colleagues; functional safety is generally more concerned with failures caused by non-deliberate random and systematic failure modes. This leads to the question: is there any such thing as a random failure, or are all failures systematic if you look hard enough?
Figure 1 - failure types
Systematic failure modes are effectively built into the system and represent a weakness in all products of a particular type. Systematic failure modes can be caused by things like...
None of these are random, but that is not the question. The question, once again, is: are there any random failures? Let's start by looking at the definition of a systematic failure, just to make sure we have that clear in our minds.
Figure 2 - definition of systematic failures from IEC 61508-4:2010
IEC 61508 and other functional safety standards offer a full safety lifecycle to try to prevent the introduction of systematic errors. When you meet all the measures required by IEC 61508 to address systematic errors for a given SIL, you have a certain systematic capability in the range SC1 to SC4. Systematic capability is a key feature of design standards such as IEC 61508 that is not as well addressed in older or application-specific standards such as ISO 13849.
In contrast, random failures are meant to be just that: random. The system or IC has been properly designed and all suitable measures taken, but there is still a small probability that the IC or system will fail. In reliability theory, the reliability of an item is often shown with a bathtub curve. At the start of life, the failure rate is higher due to things like manufacturing defects, and at the end of life you enter the wear-out phase. In between is the useful life phase; while I won't go into it today, all kinds of useful maths can be done during this phase, and it is used for reliability predictions during a functional safety analysis (Chapter 2 of the Smith book below gives a particularly readable explanation of the issues). Choosing components not suitable for your mission profile (a matrix of operating temperatures and the time spent at each) such that components in your system enter their wear-out phase would be another good example of a systematic failure, so we can rule out that portion of the random failures.
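As a taste of the useful-life maths: during the flat portion of the bathtub curve the failure rate λ is assumed constant, so reliability follows the exponential law R(t) = exp(−λt). The sketch below works through that formula; the 50 FIT failure rate and 20-year mission time are illustrative assumptions of mine, not figures for any real device.

```python
import math

# During the useful-life (flat) portion of the bathtub curve the failure
# rate lambda is assumed constant, so reliability follows the exponential
# law R(t) = exp(-lambda * t).  All numbers below are illustrative only.

FIT = 1e-9                    # 1 FIT = 1 failure per 1e9 device-hours
lam = 50 * FIT                # assumed constant failure rate: 50 FIT
mission_hours = 20 * 365 * 24 # an assumed 20-year mission time

reliability = math.exp(-lam * mission_hours)  # probability of surviving
prob_failure = 1 - reliability                # probability of failing

print(f"R(t) over 20 years = {reliability:.6f}")
print(f"F(t) over 20 years = {prob_failure:.6f}")
```

Even at a modest 50 FIT, around 0.9 % of a large population would be expected to fail over 20 years, which is why diagnostics and redundancy are still needed for the "truly random" portion.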
That leaves the early life and useful life random failures. The early life failures can be addressed using techniques such as burn-in or various screening tests, and you could argue that not having a suitable screening test is a systematic failure mode (bad manufacturing). You could also argue that for an IC the wafer fab should have a zero-defects policy to find and eliminate the root causes of those early failures, and that a failure to do so is a systematic failure mode. In addition, IEC 61508-2:2010 Annex F table F.1 looks for "Application of proven in use process technology" and "proven in use manufacturing process" to give time for any weaknesses in the fab process to be tracked down and eliminated, with additional design rules added to find suspect structures. A failure to follow the procedures from IEC 61508-2:2010 Annex F represents a systematic failure mode.
This then leaves the random failures from the useful life portion of the curve. Are these useful life failures really random?
Figure 3 - Bathtub curve showing period of useful life
IEC 61508-4:2010 contains a definition of a random hardware failure, as shown below.
Figure 4- random hardware failure
That's the introduction, and we finally get to the topic of the blog: are all random failures really systematic?
There is a reference to an Exida blog given below, and it contains the warning that "it's good policy to categorize all field failures as random until proven otherwise". For safety you must be conservative, and that is considered the conservative thing to do. I presume this is on the basis that you are then forced to implement diagnostics or redundancy to address the weaknesses, but I'm not sure of the reasoning behind the statement. Why should there be a bias towards assuming a failure is random? Surely systematic failures are even more dangerous, since they mean the entire population of devices is susceptible to failure.
I work with semiconductors, so I will continue the discussion in terms of semiconductors, but I imagine the same arguments apply to any hardware (it is assumed software has no random failures and is therefore not relevant to this discussion). ADI (Analog Devices) has a zero-defect target and does root-cause failure analysis of any returned failing devices. If the "so-called random failures" on an IC are reviewed and analysed, you might find that if the track width on the IC had been wider, or the via thicker, or the metal denser at that location on the IC, the failure would not have occurred. This all sounds like a systematic failure. If the part failed due to EOS (electrical overstress), then you can argue that that is either a systematic failure in the documentation provided by ADI to our customers or a systematic error because the customer used the part outside of its specified operating region.
Figure 5 - Definition of reliability from Electropedia
What about measurement noise: is that random or systematic? For instance, if you have a 3D TOF camera and the measurement noise is given by a sigma of 10 mm rms, how do you allow for a measurement outside of the stated accuracy? Standards such as IEC 61496 say you must allow 5 sigma for noise. You can crunch the numbers and calculate how often an object right on that 5-sigma limit would be measured as outside the protected area due to noise. At first glance this is purely random, but predictable. However, you could have chosen a different sensor with 5 mm rms measurement noise, so the problem could have been eliminated by a different design choice, which is almost the definition of a systematic error! (See my earlier blog here for more discussion on measurement noise.)
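Crunching those numbers is a one-liner: the one-sided Gaussian tail probability gives the chance that a single measurement errs by more than k sigma in one direction. The sketch below uses the 10 mm rms figure from the example above; the 5-sigma tail works out to roughly 3 excursions per 10 million measurements.

```python
import math

def tail_prob(k: float) -> float:
    """One-sided probability that Gaussian noise exceeds k standard deviations."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))

sigma_mm = 10.0            # sensor noise from the example: 10 mm rms
margin_mm = 5 * sigma_mm   # the 5-sigma allowance, i.e. 50 mm

p = tail_prob(5.0)
print(f"Margin applied: {margin_mm:.0f} mm")
print(f"P(error beyond 5 sigma, one side) per measurement = {p:.2e}")
```

At a typical camera frame rate you can multiply that per-measurement probability by the measurement rate to estimate how often an object sitting exactly on the limit would spuriously appear outside the protected area.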
Below is an excerpt from an unknown standard. My notes say it is IEC 63131-1, but I don't have a copy of IEC 63131-1, so that seems unlikely, and when I searched the various standards I do have with a one-digit difference in the number I couldn't find it either. Anyway, the standard has an annex given over to the topic of "Differentiation between systematic failure and random hardware failure". It has two tables of examples in which it classifies failures as either systematic or random, "by phenomenon" and "by cause".
From my previous arguments I can’t agree that aging, wear-out or fatigue are random failures. I also can’t see why communication errors are described as systematic but leakage current failures as random.
Figure 6 - an excerpt from an unknown standard
Looking for more guidance in standards, I found ISO 12489, which offers the following advice.
Could it be that random failures are the failures you can't reproduce? Or are random failures those which would require an unpalatable design change, e.g. a move to a different but more expensive fab process?
What about this definition: random failures are the failures you get despite designing it correctly. No, that doesn't work either, because not designing it correctly could indicate a lack of competence, which is a systematic failure mode.
In the end here is my advice...
If you have sufficient hardware safety integrity and systematic capability, your product can then be said to be suitable for use in a safety system up to SIL 3, or whatever level you have achieved, in terms of the random and systematic failure modes.
Anyway, I hope you have enjoyed my musings on what is a difficult topic. Even if the boundaries between random and systematic are not clear, the whole argument may be a bit academic, since IEC 61508 has techniques and measures for dealing with both. In general, IEC 61508 is applied by engineers and not academics.
To learn more about the reliability of integrated circuits you can see the Analog Devices reliability handbook at https://www.analog.com/media/en/technical-documentation/user-guides/UG-311.pdf
To get a reliability prediction for any released Analog Devices integrated circuit, see http://www.analog.com/ReliabilityData. These predictions are based on accelerated testing and can be modified for any average operating temperature. While the confidence levels of 60% and 90% don't line up perfectly with functional safety requirements, the book below or the reliability link above will give you details of how the figures can be modified for any confidence level you like.
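As a sketch of how such a confidence-level adjustment works: a standard way to put a one-sided upper bound on a constant failure rate from life-test data is the chi-squared method, λ_upper = χ²(conf, 2r + 2) / (2T) for r observed failures in T device-hours. The code below uses the Wilson-Hilferty approximation to the chi-squared quantile (accurate to around 1% here) so it needs only the standard library; the test data of zero failures in 10⁷ device-hours is invented for illustration, and real figures should come from the reliability data linked above.

```python
from statistics import NormalDist

def chi2_quantile(p: float, k: float) -> float:
    """Wilson-Hilferty approximation to the chi-squared quantile (k dof)."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * (2 / (9 * k)) ** 0.5) ** 3

def fit_upper_bound(failures: int, device_hours: float, confidence: float) -> float:
    """One-sided upper confidence bound on the failure rate, in FIT."""
    dof = 2 * failures + 2
    lam = chi2_quantile(confidence, dof) / (2 * device_hours)
    return lam * 1e9  # convert failures/hour to FIT

# Illustrative numbers only (not real test data):
# zero failures observed in 1e7 accelerated-equivalent device-hours.
for conf in (0.60, 0.90, 0.95):
    print(f"{conf:.0%} confidence: {fit_upper_bound(0, 1e7, conf):.1f} FIT")
```

Note how the same raw data gives a failure-rate estimate roughly 2.5 times higher at 90% confidence than at 60%, which is why the confidence level quoted alongside a FIT number matters for a functional safety calculation.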
Other interesting reads related to this topic include:
“Reliability, Maintainability and Risk” by Dr. David J. Smith, see Chapter 2.
Two good Exida blogs on the topic are available here and here
This one from I&E systems Pty Ltd is also relevant
Note: a planned future blog will be “Are some systematic failures actually random?”; note that it says "some" and not "all".