Given that I work for an IC manufacturer, I was surprised that I hadn’t already done a blog on reliability predictions for integrated circuits. Good reliability is one of the 3 pillars of functional safety, so reliability predictions are very important, if for nothing else than to allow a comparison between different architectures. In addition, IEC 61508 has mandatory values for the probability of dangerous failure per hour, and to meet these you need reliability predictions. However, before someone comments, it is possible to design a safe system from unreliable components by using redundancy and diagnostics, and I touched on this in a recent blog. It is also possible to design an unsafe system using reliable components, but I digress. Nevertheless, as I say, reliability, along with HFT (hardware fault tolerance) and SFF (safe failure fraction), and taking measures to prevent the introduction of design errors, are the pillars of functional safety.
Figure 1 - A bathtub curve from the Analog Devices reliability handbook
The failure rate of many items follows a shape often called a bathtub curve, as shown above. When an item is first powered on the failure rate can be high, and once the lifetime of the item is exceeded the failure rate is once again high, but in the middle of its life the failure rate settles at a steady value, and this is the number normally used for reliability predictions. The elevated period at the start might be 48 hours or less, and Analog Devices does ELF (early life failure) testing to measure this failure rate and to debug its manufacturing process to eliminate such failures. The wear-out phase for a semiconductor typically begins after 20 years or more, but this depends on the mission profile (the amount of time spent at different temperatures). HTOL (high temperature operating life) testing is carried out to ensure this lifetime, along with the use of DRC (design rule checkers) and other tools.
The easiest way to get a reliability prediction for any Analog Devices IC is to go to www.analog.com/ReliabilityData and enter your IC details, which will give you a report such as the one below. You can also get a reliability number for a given process node, e.g., 0.18 µm (which is still important for analog ICs).
A criticism of this data is that it only includes random hardware failures and does not include failures due to systematic causes. This criticism is valid because if ADI finds a systematic failure during HTOL testing, then it is fixed and the failure source eliminated. Therefore, data based on HTOL will not include failures due to misapplication of the device by our customers, failures due to misread or inaccurate datasheets, failures due to not correctly protecting the device from electrical overstress, etc. However, I am fine with this, as I believe we should only be using data containing the random hardware failure rate, because random and systematic failures are handled differently in functional safety standards such as IEC 61508.
Figure 2 - Reliability report from www.analog.com/ReliabilityData
The reliability data is quoted in FIT (failures in time), which is the number of failures to be expected in an operating time of 1 billion hours. This unit is used when something has a very high reliability, as it gives easy-to-interpret numbers such as 10 FIT or 100 FIT as opposed to 1e-8/h and 1e-7/h. A FIT of 1 does not mean that the device has a lifetime of 1 billion hours; rather, if you have 1 million devices running for 1000 hours each, you can expect one failure due to random hardware failures.
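As a minimal sketch of the FIT arithmetic, the Python below converts a FIT value to a failure rate per hour and estimates the expected number of failures in a population. The function names are made up for illustration and are not part of any ADI tool.

```python
# 1 FIT corresponds to 1 failure per 1e9 device-hours.
DEVICE_HOURS_PER_FIT = 1e9


def fit_to_rate_per_hour(fit):
    """Convert a FIT value to a constant failure rate in failures per hour."""
    return fit / DEVICE_HOURS_PER_FIT


def expected_failures(fit, devices, hours_each):
    """Expected random hardware failures for a population of devices,
    assuming the constant failure rate of the flat part of the bathtub curve."""
    return fit_to_rate_per_hour(fit) * devices * hours_each


if __name__ == "__main__":
    # 10 FIT and 100 FIT expressed per hour
    print(fit_to_rate_per_hour(10), fit_to_rate_per_hour(100))
    # 1 million devices running for 1000 hours each at 1 FIT
    print(expected_failures(1, 1_000_000, 1000))
```

The second print reproduces the 1 million devices for 1000 hours example: the population accumulates 1 billion device-hours, so about one failure is expected at 1 FIT.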
The FIT is actually given at both a 60% and a 90% confidence level. Functional safety standards often look for a figure at a 70% or other confidence level, and I previously did a blog on how to convert from one confidence level to another, see here. The tool only allows you to enter an average operating temperature, and while this is generally acceptable for industrial applications, our automotive colleagues use more elaborate mission profiles to reflect the higher failure rates at higher temperatures. In these cases, the above predictions can be fed into an Excel spreadsheet that uses the Arrhenius equation to calculate the reliability number for an arbitrary mission profile.
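The mission profile calculation can be sketched in Python rather than Excel. This is only an illustration under stated assumptions: the activation energy of 0.7 eV, the 55°C reference temperature, the 10 FIT starting value and the profile itself are all hypothetical values chosen for the example, and the real activation energy depends on the process and failure mechanism. The sketch scales the FIT quoted at the reference temperature to each temperature in the profile using the Arrhenius equation and then takes the time-weighted average.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_factor(temp_c, ref_temp_c, activation_energy_ev):
    """Failure-rate multiplier at temp_c relative to ref_temp_c (Arrhenius)."""
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / ref_k - 1.0 / temp_k)
    return math.exp(exponent)


def mission_profile_fit(fit_ref, ref_temp_c, profile, activation_energy_ev=0.7):
    """Time-weighted average FIT over a mission profile.

    profile is a list of (hours, temperature_in_celsius) tuples.
    activation_energy_ev defaults to 0.7 eV, which is an assumption.
    """
    total_hours = sum(hours for hours, _ in profile)
    if total_hours <= 0:
        raise ValueError("mission profile must contain a positive number of hours")
    weighted = sum(
        hours * fit_ref * arrhenius_factor(temp_c, ref_temp_c, activation_energy_ev)
        for hours, temp_c in profile
    )
    return weighted / total_hours


if __name__ == "__main__":
    # Hypothetical profile: (hours, temperature in Celsius)
    profile = [(500, -20.0), (1500, 25.0), (5000, 60.0), (1000, 85.0)]
    # Hypothetical prediction of 10 FIT quoted at an average of 55 Celsius
    print(mission_profile_fit(10.0, 55.0, profile))
```

Because the Arrhenius factor grows exponentially with temperature, the hours spent at the hottest temperatures tend to dominate the result even when they are a small part of the profile.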
If the product is a multi-die solution, such as is often used in our digital isolators (an opto-coupler replacement), then the prediction will be given for each die in the package.
The data in the table is based on HTOL (high temperature operating life) testing, which is a form of accelerated testing done in a burn-in oven. Furthermore, it includes data not only for the part itself but typically for every part submitted to HTOL testing by ADI on that process node. A list of those parts and the test temperature (typically 125°C or 150°C to get an acceleration factor) is also given. The use of such substitution data is well accepted. If you are on a process node that is new for ADI, then there won’t be many previously completed quals from which to gather data, and so the FIT quoted at the 60% confidence level can be very high. As more and more parts complete HTOL testing on the process node, the confidence grows and the quoted FIT will fall, provided no failures are found of course.
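To illustrate why the quoted FIT falls as test hours accumulate, here is a minimal sketch of the widely used chi-square upper bound for the zero-failure case. I am assuming a constant failure rate and zero observed failures, in which case the chi-square quantile with 2 degrees of freedom reduces to -2·ln(1 - confidence). Whether the ADI tool calculates its figures in exactly this way is an assumption on my part, and the device-hour figures below are hypothetical.

```python
import math


def zero_failure_fit(equivalent_device_hours, confidence):
    """Upper-bound FIT at a one-sided confidence level with zero failures.

    equivalent_device_hours = devices x test hours x acceleration factor.
    Uses lambda = -ln(1 - confidence) / T, the zero-failure special case
    of the chi-square bound for an exponential lifetime model.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if equivalent_device_hours <= 0:
        raise ValueError("equivalent_device_hours must be positive")
    rate_per_hour = -math.log(1.0 - confidence) / equivalent_device_hours
    return rate_per_hour * 1e9


if __name__ == "__main__":
    # Hypothetical equivalent device-hours for a new and a mature process node
    for hours in (1e7, 1e8, 1e9):
        print(hours, zero_failure_fit(hours, 0.60), zero_failure_fit(hours, 0.90))
```

With zero failures the bound is inversely proportional to the accumulated device-hours, so ten times more test data gives a quoted FIT ten times lower. Under the same assumptions the 90% figure is about 2.5 times the 60% figure.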
If you don’t want to use the ADI data, then a common source of reliability predictions is IEC 62380. The formula to predict die reliability using IEC 62380 is shown below. Using the formula can be difficult for our customers because they might not know things like the transistor count.
Figure 3 - equation for a reliability prediction according to IEC 62380
The equation is not as bad as it looks and can be encapsulated in a spreadsheet, with a sample from such a spreadsheet shown below.
Figure 4 - a snapshot from a reliability prediction according to IEC 62380
IEC 62380 allows a much more granular mission profile to be entered. In the example above the total lifetime is 7920 hours (less than 1 year, which is typical for automotive) and includes 480 hours at -20°C, 1600 hours at 23°C, 5200 hours at 60°C… IEC 62380 requires the package reliability to be calculated separately. Unfortunately, IEC 62380 has been obsoleted, but our colleagues on the automotive ISO 26262 committee obviously love it as much as I do, because they took the relevant text for an IC (integrated circuit) reliability prediction and pasted it directly into ISO 26262-11:2018 to keep this calculation available. I think the next revision of IEC 61508 should take a similar approach.
The next most popular reliability prediction method that I see is SN29500. This is a standard you can buy from Siemens, but I don’t have the exact source handy, having bought my copy many years ago. SN29500 and IEC 62380 are based on industry feedback and include a mix of random hardware failures and failures due to systematic failure modes, including misapplication of the ICs leading to EOS (electrical overstress), etc. Personally, I think this mixing of random and systematic failures is wrong, but unfortunately I seem to be in a minority. One of the early drafts of ISO 26262-11:2018 stated that the figures from IEC 62380 and SN29500 were at the 99% confidence level, but I don’t see this in the published version. The reliability figures quoted in ISO 13849 are based on SN29500 and, I believe, assume an average operating temperature of 45°C (I have not had a chance to open ISO 13849 and confirm this).
The values from www.analog.com/ReliabilityData, IEC 62380 and SN29500 often differ by an order of magnitude. However, if a consistent source of reliability predictions is used, then the figures are excellent for comparing architectures. In a real design, diagnostics and hardware fault tolerance mean that the eventual PFH/PFD for an entire safety function could be less than the FIT for an individual IC used in the safety function.
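A rough sketch of how diagnostics alone can bring the dangerous undetected failure rate well below the raw FIT of an IC is given below. It covers only the diagnostics part of the argument for a single channel and ignores redundancy, common cause failures and the rest of the safety function. The 100 FIT figure, the 50% dangerous fraction and the 99% diagnostic coverage are hypothetical inputs picked for illustration, not values from any datasheet or standard.

```python
def dangerous_undetected_rate(fit, dangerous_fraction, diagnostic_coverage):
    """Dangerous undetected failure rate per hour for a single channel.

    fit: total failure rate of the IC in FIT
    dangerous_fraction: share of failures assumed dangerous (0..1)
    diagnostic_coverage: share of dangerous failures detected (0..1)
    """
    for name, value in (("dangerous_fraction", dangerous_fraction),
                        ("diagnostic_coverage", diagnostic_coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(name + " must be between 0 and 1")
    lambda_total = fit * 1e-9                      # failures per hour
    lambda_dangerous = lambda_total * dangerous_fraction
    return lambda_dangerous * (1.0 - diagnostic_coverage)


if __name__ == "__main__":
    # Hypothetical IC: 100 FIT, half of failures dangerous, 99% coverage
    print(dangerous_undetected_rate(100.0, 0.5, 0.99))
```

With these hypothetical inputs the dangerous undetected rate comes out at roughly 5e-10/h, compared with 1e-7/h for the IC as a whole.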
ISO 26262 advocates the use of IEC 62380 and SN29500 for semiconductors and gives the rules for the calculations in part 11 of the standard. IEC 61508 is not as clear. The guidance is probably biased by the fact that a lot of the early people on the committee came from the process control industries.
Figure 5 - guidance from IEC 61508 on the source of reliability data
ISO 13849 (functional safety for machinery) does prioritize the use of manufacturer’s data, with item b) below using data based on SN29500.
Figure 6 - guidance on the source of reliability data from ISO 13849
Most experts will say you cannot mix data from various sources, but I think in practice you must, and this is supported by the excerpt below from a book I have quoted in a lot of recent blogs called “Reliability assessment of safety and production systems”.
Figure 7 - From the book "Reliability assessment of safety and production systems"
SN29500 makes it very clear that integration is good from a hardware reliability point of view. It gives a FIT of 70 for a die with 500k transistors, which increases to only approx. 80 if the number of transistors goes to 5 million. Therefore, one big die with 5 million transistors has a much lower predicted failure rate (approx. 80 FIT) than 10 separate die each with 500k transistors (approx. 10 × 70 = 700 FIT in total).
There were lots of interesting tangents I could have taken while writing the above, but I try to keep these blogs less than 2000 words. I will hopefully cover some of those tangents in future blogs.
The ADI reliability handbook available here gives a more theoretical explanation of the topic.
Relevant previous blogs in this series include