Soft Errors - Hard Facts

I am long overdue to write a blog on the functional safety requirements for software, and indeed for Verilog code, but this isn't that blog. By soft errors I mean bit flips in RAM or flip-flops (FF) that are not caused by hard errors and therefore disappear when the power is cycled. Previously soft errors were largely ignored and reliability predictions concentrated on hard errors, but once IEC 61508-2:2010 mentioned soft errors they could no longer be ignored. This is good, because in parts with significant RAM the soft error rate can easily exceed the hard error rate by three orders of magnitude. However, even in parts with no RAM there can be a large number of FF, so every part will have some level of soft errors. Even analog circuits, such as those using switched-capacitor architectures, can suffer from soft errors, but this is largely ignored given the relative scale of the problems.

Soft errors are largely caused by alpha particles from the packaging materials and by neutrons from galactic (cosmic ray) sources. At ground level the two contribute roughly equally. Alpha particles cannot penetrate deeply into silicon, but since they come from directly above the die they are hard to shield against, although the literature suggests a polyimide die coating can help. Neutrons, on the other hand, are hard to shield against without several meters of concrete or lead. Mitigation is therefore needed either at the CMOS device level, at the module level on the IC, at the system level on the IC, or at the top-level system level.

IEC 61508-7:2010 advocates a value of 1000 FIT per megabit if you don't have better information. The widely accepted Siemens SN 29500 series of standards advocates 1200 FIT per megabit. In practice, 1000 FIT per megabit is widely used. The optimum would be to test every IC, but that is not without controversy: you run into issues related to the many different types of FF used in a typical part, issues related to accelerated testing versus real-time testing on a mountain top, and discussions over the AVF (architectural vulnerability factor), whereby many of the soft errors never propagate to create a system failure.
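
To put those numbers in perspective, here is a minimal worked example, assuming a hypothetical part with 256 kB of on-chip RAM (the size is an assumption, not from any datasheet); it shows why RAM-heavy parts are dominated by soft errors rather than hard errors:

```c
#include <stdio.h>

/*
 * Purely illustrative calculation (not from the standards): converts the
 * 1000 FIT per megabit rule of thumb into a device-level soft error rate
 * for an assumed 256 kB on-chip RAM.  1 FIT = 1 failure per 1e9 device-hours.
 */
int main(void)
{
    const double fit_per_mbit = 1000.0;          /* IEC 61508-7 rule of thumb     */
    const double ram_kbytes   = 256.0;           /* assumed RAM size              */
    const double ram_mbit     = ram_kbytes * 8.0 / 1024.0;

    double device_fit = fit_per_mbit * ram_mbit; /* soft error FIT for the RAM    */
    double mtbf_hours = 1e9 / device_fit;        /* mean time between soft errors */

    printf("RAM size                     : %.0f kB (%.0f Mbit)\n", ram_kbytes, ram_mbit);
    printf("Soft error rate              : %.0f FIT\n", device_fit);
    printf("Mean time between soft errors: %.0f hours (~%.1f years)\n",
           mtbf_hours, mtbf_hours / 8760.0);
    return 0;
}
```

For this assumed part the RAM alone contributes 2000 FIT of soft errors, which illustrates the point above about soft errors dominating the hard error rate.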


At the CMOS device level you could use hardened devices (triple well, SOI, extra capacitance), but the most common way to cope with soft errors is at the silicon block level by adding parity or ECC to RAM. A parity bit will detect if one bit flips in a protected byte or word, but it cannot detect if two bits flip. If parity is combined with physical separation of logically contiguous bits then this problem is largely overcome, as one particle should no longer flip two bits in the same word. ECC, on the other hand, can typically detect all one-bit and two-bit errors and most higher-order errors. The big advantage of ECC over parity is that it can recover from one-bit errors with no intervention required. For parity errors it is generally necessary to reboot the system to clear the error, although it depends on the end application. If neither ECC nor parity is available, the application designer can still mitigate against soft errors by storing critical values in two memory locations and comparing the copies before using the values, as sketched below. This, however, tends to mess up the application code. Other options include using a two-channel system with comparison, somewhat similar to a CAT 3 or CAT 4 architecture from ISO 13849 as it is typically drawn. A dual-core lockstep architecture achieves similar benefits.
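
Here is a minimal sketch of that application-level mitigation, storing a critical value together with its bitwise complement and comparing the two copies on every read; the type and function names are hypothetical, not taken from any particular product:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/*
 * Minimal sketch of the application-level mitigation described above: a
 * critical value is stored twice (here as the value plus its bitwise
 * complement) and the two copies are compared on every read.  The type and
 * function names are hypothetical.
 */
typedef struct {
    uint32_t value;
    uint32_t inverted;   /* redundant copy, stored as ~value */
} critical_u32_t;

static void critical_write(critical_u32_t *c, uint32_t v)
{
    c->value    = v;
    c->inverted = ~v;
}

/* Returns false if the two copies disagree, i.e. a bit has flipped. */
static bool critical_read(const critical_u32_t *c, uint32_t *out)
{
    if (c->value != (uint32_t)~c->inverted) {
        return false;
    }
    *out = c->value;
    return true;
}

int main(void)
{
    critical_u32_t speed_limit;
    uint32_t v = 0;

    critical_write(&speed_limit, 1500u);
    bool ok = critical_read(&speed_limit, &v);
    printf("read ok: %d, value: %u\n", ok, v);

    speed_limit.value ^= (1u << 7);   /* simulate a single soft error (bit flip) */
    ok = critical_read(&speed_limit, &v);
    printf("read ok after injected bit flip: %d\n", ok);
    return 0;
}
```

On a detected mismatch the application would typically refuse to use the value and move to a safe state or reload it from a trusted source.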


Parts such as the ADSP-CM417F from ADI facilitate several of the above solutions: the on-chip RAM has ECC and physical separation, the RAM is built from multiple separate 32k blocks, and the part contains two cores with evidence of sufficient separation available. Parts such as the AD7124 (a 24-bit sigma-delta ADC) contain an on-chip state machine which, at the end of the configuration phase, stores a golden CRC; thereafter the state machine recalculates the CRC at an interval of less than 500 µs to check whether any of the configuration bits have flipped, as sketched below. Both of these also illustrate the value of Safety Datasheets, whereby the end user gets extra information to facilitate a safety analysis, e.g. information on the physical separation of logically contiguous bits in a RAM, or the fact that a RAM isn't implemented as one big block but rather as several smaller blocks…
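
Below is a minimal sketch of that golden-CRC idea, assuming a hypothetical shadow copy of the configuration registers and a CRC-16-CCITT polynomial; it is not the actual AD7124 mechanism, just an illustration of the technique:

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/*
 * Sketch of a "golden CRC" check over configuration registers.  This is NOT
 * the actual AD7124 implementation: the shadow register array, its size and
 * the CRC-16-CCITT polynomial are assumptions made for the example.
 */
#define NUM_CONFIG_REGS 16

static uint8_t  config_regs[NUM_CONFIG_REGS]; /* stand-in for device registers */
static uint16_t golden_crc;

static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;
        for (int b = 0; b < 8; b++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Call once at the end of configuration to capture the golden value. */
static void config_crc_capture(void)
{
    golden_crc = crc16_ccitt(config_regs, sizeof config_regs);
}

/* Call periodically (e.g. from a timer that runs more often than the
 * application's reaction time requires); returns false if a bit has flipped. */
static bool config_crc_check(void)
{
    return crc16_ccitt(config_regs, sizeof config_regs) == golden_crc;
}

int main(void)
{
    config_regs[3] = 0x5A;       /* pretend some configuration was written */
    config_crc_capture();
    printf("check after configuration : %d\n", config_crc_check());

    config_regs[3] ^= 0x10;      /* simulate a flipped configuration bit   */
    printf("check after injected flip : %d\n", config_crc_check());
    return 0;
}
```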

The most famous recent case of soft errors came from automotive, where it was mooted that a single bit flip could cause unintended acceleration. Other cases have included voting machine errors and electricity sub-stations shutting down.

If I had time I would have discussed soft errors and FPGAs, but perhaps that is a topic for another day.

This video is inappropriate in the sense that neutrinos don't really interact with matter, but it is impressive and it shows that high-energy particles are passing through all matter – see https://www.youtube.com/watch?v=EY-WH62IXbM I thought this video used to be longer, but you get the idea; perhaps the longer version has been taken down.

This week there is a bonus video explaining how Tesla have decided to be radiation tolerant instead of radiation hardened – see https://www.youtube.com/watch?v=N5faA2MZ6jY It is also a good example of taking care of safety at the system level.

Good books on the topic of soft errors, if you want to learn more, include

1) “Soft Errors in Modern Electronic Systems”

2) “Architecture Design for Soft Errors”

For next time, the discussion will be on the “PFH and PFD”.