by Kiran Hoovinalli
Design engineers often prefer to use the same architecture for redundant elements so that their electrical characteristics match those of the main element. This also saves time and reduces design complexity. When the same product is used in a safety-critical application, e.g., for monitoring battery cell temperature in an electric car, the recommendation is to use a diverse architecture for redundant elements to prevent common cause failures. In many cases, however, there is a requirement to implement homogeneous redundant elements that perform similar functions in parallel. It then becomes important to take preventive measures against common cause failures. ISO 26262 Part 11 contains many recommended measures, the most important of which is the physical separation between homogeneous redundant elements on the semiconductor chip.
Functional Safety Concept of Using Redundant Elements in a Semiconductor Chip
In functional safety, the most commonly used safety concept for detecting failure modes in an element A1 is to add a homogeneous redundant element A2 with the same function and compare their outputs with a comparator (Figure 1). Examples in semiconductor ASIL products include dual lockstep cores for microcontrollers, redundant measurement channels, and redundant bandgap references.
Figure 1 - Homogeneous redundancy using comparator
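The comparator concept in Figure 1 can be sketched in a few lines of Python. This is only an illustrative model, not an actual hardware implementation: the function names and the 0.5 tolerance are assumptions chosen for the example.

```python
def outputs_agree(a1: float, a2: float, tolerance: float) -> bool:
    """Return True when the two redundant outputs match within the tolerance."""
    return abs(a1 - a2) <= tolerance


def check_redundant_pair(a1: float, a2: float, tolerance: float = 0.5) -> str:
    """Model of the output comparator: report a fault when A1 and A2 disagree.

    The default tolerance is a hypothetical value for illustration only.
    """
    return "OK" if outputs_agree(a1, a2, tolerance) else "FAULT"


# Both elements read roughly the same temperature: no fault is reported.
healthy = check_redundant_pair(25.0, 25.2)

# Element A2 has failed and reads far too high: the comparator flags it.
faulty = check_redundant_pair(25.0, 40.0)
```

In a real product the tolerance would follow from the accuracy and matching of the two elements; here it simply separates normal mismatch from a failure.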
Although it is easy to implement this safety concept by adding the same element again in the design architecture, it is difficult to prevent common cause failures that lead to the same wrong output from both the main and redundant elements (Figure 2). Since the outputs of both elements are affected in a similar way, this common failure mode may not be detected by the output comparator. For semiconductor ICs, common cause failures can arise from random failures in shared resources such as a common power supply or a common clock; from environmental factors such as temperature, package stress, EMI, or vibration; or from random physical root causes such as local defects, oxide breakdown, latch-up, or local heating. Systematic failures may also be relevant if they affect both hardware elements simultaneously.
Figure 2 - Main and redundant elements
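To make the limitation concrete, the following sketch contrasts an independent failure with a common cause failure. It is a toy model under stated assumptions: the true temperature, the error magnitudes, the tolerance, and all names are hypothetical.

```python
TRUE_TEMPERATURE = 25.0  # hypothetical real cell temperature in degrees Celsius


def comparator_flags_fault(a1: float, a2: float, tolerance: float = 0.5) -> bool:
    """Return True when the redundant outputs differ by more than the tolerance."""
    return abs(a1 - a2) > tolerance


def read_pair(a1_error: float = 0.0, a2_error: float = 0.0, common_error: float = 0.0):
    """Return the outputs of A1 and A2.

    a1_error and a2_error model independent failures of one element;
    common_error models a common cause that shifts both outputs equally.
    """
    a1 = TRUE_TEMPERATURE + a1_error + common_error
    a2 = TRUE_TEMPERATURE + a2_error + common_error
    return a1, a2


# Independent failure: only A1 is wrong, so the outputs disagree.
a1, a2 = read_pair(a1_error=10.0)
independent_detected = comparator_flags_fault(a1, a2)

# Common cause failure: both outputs are wrong by the same amount,
# so they still agree and the comparator stays silent.
a1, a2 = read_pair(common_error=10.0)
common_detected = comparator_flags_fault(a1, a2)
```

Here `independent_detected` is `True` and `common_detected` is `False`, even though both readings are 10 degrees off in the second case. This is why the measures described in the following sections are needed in addition to the comparator.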
Cause of Stress-Related Failures in a Semiconductor Chip
Many studies have found that the plastic package of a semiconductor chip causes stress-related failures such as cracked passivation, metal deformation and delamination, cracked chips, cracked packages, and parameter shifts. The large mismatch in the coefficient of thermal expansion between the silicon chip and the plastic encapsulant is the major contributor to these failures. Package stress is highest at the centre of the die and gradually decreases towards the edges.
Measures to Prevent Common Cause Failures
Common cause failures due to shared resources can be mitigated by adding monitors that check for failure of the shared resource, such as an overvoltage/undervoltage monitor for the power supply (Figure 3). Where possible, avoid the shared resource altogether by giving the redundant element A2 its own power supply or clock (Figure 4).
Figure 3 - Overvoltage/undervoltage monitor for the power supply
Figure 4 - Shared resource
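The behaviour of an overvoltage/undervoltage monitor such as the one in Figure 3 can be illustrated with a simple window check. This is a minimal sketch: the class name, the status strings, and the 3.0 V and 3.6 V limits are assumptions for the example, not values from any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupplyMonitor:
    """Window monitor for a shared power supply (limits in volts)."""

    undervoltage_limit: float
    overvoltage_limit: float

    def status(self, supply_voltage: float) -> str:
        """Classify the supply voltage against the allowed window."""
        if supply_voltage < self.undervoltage_limit:
            return "UNDERVOLTAGE"
        if supply_voltage > self.overvoltage_limit:
            return "OVERVOLTAGE"
        return "OK"


# Hypothetical window around a nominal 3.3 V supply.
monitor = SupplyMonitor(undervoltage_limit=3.0, overvoltage_limit=3.6)

nominal = monitor.status(3.3)
sagging = monitor.status(2.7)
surging = monitor.status(3.9)
```

When the monitor reports anything other than "OK", the outputs of both A1 and A2 should be treated as untrustworthy, because the comparator alone cannot reveal a supply fault that affects both elements equally.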
Physical separation between homogeneous hardware elements helps prevent common cause failures due to environmental factors and random physical root causes on the semiconductor chip. Physical separation means that the redundant elements are placed away from each other on the die and that their internal routing does not overlap (Figure 5). Dedicated guard rings can also be added around each element. Redundant elements, bandgaps in particular, are subjected to different package stress levels when placed at different die locations (Figure 6). This keeps both bandgap reference voltages from drifting in the same way.
The placement of redundant elements should be carefully considered in the layout floor plan. The final layout must also be reviewed during the dependent failure analysis to confirm that the separation has been implemented correctly.
Figure 5 - Semiconductor die
Figure 6 - Different die locations
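A floor-plan separation check of the kind described above could be prototyped as follows. This is a rough sketch under several assumptions: each element is reduced to a bounding box, coordinates are in micrometres, and the block names and the 500 µm minimum gap are hypothetical. It does not replace the layout review in the dependent failure analysis.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Block:
    """Axis-aligned bounding box of a layout element (micrometres)."""

    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def gap_between(a: Block, b: Block) -> float:
    """Shortest distance between two boxes; 0.0 if they touch or overlap."""
    dx = max(a.x_min - b.x_max, b.x_min - a.x_max, 0.0)
    dy = max(a.y_min - b.y_max, b.y_min - a.y_max, 0.0)
    return hypot(dx, dy)


def is_separated(a: Block, b: Block, min_gap: float) -> bool:
    """Return True when the boxes are at least min_gap apart."""
    return gap_between(a, b) >= min_gap


# Hypothetical placements of two redundant bandgaps on the die.
bandgap_1 = Block("BG1", 100.0, 100.0, 300.0, 300.0)
bandgap_2 = Block("BG2", 1500.0, 1200.0, 1700.0, 1400.0)
bandgap_2_adjacent = Block("BG2_adjacent", 310.0, 100.0, 510.0, 300.0)

good_placement = is_separated(bandgap_1, bandgap_2, min_gap=500.0)
bad_placement = is_separated(bandgap_1, bandgap_2_adjacent, min_gap=500.0)
```

With these example coordinates, the first placement leaves a 1500 µm gap and passes, while the adjacent placement leaves only 10 µm and fails. A check like this covers only the distance between elements; routing overlap and guard rings still need to be reviewed in the layout itself.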
Conclusion
In real life, we usually don’t keep important documents and their copies in the same place at home. Similarly, redundant elements should be placed at different locations on the chip; this addresses functional safety concerns and helps to eliminate common cause failures.