ADAU1452 intermittently reporting strange panic codes

Category: Hardware
Product Number: ADAU1452

I have an ADAU1452 in a product that intermittently reports panic codes. This only happens at elevated temperatures (40C ambient), about once every 48 hours. The design is in production and the issue has only been seen on one DSP. I am trying to find out what the panic manager is telling me, but the datasheet doesn't elaborate on what the panic codes mean. When the DSP is running without issue, readbacks of 0xf427 and 0xf428 both report 0x0000.

When the DSP reports a panic code, here is what I have found:

A readback of the Panic Code Register (0xf428) reports 0x8800 which I take to mean it is reporting "Error from software panic" and "Error in DM1 Bank 3". What do these errors mean? Also, I thought only one panic error could be reported at a time but I've got 2 bits high. What can be the cause of these errors?

Also, the panic flag in this case doesn't make much sense. A readback of the Panic Flag Register (0xf427) reports 0x0040. I take this to mean that the Panic flag is not set even though a panic code has been set. Does this make sense to anyone? I've never seen a readback from 0xf427 that was anything but 0x0000 or 0x0001 on any other ADAU1452.

I have confirmed, over repeated reads, that my readback of these registers is static and correct. The DSP hardware also reports that the panic code register is nonzero by lighting an external LED attached to a GPIO.
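
To make the two readbacks easier to reason about, here is a minimal Python sketch that lists which bit positions are set in each value. The register addresses and readback values are the ones quoted in this post; the helper name `set_bits` is made up for illustration, and no bit names are assigned here because the bit-to-error mapping should be taken from the datasheet.

```python
PANIC_FLAG_ADDR = 0xF427  # Panic Flag Register (address from the post)
PANIC_CODE_ADDR = 0xF428  # Panic Code Register (address from the post)


def set_bits(value, width=16):
    """Return the positions of the set bits in a register value (LSB is bit 0)."""
    return [bit for bit in range(width) if value & (1 << bit)]


panic_code = 0x8800  # readback reported in the post
panic_flag = 0x0040  # readback reported in the post

print(set_bits(panic_code))        # [11, 15]: two error bits latched at once
print(set_bits(panic_flag))        # [6]: an unexpected bit
print(bool(panic_flag & 0x0001))   # False: bit 0, the only bit seen set on other parts, is clear
```

Under these assumptions, the code register shows two latched bits and the flag register shows a bit that is neither of the two values (0x0000 or 0x0001) normally seen.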

Thanks for your time,
Joel

  • Hello Joel,

    Great work on your part analyzing this.

    It is strange that you are getting 0x8800. You are correct that you should only get one code, but I think there is one situation where, if both errors happen in the same system clock cycle, both would latch. I recall the designer saying something like that, and that the likelihood of this happening is remote. This may be a sign of an error in something around the memory rather than in the memory itself. It is still a problem, and I think you have a part that has become faulty. You are only testing it at 40C, which is not too hot, but I hope your PCB design has plenty of vias and a large ground plane for the exposed pad (EP) under the part to dissipate the heat.

    Getting back to your questions.

    The Software Panic code is for software to use to report errors. The software then populates the Software Value registers 0xF433 and 0xF434 with non-zero values. Right now this is only used for reporting selfboot errors, so if selfboot fails it will stuff a value in these registers. Nothing else is using these registers at the moment, so this bit being triggered is a sign that there is a failure in the part itself.

    The other bit being set in 0xF428 is the DM1 bank 3 error. These are parity errors, and this one points to memory that is used for DM1 parameters. The bank is an internal construct describing how the memory elements are used in the part. In the ADAU1452 there are no "banks" of memory from the viewpoint of the external interface or of the core itself, but internally there are banks in the way the memory itself works in the silicon. Either way, it is pointing to an error in DM1.

    The parity error is triggered when the part reads or writes to memory and the parity bit does not agree with the actual parity of the data. This is meant to catch errors where bits have flipped so the data is no longer valid, that is, the "mean time to failure" errors where memory will eventually not hold the data and a bit might flip. That is usually measured in years or decades, so it is unusual to have this happen. The only time I have seen it happen is when the program tries to access memory that has not been initialized, so the parity bit and the data are both random. It is then a 50/50 chance that the random parity bit matches the data. Since you have been testing a number of parts with, I assume, the same program, I doubt this is the issue. You can select an option in the Advanced Framework Options to zero out all the memory when the program loads. That might prevent this error, but in your case the program is working fine; it is this part that is not working fine. There will likely be other ramifications of the failure besides just the panic manager. Then again, anything is possible: this could be a failure in the panic manager itself while the rest of the part is working perfectly.
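
    The parity idea described above can be illustrated with a small Python sketch. This is only a generic model: the even-parity convention, the 32-bit word width, and the function names are assumptions made for the example, not details of how the ADAU1452 memory actually implements parity.

```python
import random


def even_parity_bit(word):
    """Parity bit that makes the total count of 1 bits (data plus parity) even."""
    return bin(word).count("1") % 2


def parity_ok(word, stored_parity):
    """True when the stored parity bit agrees with the parity of the data."""
    return even_parity_bit(word) == stored_parity


# A correctly written word passes; a single flipped bit is caught.
word = 0x12345678                # arbitrary example value
parity = even_parity_bit(word)
flipped = word ^ (1 << 7)        # simulate one bit flipping in memory
print(parity_ok(word, parity))      # True
print(parity_ok(flipped, parity))   # False

# Uninitialized memory: data and parity bit are both random, so the
# check passes only about half the time.
rng = random.Random(0)
trials = 10_000
passes = sum(parity_ok(rng.getrandbits(32), rng.getrandbits(1)) for _ in range(trials))
pass_rate = passes / trials
print(pass_rate)                 # close to 0.5
```

    Note that a single parity bit only catches an odd number of flipped bits; two flips in the same word would go unnoticed in this model.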

    I hope this explanation is clear enough. Sorry it is a bit long. 

    Dave T

  • Thanks for the quick and detailed reply Dave, it is very informative.

    The part is not being pushed to any temp extremes when this error is reported but repeated observation has shown that the slightly elevated ambient temperature is required to reproduce the intermittent panic codes. I only mentioned the temperature since it is a required condition and it is not extreme. The part is not getting hot but thanks for reminding me that it does have thermal requirements.

    For the sake of clarity, I am not self-booting this part from an EPROM. The part is programmed from a uC only once at power up.

    This DSP is an outlier in the sample size I have seen but I am trying to understand the origin of the failure as best I can in case there is something I can do to prevent it and trap for it in the future. So far I have not pinned this failure down to a single part or process but it is still a goal of mine.

    This unit has prompted me to trap for this kind of behavior in our FW, and I now have a few mechanisms for detecting this kind of hardware error rather than letting the failure manifest itself in random ways and cause much confusion. Basically I'm trying to make the most of dealing with this error, but I'm still curious what else I can do.
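
    As a rough illustration of what such a trap could look like (not the actual FW, which is not shown in this thread), here is a Python sketch that classifies the two panic registers. The `read_reg` callable is a hypothetical stand-in for whatever the microcontroller uses to read a 16-bit DSP register, and the three status strings are invented for the example. The only facts taken from the thread are the register addresses, that healthy parts read 0x0000 in both, and that the flag register normally reads only 0x0000 or 0x0001.

```python
PANIC_FLAG_ADDR = 0xF427  # Panic Flag Register (address from the thread)
PANIC_CODE_ADDR = 0xF428  # Panic Code Register (address from the thread)


def dsp_health(read_reg):
    """Classify the DSP state from its panic registers.

    read_reg is a hypothetical callable: read_reg(address) -> 16-bit int.
    """
    flag = read_reg(PANIC_FLAG_ADDR)
    code = read_reg(PANIC_CODE_ADDR)
    if flag == 0x0000 and code == 0x0000:
        return "ok"            # the normal readback on a healthy part
    if flag == 0x0001 and code != 0x0000:
        return "panic"         # flag and code agree that a panic occurred
    return "inconsistent"      # anything else, e.g. flag 0x0040 with code 0x8800


# Simulated register maps standing in for real hardware reads.
healthy = {PANIC_FLAG_ADDR: 0x0000, PANIC_CODE_ADDR: 0x0000}
observed = {PANIC_FLAG_ADDR: 0x0040, PANIC_CODE_ADDR: 0x8800}

print(dsp_health(healthy.__getitem__))    # ok
print(dsp_health(observed.__getitem__))   # inconsistent
```

    Treating "inconsistent" as its own state, rather than folding it into "panic", is a design choice that lets the readbacks that break the reporting rules be flagged separately as a likely bad part.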

    The same program is running on many units under the same conditions for thousands of hours, and I have only seen this error in one DSP. This DSP also very intermittently returns nonsensical register values and functions incorrectly in its processing duties. I have been getting the sense that this sample DSP is just a bad part, but that notion didn't really start to solidify until I looked at the panic codes, which seemed to be breaking the rules of the reporting. It seems that you get the same sense from these findings, and hearing that is useful for me.

    If you have any other ideas of things to check for, registers to read, tests to try, please let me know. I have this unit running 24/7 in a loop, seeing what it will do next and testing that our FW traps it as a bad DSP and stops it from doing any damage or causing confusion in the field should this happen on another ADAU1452.

    Thanks again for your time,
    Joel