
Currents and voltages read from ADE7758 randomly "freezing", fixes only with software reset

Category: Hardware
Product Number: ADE7758

This is very similar to the issue in this post - ADE7758 voltage reading hangs on contactor switching... Sadly, there is no answer there...

In our application an STM32 microcontroller talks to an ADE7758 chip via an ADuM1411Bxxx isolator. The ADE7758 is the only device on the SPI bus, but the MCU still drives the CS pin. The chip measures 3-phase currents, voltages and energies of large industrial pumps, which are started using contactors (the contactors are driven by relays on the same PCB as the ADE7758 and the MCU). Voltages are connected directly to the ADE7758 (through proper dividers, obviously) and currents are measured using transducers. The SPI clock is 2.25 MHz. For every register that is read, the firmware also checks the checksum from the CHKSUM register.

This all works well in hundreds of devices and for hundreds of hours. However, sometimes the voltage readouts jump to insane levels (above the full scale of the device) and the current readouts fall to very low values, while everything else seems to keep working. When this problem appears, the MCU can still read all registers of the ADE7758, every single read passes the checksum test, and all values are reasonable except voltage and current. For example, our logs show that during this invalid state energy is still accumulated, but much more slowly than before (due to the very low current readouts).

As an example, before the issue appears, the current I'm reading from phase A (AIRMS register) is 160945, while the voltage for this phase (AVRMS register) is 732381. After the problem starts, the current readouts drop to values around 100. They are different for each phase and for each readout (they fluctuate by around +/- 15%, but the values are invalid). The invalid voltages, on the other hand, spike to an insane level of 6289664. I read this exact same value for every phase and every readout, so it seems this value is frozen. In this state the energy registers still seem to accumulate something very slowly, as they increase by 1 every couple of seconds (while before the issue they increase by ~2000 every second).

What is interesting is that during the issue, the STATUS register frequently reports "missing zero crossing" for all 3 phases, while the VARCFNUM register reports "reverse polarity reactive power measurement" in bits 13-15. The power line connections in the device are solid, so there is no way they could get disconnected.
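The symptom described above (all three voltage registers holding one identical value, readout after readout) could in principle be detected in firmware, so that a reset is issued only when needed instead of every minute. Below is a minimal sketch of such a check, not the firmware actually in use. The type `freeze_detector_t`, the function `freeze_detector_update` and the `threshold` parameter are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not part of any ADE7758 driver. */
typedef struct {
    uint32_t last_vrms;      /* value shared by all phases on the previous readout */
    unsigned identical_runs; /* consecutive readouts showing that same shared value */
} freeze_detector_t;

/* Feed one readout of AVRMS/BVRMS/CVRMS.
 * Returns true once all three phases have shown one identical value
 * for at least `threshold` consecutive readouts. */
static bool freeze_detector_update(freeze_detector_t *d,
                                   uint32_t avrms, uint32_t bvrms, uint32_t cvrms,
                                   unsigned threshold)
{
    bool phases_equal = (avrms == bvrms) && (bvrms == cvrms);

    if (phases_equal && d->identical_runs > 0 && avrms == d->last_vrms) {
        d->identical_runs++;          /* same frozen value as last time */
    } else if (phases_equal) {
        d->last_vrms = avrms;         /* first readout with equal phases */
        d->identical_runs = 1;
    } else {
        d->identical_runs = 0;        /* phases differ: looks healthy */
    }
    return d->identical_runs >= threshold;
}
```

A caller would keep one zero-initialised `freeze_detector_t` and pass each new set of voltage readouts to `freeze_detector_update`. Real three-phase readings differ slightly between phases, so exact equality over several readouts is a strong hint of the frozen state. Lines that are genuinely unpowered might need separate handling if they all read the same value.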

To avoid this issue, we added a periodic software reset of the ADE7758 chip: every minute the firmware issues a "software reset" command, waits for the required amount of time and resumes normal operation. Initially this seemed to fix the issue, but at some sites the problem still appears, although now each occurrence is limited to no more than 1 minute. This periodic reset is not a solution and it has some issues of its own (a couple of reads after the reset are way off the expected values), so we would really like to understand what the issue is and how we could really fix it. What we have noticed so far is that it seems to be somehow noise related, as with the load disconnected (no motor) we cannot catch it. The SPI communication seems to be OK: as I've mentioned, for each read the firmware verifies the checksum and this verification passes each time. However, reducing the SPI frequency to ~300 kHz somehow magically makes the issue go away (even though both the ADE7758 and the isolator are capable of data rates up to 10 Mbps, so 2.25 MHz is way below their limit).

We would be very grateful for some pointers on what to look for, as the issue is extremely hard to reproduce (it seems to be impossible to reproduce in the lab) and happens rarely.

Thanks in advance!

  • If noise gets to the crystal pins or the SCLK/data lines, corruption can happen. SPI reads can turn into writes, and you really don't know where the write went.

    This is an immunity problem. PCB layout has a lot to do with this. CMTI (common-mode transient immunity) can also be an issue. The isolator family you chose is rated at 25 kV/µs.

    This family is rated at 100 kV/µs:

    https://www.analog.com/media/en/technical-documentation/data-sheets/ADuM140D_140E_141D_141E_142D_142E.pdf

    You should find a pin for pin compatible part.

    If you have a meter with the problem, can you read back all the registers to see what happened before you issue the software reset?

    Without knowing what is happening (the contents of all registers listed in the datasheet), it will be hard to figure out.

    Dave

  • If noise gets to the crystal pins or sclk data, corruption can happen. Spi reads can turn to writes and you really don't know where the write went.

    True, this is possible, but I would personally consider it extremely unlikely. All reads are done with a checksum, and the log we are collecting right now records any checksum errors. They do sometimes happen, but for example 30 minutes before the problem appears, and with no changes to writable registers. The logger reads all available ADE7758 registers, and the writable ones stay exactly the same for the whole duration: when there are no checksum errors, when I catch a few, when the problem starts, when it ends. There is zero change to any configuration register of the ADE7758. Also, in a situation like this I would expect the read to turn into a write-all-zeroes, so the change to a writable register would have to be significant and very visible.

    I've also tried causing the "noise" manually by shorting the 10 MHz crystal next to the chip. The symptoms were slightly different from what we observe when the real issue happens. As I wrote earlier, during the issue all voltage registers are frozen with the same, very large value, and the current registers are very small but fluctuating. If I short the 10 MHz crystal of the ADE7758, then all current and voltage registers become frozen with the last value the ADC was converting (all 6 are the same, but the value they hold is "reasonable"). If I stop shorting the crystal, the situation gets back to normal immediately, while the issue we face in the real application never resolves by itself; a software reset is needed. I've also tried shorting the Vref output, but this just causes Vref to drop and makes all readings very small, and it also resolves as soon as I stop shorting the pin.

    If you have a meter with a problem can you read back all the registers to see what happened before you software reset?

    That's the weirdest thing - nothing. All the writable registers stay at exactly the same value as previously and all of them have expected values. The only registers that change are the voltage (frozen with same value, very large), current (fluctuating with 3 different very small values), energy (increasing slowly), status (frequently reports zero-crossing timeout and stops reporting zero-cross detection) and the 3 bits of VARCFNUM (reverse polarity of reactive power) - everything else stays exactly the same as it was.

    During the weekend we found that it is somehow related to the communication. The application uses an SPI frequency of 2.25 MHz. If we increase that to 4.5 MHz, the problem seems to appear more often (a few times per hour). If we decrease the SPI frequency to 281 kHz, the problem does not appear for 3 days straight. If we were getting random errors, then communication would be an obvious suspect, but in our case it really looks as if the ADE7758 got partially frozen: the firmware can read any register it wishes, with checksum, and everything matches, but the voltages and currents are completely wrong.

    As you probably can tell, we are already desperate (;

  • You need to read all registers, not just the ones you write.

    If you can, try a different isolator, as I mentioned.

    Can you share your layout?

    Dave

  • You need to read all registers not just the ones you write. 

    I am reading all registers that exist in the ADE7758 chip, and there is no change in value when the problem starts, except for the ones I listed above. Sorry if that was not clear.

    I'll get the answer to the question about the isolator and the layout from people in charge of that (I'm just the firmware guy (; ).

  • Hello again Dave!

    Unfortunately I'm not allowed to share the layout of the PCB or schematic, however I was assured that it is done according to AD requirements from the datasheet. So this is all I can tell...

    As for the tests we do, here is a slightly shortened log of what we have

    /cfs-file/__key/communityserver-discussions-components-files/357/voltage_2D00_spike.zip

    The columns contain: timestamp, calculated average current and voltage, max number of consecutive SPI transfer retries and all existing ADE7758 registers, read one by one each time.

    In this spreadsheet you will find three blocks of data that I consider interesting. Whenever the "max transfer retries" changes (increases), this means that there were a few checksum errors during transfer and the operation had to be retried several times in a row. Generally when this number changes, it means that somewhere in this line there was some glitch and the firmware had to retry the read several times in a row. There are two such blocks of data where the number of retries changes (exact spot marked with yellow background), however as you may see in the log of registers - no writable register changes. The last block of data is when the ADE7758 "freezes" - the rows which are affected are marked with yellow background. The most interesting values which change/freeze are marked with red background - voltages, currents, Status, RSTATUS and VARCFNUM (it is possible that frequency and temperature readings are also frozen, but it is hard to tell that with 100% confidence). This "freeze" will go on until the firmware issues a software reset to ADE7758 chip - it will never fix itself. As you see, in this block of data also no configurable register changes its value, and in fact every register (except for those affected) has exactly the same (correct) value as previously.

    During our experiments we have found that it is somehow related to the SPI frequency: at 1.1 MHz or above the problem occurs regularly, and increasing the SPI frequency increases the rate of occurrence. At 500 kHz and below, however, the chip seems to work fine for at least a day without any issues (the longest we tried was 3 days straight with ~250 kHz SPI frequency). If the issue were related to SPI glitches, though, I would expect to see some random values in random registers from time to time, yet the whole (full) log looks perfectly fine to me, and the issue affects only the voltage and current registers, which DO pass checksum tests each time...
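For readers unfamiliar with the "max transfer retries" column mentioned above, here is a minimal sketch of a checksum-verified read with retries. It is an illustration, not the firmware from this thread. It assumes, per my reading of the datasheet, that CHKSUM (address 0x7E) holds the number of 1 bits in the register read immediately before it. The names `reg_read_fn`, `read_verified` and `fake_read` are hypothetical, and `fake_read` is only a stand-in transport for desk testing that corrupts the first data read.

```c
#include <stdbool.h>
#include <stdint.h>

#define ADE7758_REG_CHKSUM 0x7Eu /* assumed address, check against the datasheet */

/* Hypothetical transport hook: reads `nbytes` from register `addr`
 * and returns the value right-aligned. */
typedef uint32_t (*reg_read_fn)(uint8_t addr, uint8_t nbytes);

/* Count the 1 bits in v (Kernighan's method). */
static uint8_t count_ones(uint32_t v)
{
    uint8_t n = 0;
    while (v) {
        v &= v - 1; /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* Read a register, then CHKSUM, and retry until they agree.
 * On success stores the value and the number of retries that were needed. */
static bool read_verified(reg_read_fn read_reg, uint8_t addr, uint8_t nbytes,
                          unsigned max_attempts, uint32_t *value, unsigned *retries)
{
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        uint32_t v = read_reg(addr, nbytes);
        uint8_t chk = (uint8_t)read_reg(ADE7758_REG_CHKSUM, 1);
        if (chk == count_ones(v)) {
            *value = v;
            *retries = attempt;
            return true;
        }
    }
    *retries = max_attempts;
    return false;
}

/* Stand-in transport for desk testing: flips one bit in the first data read. */
static unsigned fake_data_reads;

static uint32_t fake_read(uint8_t addr, uint8_t nbytes)
{
    const uint32_t true_value = 732381u; /* AVRMS sample quoted in the first post */
    (void)nbytes;
    if (addr == ADE7758_REG_CHKSUM)
        return count_ones(true_value);   /* the chip counts the bits it sent */
    fake_data_reads++;
    return (fake_data_reads == 1) ? (true_value ^ 0x10u) : true_value;
}
```

With `fake_read` as the transport, the first attempt fails the check and the second succeeds, so `read_verified` reports one retry. Because this kind of check only compares the number of set bits, an error that leaves that count unchanged (for example two bits swapped) would still pass.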

  • If you can share the schematic and layout to my private email, that would be better. The datasheet does not have a suggested layout guideline, so the layout may have issues that I could point out.

    The most important point is the CLKIN pin of the crystal. Where is the loading cap? It should be as close as possible to the CLKIN pin.

    The AGND and DGND pins should be tied together to a plane, with the crystal load caps returned to the same plane. This will reduce ground bounce that may cause timing violations. Since the SPI port runs off the crystal, this might be why your issue seems to depend on SPI clock speed.

    The issue could still be on the SPI clock and data-in pins. If these signals have noise on them, the wrong data could be received. Usually adding a low-pass filter on the SPI lines helps.

    Adding a cap at the ADE7758 and a resistor at the micro will reduce the noise on the SPI lines.

    Adding the same filter on the data-out line to the micro is also suggested. Do this by adding the cap at the micro and the resistor at the ADE7758.

    The cap at the input pins of the SPI port lowers their impedance, which reduces the high-frequency noise. Sometimes a cap alone is useful; it really depends on SPI speed.

    Dave
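As a rough aid for choosing the resistor and capacitor for the low-pass filter suggested above, the corner frequency of a single RC stage is f_c = 1 / (2·π·R·C). The sketch below just evaluates that formula. The function name `rc_cutoff_hz` and the example values (100 Ω, 100 pF) are assumptions for illustration, not values recommended anywhere in this thread.

```c
/* Corner (-3 dB) frequency in Hz of a single-pole RC low-pass filter. */
static double rc_cutoff_hz(double r_ohms, double c_farads)
{
    const double pi = 3.14159265358979323846;
    return 1.0 / (2.0 * pi * r_ohms * c_farads);
}

/* How many times higher the filter corner is than the SPI clock.
 * A ratio well above 1 leaves the clock edges mostly intact. */
static double cutoff_to_sclk_ratio(double r_ohms, double c_farads, double sclk_hz)
{
    return rc_cutoff_hz(r_ohms, c_farads) / sclk_hz;
}
```

For example, `rc_cutoff_hz(100.0, 100e-12)` gives about 15.9 MHz, roughly 7 times the 2.25 MHz SPI clock used here. Whether that margin is appropriate depends on the edge rates and drive strength of the isolator outputs, so the values would need to be confirmed on the actual board.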