2009-03-13 18:43:35     POST memory test crashes on some hardware

Document created by Aaronwu Employee on Sep 18, 2013

Steve Strobel (UNITED STATES)

Message: 70972   

 

I am working with three different custom BF537 hardware designs, all with the same 64MB of SDRAM (two MT48LC32M8A2TG-75 chips, which I think are the same parts used on the BF537 Stamp) and all running the same version of U-Boot.  One design has longer data and address bus traces than the other two, which we think contribute to occasional errors.  We have been testing with U-Boot's "mtest" command, sometimes letting the units run overnight and sometimes temperature cycling them (up to 70 degrees C) while watching for errors.  The two designs with shorter bus traces work consistently in all cases.  On units of the other design, we have been able to make "mtest" work consistently in all of our tests by changing the values of the series termination resistors on the data and address busses and control signals.  The SDRAM controller is configured as follows:

 

// From include/configs/our_board.h:

 

#define CONFIG_EBIU_SDRRC_VAL 0x306

#define CONFIG_EBIU_SDGCTL_VAL 0x91114d

#define CONFIG_EBIU_SDBCTL_VAL 0x25
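
As a cross-check on the values above, here is a minimal sketch that pulls the timing fields out of the SDGCTL value and recomputes the refresh divider. The bit positions and the RDIV formula are my reading of the Blackfin EBIU documentation, and the 64 ms refresh period and 8192 rows are assumed as typical for a 256 Mbit SDRAM part; none of that comes from the post itself, so check it against the hardware reference and the Micron datasheet. The function and struct names are hypothetical.

```c
#include <stdint.h>

/* Timing fields of EBIU_SDGCTL. Bit positions are an assumption based on
 * the BF53x register layout; verify against the hardware reference. */
typedef struct {
    unsigned cl;    /* CAS latency, bits 3:2              */
    unsigned tras;  /* tRAS in SCLK cycles, bits 9:6      */
    unsigned trp;   /* tRP in SCLK cycles, bits 13:11     */
    unsigned trcd;  /* tRCD in SCLK cycles, bits 17:15    */
    unsigned twr;   /* tWR in SCLK cycles, bits 20:19     */
} sdgctl_fields;

sdgctl_fields decode_sdgctl(uint32_t v)
{
    sdgctl_fields f;
    f.cl   = (v >> 2)  & 0x3;
    f.tras = (v >> 6)  & 0xF;
    f.trp  = (v >> 11) & 0x7;
    f.trcd = (v >> 15) & 0x7;
    f.twr  = (v >> 19) & 0x3;
    return f;
}

/* RDIV = (fSCLK * tREF / NRA) - (tRAS + tRP), where tREF is the refresh
 * period and NRA the number of rows to refresh in that period. */
uint32_t sdrrc_rdiv(uint32_t sclk_hz, uint32_t tref_ms, uint32_t rows,
                    unsigned tras, unsigned trp)
{
    uint64_t cycles_per_row = (uint64_t)sclk_hz * tref_ms / 1000u / rows;
    return (uint32_t)cycles_per_row - (tras + trp);
}
```

With 0x91114d this gives CL = 3, tRAS = 5, tRP = 2, tRCD = 2 and tWR = 2. At SCLK = 100 MHz the divider works out to 781 - 7 = 774 = 0x306, which matches CONFIG_EBIU_SDRRC_VAL. The refresh divider is only right for the SCLK it was computed for, which matters for any test that changes SCLK.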

 

In an effort to more thoroughly (and quickly) test the changes we have been making, I enabled the POST memory test in U-Boot.  After fiddling with including some header files to get it to compile, I got it to run, then changed the order of the tests so it goes from slowest to fastest rather than the other way around.  On the two designs with short busses, I get the following output:

 

U-Boot 1.1.6-svn74 (ADI-2008R2-pre) (Mar 12 2009 - 14:54:24)

 

CPU:   ADSP bf537-0.2 (Detected Rev: 0.3)

Board: Link Communications, Inc. RLC-DSP4

       Support: http://www.link-comm.com/

Clock: VCO: 500 MHz, Core: 500 MHz, System: 100 MHz

RAM:   64 MB

 

CCLK-100Mhz SCLK- 25Mhz:    Writing...Reading...OK

CCLK-100Mhz SCLK- 50Mhz:    Writing...Reading...OK

CCLK-100Mhz SCLK-100Mhz:    Writing...Reading...OK

1-µ200Mhz SCLK- 40Mhz:    Writing...Reading...OK

CCLK-200Mhz SCLK- 50Mhz:    Writing...Reading...OK

CCLK-200Mhz SCLK-100Mhz:    Writing...Reading...OK

1-µ400Mhz SCLK- 50Mhz:    Writing...Reading...OK

CCLK-400Mhz SCLK- 80Mhz:    Writing...Reading...OK

CCLK-400Mhz SCLK-100Mhz:    Writing...Reading...OK

CCLK-500Mhz SCLK- 50Mhz:    Writing...Reading...OK

CCLK-500Mhz SCLK-100Mhz:    Writing...Reading...OK

CCLK-500Mhz SCLK-125Mhz:    Writing...Reading...OK

 

memory POST passed

 

I think that the garbled characters occur when it switches CCLK rates while printing messages out the serial port;  not a big deal.  When I run the same version of U-Boot on the design with longer busses, it prints the following:

 

U-Boot 1.1.6-svn74 (ADI-2008R2-pre) (Mar 12 2009 - 14:54:24)

 

CPU:   ADSP bf537-0.2 (Detected Rev: 0.2)

Board: Link Communications, Inc. RLC-DSP4

       Support: http://www.link-comm.com/

Clock: VCO: 500 MHz, Core: 500 MHz, System: 100 MHz

RAM:   64 MB

x€xx€xx€xxxxxx€€ø€xøøxøøxxxøxøx€øxøx€x€€xx€x€øxxxxxxøxxx€x€øxøx€øxøxxx€€xøxxxxx€x€øxøx€x€xx€xø€xxxxxxxxx€xxxx€x€ø€xxxxxx

[snipped about 20 more lines of gibberish]

 

Looking at the serial output with an oscilloscope, I discovered that the baud rate was changing from 115200 baud to about 28800 baud. Setting a terminal program for that baud rate, I was able to capture the following:

 

Ack! Something bad happened to the Blackfin!

 

SEQUENCER STATUS:

SEQSTAT: 00000021  IPEND: 3fc00b2  SYSCFG: 0032

  HWERRCAUSE: 0x0

  EXCAUSE   : 0x21

  physical IVG7 asserted : <0x03fc0470> { _evt_default + 0x0 }

RETE: <0x87104c40> { ___smulsi3_highpart + 0x831313c4 }

RETN: <0x0042887a> /* unknown address */

RETX: <0x03fd079c> { _post_init_pll + 0x30 }

RETS: <0x03fd0992> { _memory_post_test + 0x7a }

PC  : <0x03fc00b2> { _start + 0xb2 }

DCPLB_FAULT_ADDR: <0xffc0000c> { __etext_l1 + 0x1fffd8 }

ICPLB_FAULT_ADDR: <0x03fd079c> { _post_init_pll + 0x30 }

 

PROCESSOR STATE:

R0 : 00000022    R1 : 00000004    R2 : 0000ffbf    R3 : 00000004

R4 : 00000001    R5 : 00000003    R6 : 00000002    R7 : 017d7840

P0 : ffc00424    P1 : 00000058    P2 : ffc0000c    P3 : 03f5bedc

P4 : 0000000b    P5 : 03f5c000    FP : 03f5bf74    SP : 03f5bd30

LB0: 03fd3844    LT0: 03fd3838    LC0: 00000000

LB1: 03fce12a    LT1: 03fce128    LC1: 00000000

B0 : dfab8fcd    L0 : 00000000    M0 : 00000000    I0 : 0001c200

B1 : 6d72dae4    L1 : 00000000    M1 : 00000000    I1 : 00000400

B2 : fc6b79c6    L2 : 00000000    M2 : ff807ffc    I2 : e08ae4ca

B3 : bf7cc0e2    L3 : 00000000    M3 : 00000000    I3 : d8e8c6d4

A0.w: 00000040   A0.x: 00000000   A1.w: 00000040   A1.x: 00000000

USP : a8aae4c2  ASTAT: 00000000

 

Hardware Trace:

   0 Target : <0x03fc0940> { _bfin_panic + 0x0 }

     Source : <0x03fc0b14> { _trap_c + 0x198 }

   1 Target : <0x03fc0b0a> { _trap_c + 0x18e }

     Source : <0x03fc0996> { _trap_c + 0x1a }

   2 Target : <0x03fc097c> { _trap_c + 0x0 }

     Source : <0x03fc0416> { _trap + 0x56 }

   3 Target : <0x03fc03c0> { _trap + 0x0 }

     Source : <0x03fd079a> { _post_init_pll + 0x2e }

 

Please reset the board

 

### ERROR ### Please RESET the board ###
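
The shift from 115200 baud to about 28800 baud can be checked with a little arithmetic. On the Blackfin UART the baud rate is SCLK / (16 * divisor), so a divisor programmed for one SCLK produces a different rate once SCLK changes. The sketch below assumes the divisor was computed with truncating integer division; that is a guess about the driver, and the function names are hypothetical.

```c
#include <stdint.h>

/* Divisor the UART would need for a requested baud rate at a given SCLK.
 * Truncating division is an assumption; a driver may round instead. */
uint32_t uart_divisor(uint32_t sclk_hz, uint32_t baud)
{
    return sclk_hz / (16u * baud);
}

/* Baud rate actually produced by a fixed divisor at a given SCLK. */
uint32_t uart_actual_baud(uint32_t sclk_hz, uint32_t divisor)
{
    return sclk_hz / (16u * divisor);
}
```

At SCLK = 100 MHz the divisor for 115200 baud comes out as 54. If SCLK then drops to 25 MHz with the divisor left at 54, the UART runs at 28935 baud, which is close to the rate seen on the oscilloscope. That would be consistent with the crash message being printed after the first POST step (CCLK 100 MHz, SCLK 25 MHz) had already changed SCLK, though the dump alone does not prove it.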

 

 

 

The fact that this problem occurs only on one hardware design (we have tested at least four units of that design, with similar results) originally made me think that the memory interface on that design might still be marginal, but I can't get it to fail under any other conditions.  I copied the guts of the POST memory test and ran them before the routines that modify the clock rates, and I couldn't get them to fail.  I tried some other tests to stress the memory;  they work fine too.  So now I am wondering if there is something about that hardware design that prevents the code used to change the clock rates from working.  It appears from the crash dump that the error occurs while executing the post_init_pll() function, which is pretty simple:

 

void post_init_pll(int mult, int div)

{

    *pSIC_IWR = 0x01;

    *pPLL_CTL = (mult << 9);

    *pPLL_DIV = div;                 // printing messages stops after this, either because of

 

                                     // the crash or because the baud rate changes

    asm("CLI R2;");

    asm("IDLE;");

    asm("STI R2;");

    while (!(*pPLL_STAT & 0x20)) ;

}
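
To relate the mult and div arguments to the CCLK/SCLK pairs in the POST output, here is a rough sketch of the clock arithmetic. It assumes mult is the MSEL field (the "mult << 9" above), that div is the raw PLL_DIV value with SSEL in bits 3:0 and CSEL in bits 5:4, and that the DF bit is 0. The 25 MHz CLKIN used in the example is an assumption (it is what the BF537 Stamp uses, not something stated here), and the names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t vco_hz;
    uint32_t cclk_hz;
    uint32_t sclk_hz;
} bfin_clocks;

/* Clock rates implied by the two arguments of post_init_pll().
 * Assumes DF = 0 and mult != 0 (MSEL = 0 has a special meaning in hardware). */
bfin_clocks clocks_from_pll(uint32_t clkin_hz, unsigned mult, unsigned div)
{
    bfin_clocks c;
    unsigned ssel = div & 0xF;         /* system clock divider           */
    unsigned csel = (div >> 4) & 0x3;  /* core clock divider is 2^CSEL   */

    c.vco_hz  = clkin_hz * mult;
    c.cclk_hz = c.vco_hz >> csel;
    c.sclk_hz = ssel ? c.vco_hz / ssel : 0;  /* SSEL = 0 is not a valid setting */
    return c;
}
```

With a 25 MHz CLKIN, mult = 20 and div = 5 give VCO 500 MHz, CCLK 500 MHz and SCLK 100 MHz, matching the "Clock:" line in the boot banner, while mult = 4 and div = 4 give CCLK 100 MHz and SCLK 25 MHz, the first POST step.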

 

I added some print statements to try to determine how far it gets, but the changing clock rates make it hard to tell.  I haven't tried to decipher the disassembled output;  that would be slow going for me.  I have two questions that I would love to have some input about:

 

#1:  Is it likely that the crash is due to memory problems rather than something PLL related?

 

#2:  If it is a PLL issue, is it likely to cause trouble in the field, or can we safely ignore it?  We don't ever adjust the clock rate or go to any low power modes.
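
One piece of evidence bearing on question #1 is the exception cause in the dump. The sketch below extracts EXCAUSE and HWERRCAUSE from SEQSTAT and names a few common causes. The field positions and the cause table are my reading of the Blackfin programming reference, the table is deliberately incomplete, and the function names are hypothetical.

```c
#include <stdint.h>

/* EXCAUSE is assumed to be SEQSTAT bits 5:0, HWERRCAUSE bits 18:14. */
unsigned seqstat_excause(uint32_t seqstat)
{
    return seqstat & 0x3F;
}

unsigned seqstat_hwerrcause(uint32_t seqstat)
{
    return (seqstat >> 14) & 0x1F;
}

/* Partial lookup of exception causes; anything else is left unnamed. */
const char *excause_name(unsigned excause)
{
    switch (excause) {
    case 0x21: return "undefined instruction";
    case 0x22: return "illegal instruction combination";
    case 0x23: return "data access CPLB protection violation";
    case 0x24: return "data access misaligned address violation";
    case 0x26: return "data access CPLB miss";
    case 0x2A: return "instruction fetch misaligned address violation";
    case 0x2B: return "instruction fetch CPLB protection violation";
    case 0x2C: return "instruction fetch CPLB miss";
    case 0x2E: return "illegal use of supervisor resource";
    default:   return "not listed in this sketch";
    }
}
```

For SEQSTAT = 0x00000021 this reports EXCAUSE 0x21 (undefined instruction) with no hardware error cause. RETX points at _post_init_pll + 0x30, and that address (0x03fd079c) lies inside the 64 MB SDRAM range, so the offending opcode was presumably fetched from SDRAM around the time the clocks were changing. The dump by itself does not say whether the opcode was corrupted by the memory interface or by something else.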

 

Thanks for any opinions, pointers, etc.

 

Steve
