2009-03-31 11:08:55 Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 71887
Hello
I'm facing a strange problem and I don't know where to start. Let me explain:
I have a Linux system image (based on the 2008R1.5 release) which works great on a CM-BF537E board.
The problem is that when I load exactly the same system image on another CM-BF537E (same board and DSP revision), it doesn't work! The system boots fine and my application starts, but after a random time the DSP no longer sees any edge change on GPIO PF7, without any error anywhere (the edge is really there, I can see it with an oscilloscope). If I restart my application, sometimes it runs fine, sometimes it stops after a while, and sometimes it totally freezes the system!
At first I suspected a defective board, but out of 4 boards, 2 work fine and 2 show this strange behaviour.
Some details that could be useful:
They all have the same u-boot image (u-boot-cm-bf537e-bypass-2008R1.5), with the same u-boot environment. They all follow the same boot process: download the image over TFTP to 0x01000000, then bootm 0x01000000 with the same bootargs.
On one of the "defective" boards, I downloaded the same image twice at two different memory addresses and compared them with cmp.b; it found no difference. I have just started an mtest on the other "defective" board...
As I don't know what else to do, I will try to boot a "defective" board from flash memory and see the result. But if anybody has an idea of what the problem could be, or where I could start to search / debug, I'll be really happy to read his/her suggestions...
Thank you.
2009-03-31 13:56:27 Re: Different behaviour of the same system image on CM-BF537E
Robin Getz (UNITED STATES)
Message: 71893
Jean-Francois:
What version of DSP is on the boards, and what did you build the kernel for? (the kernel should print this out on start).
-Robin
2009-04-01 03:54:06 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 71929
Hello,
Thank you for your answer. All my boards have a rev 0.3 DSP, and the kernel is compiled for the 0.3 revision. It's true that the u-boot I'm using is compiled for the 0.2 revision, but I was told that it doesn't really matter.
But I think I'll move this thread to the hardware section, because out of 10 boards only one works well, and it is the one I developed on! Maybe the board I used for development is more tolerant of electromagnetic disturbance... To understand why I'm saying that, I have to explain a little more about how my application works:
A FIFO is continuously filled with digital data (for validation we use a simple counter, so it's easy to verify on the other side that we aren't missing any data). When the FIFO is half full, it pulls down a signal connected to a bfin GPIO; the bfin "wakes up", starts a PPI transfer of 4096x16 bits of data, and sends them over Ethernet. The bfin keeps reading from the PPI while the half-full signal is low. There is a timeout on the GPIO interrupt, because I have seen that Linux can sometimes miss one. So if a timeout occurs, the bfin reads the half-full state, and if it is low it starts a PPI transfer.
This method works great on the board I developed on: I've run more than 10 tests of more than 24 hours each without any error, at a data rate of about 6 MBytes per second.
On the other 9 boards, I observe 2 different behaviours:
-everything works well for N transfers, then the PPI DMA transfer never ends (or never raises the end-of-transfer interrupt). If I kill my application and restart it, it works for about the same number of transfers and stops again. This number of transfers depends on the board: some work for 50k transfers, others for millions before blocking,
-the second behaviour is totally different: the FIFO half-full GPIO interrupt never fires, so my application detects a timeout and then starts the transfer. The problem is that, due to the FIFO size, my application can only afford something like 3 or 4 successive timeouts before data is overwritten in the FIFO. So the application never stops, but the data is overwritten.
And sometimes, but it's really uncommon, the system just freezes.
I'm thinking about an EM problem because, on the board which works, we use a heatsink on the bfin, and if we don't connect it to electrical ground and touch it with a finger, we see the second behaviour I just described (no GPIO interrupt). And on one of the boards that runs then stops (1st behaviour), I installed a heatsink, and now it shows the second behaviour...
I'll try to slow down the system and core clocks to see if it changes anything. And I'll check all the bfin I/Os to see whether one could be an input that is not pulled up... But I'm really open to any suggestion before closing this thread and reopening it in the hardware section.
Thank you for taking the time to read the long complaint of a poor programmer (with bad English) who is discovering the electronics nightmare.
2009-04-01 10:17:09 Re: Different behaviour of the same system image on CM-BF537E
Robin Getz (UNITED STATES)
Message: 71959
Jean-Francois:
It is more likely that it is a SDRAM timing issue on your hardware.
I need to finish a few things, and then should be able to do a low level SDRAM tester this week for you.
-Robin
2009-04-01 11:26:22 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 71963
Thank you Robin but I hope I just found what the problem is...
In fact the PPI driver always requests PF8 as the FSYNC2 signal, but our design doesn't use it, so I think that EM pollution creates spurious signals on this pin. Indeed, if I don't request the PF8 pin in the PPI driver, it works for more than 1 hour on a board where it didn't work yesterday...
Luckily (or unluckily for my poor heart), the board I did the development on looks far less sensitive to EM pollution than the others, or maybe I just burned the PF8 input while "debugging" our hardware platform...
I'll run a test overnight on several boards and post the results here.
I hope this is my penultimate post to this thread!
Thank you again.
2009-04-02 06:25:34 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72036
Well, it looks like that was not my last post on this thread... My application stopped during the night; the difference is that it now stops after a couple of hours instead of a couple of minutes...
I'll review our hardware design with an electronics engineer, to be sure that there are no outputs connected together, or similar errors. But as I want to explore every lead as fast as possible, I would be really pleased to test the SDRAM with the application you were talking about in your last post.
Thank you.
2009-04-02 09:37:41 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72049
A new interesting lead: with a kernel compiled for any DSP revision, my application has been running fine on two CM-BF537E with DSP rev 0.2 for more than two hours now.
I've tried the same system on rev 0.3, and the PPI transfer locks up as usual.
So maybe there is a rev 0.3 hardware anomaly whose workaround is not implemented in the 2008r1.5 kernel? I'm downloading the SVN toolchain and Linux distribution to try it asap...
PS: I'm aware of anomaly #05000254 and I'm using the suggested workaround to generate my internal frame sync signal. The only other anomaly which applies to rev 0.3 only is #05000341, but it looks like it is handled in the 2008r1.5 kernel...
2009-04-02 09:48:05 Re: Different behaviour of the same system image on CM-BF537E
Mike Frysinger (UNITED STATES)
Message: 72051
or you hit another anomaly not yet found ...
2009-04-02 12:58:46 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72069
I found a big mistake in my problem description, sorry.
In fact the kernel I'm using is from SVN revision 5613 (2008-11-10); the rest (uClibc, standard user apps _AND_ toolchain) is 2008r1.5-rc3.
But I confirm that if compiled for the BF537E (any revision), it works only on rev 0.2 parts (for some hours now, on two different DSPs).
For SVN revision 6243 of the uClinux distribution (I'm compiling it with the SVN revision 3310 toolchain), I have not yet been able to port my own GPIO driver, since the GPIO interface has changed. Mike, could you please help me translate my driver?
With the old version, when opening a GPIO as an input (connected to an IRQ), we call:
1°) gpio_request
2°) gpio_direction_input
3°) request_irq
You can see my code attached (file "bfin_gpio_rev5613.c").
Now, for the SVN trunk, I think I just have to replace these 3 function calls with request_irq only, since I tried gpio_request and gpio_irq_request without success, but the "open" then freezes... (file "bfin_gpio_rev6243.c") and I can't find a fix in the dokuwiki or in the sample driver...
I'll run a test with DSP revisions 0.2 and 0.3 with distribution 2008r1.5-rc3, and I'll post the results tomorrow morning (French time).
If you need more information / code to see whether it could be an undiscovered anomaly, I'll be pleased to send anything you want.
bfin_gpio_rev6243.c
bfin_gpio_rev5613.c
bfin_gpio.h
2009-04-02 13:56:05 Re: Different behaviour of the same system image on CM-BF537E
Mike Frysinger (UNITED STATES)
Message: 72070
you're saying that attempting to open the device driver with latest trunk and bfin_gpio_rev6243.c causes your app to hang in the open() function ? does the whole system hang or just your app ? where in the open() function does it hang ?
i dont really see anything wrong in the driver in this regard, but i only gave it a quick look ...
you dont need to call set_gpio_{polar,edge} after request_irq as the flags you give to the latter function set up the polarity and such automatically
you also dont need to call SSYNC() as the gpio framework will call it for you when needed
i dont know why you commented out the spin lock free in the first error case, but that's not right. i also see what you're trying to do with the atomic open count field, but that does not prevent race conditions ... you need to grab the spin lock first before touching anything else ... same in the close() function ...
2009-04-03 03:18:39 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72086
It's the whole system that hangs when opening a GPIO. I know that the IRQ is properly requested, but I need to investigate further. I'll do more tests after correcting my driver as you suggested.
Regarding my main problem: 2 BF537E rev 0.2 ran my application all night without any problem...
2009-04-03 09:16:11 Re: Different behaviour of the same system image on CM-BF537E
Mike Frysinger (UNITED STATES)
Message: 72109
then you should sprinkle printk()'s in the open() func so you know where exactly it fails
2009-04-03 11:05:56 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72122
With the SVN trunk kernel, toolchain and userspace, and the kernel compiled for any DSP revision, all tests run with the same system image, loaded at the same SDRAM address, on the same hardware platform; the only part that changes is the CM-BF537E board. The result: 3 BF537E rev 0.3 run OK for more than 15 min, and 7 stop in less than 1 min. The only rev 0.3 DSP which ran OK with the previous kernel version still works, and the two rev 0.2 still work too.
Do you think it could be an undiscovered anomaly? Do you think that the SDRAM test proposed by Robin could be useful?
I'll try to exchange 4 rev 0.3 boards for rev 0.2 ones; I'm waiting for an answer from Bluetechnix.
I'm open to any suggestions to try to solve (or understand) my problem...
Regards.
2009-04-03 11:17:42 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72125
Mike, I forgot to ask you a question: do you know why I have to negate the GPIO state read with the "gpio_get_value()" function, whatever I set with "set_gpio_polar()" and "set_gpio_edge()"?
Moreover, with the new GPIO framework, I'm not able to change the IRQ trigger type with the "set_irq_type()" function. It looks like this function calls "bfin_gpio_request_irq()", since I can see these kernel messages:
bfin-gpio: gpio-irq57 trying to reserve GPIO 7 as gpio-irq
bfin-gpio: GPIO 7 is already reserved as gpio-irq !
when calling "set_irq_type()".
2009-04-05 11:40:30 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72177
Mike, forget my first question, I had just forgotten table 14-2 of the HRM...
I've compiled the kernel with the DEBUG_MMRS option, and when my application locks, if I cat the "/sys/kernel/debug/blackfin/Port I-O/PORTFIO" file, I see "0x0400". But as PF7 is configured as an interrupt on a falling edge, I'm expecting 0x0480, since with a scope I can see that the level on PF7 has gone low! It is as if the Blackfin hardware missed the falling edge...
But after a timeout, my application forces a read of the pin like this:
unsigned short edge = get_gpio_edge (minor);
set_gpio_edge (minor, 0U);
value = (get_gpio_polar (minor)) ? gpio_get_value (minor) : ! gpio_get_value (minor);
set_gpio_edge (minor, edge);
And it reads a 1 instead of the true level on the pin... What am I missing?
2009-04-05 14:45:55 Re: Different behaviour of the same system image on CM-BF537E
Mike Frysinger (UNITED STATES)
Message: 72180
why are you mucking with the edge/polarity ? why cant you just use gpio_get_value() ?
2009-04-06 10:48:23 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72260
Yes, I see that it's already done in the "gpio_get_value" function...
By the way, I'm now quite sure that my problem is DMA related. In fact, when the application stops, I can see that I'm stuck in the PPI read function, which is waiting for a DMA end-of-transfer interrupt that never occurs. But if I display the DMA0 registers (while blocked in the PPI read), here is what I read:
/sys/kernel/debug/blackfin/DMA Controller/DMA0_CONFIG = 0x00a7
/sys/kernel/debug/blackfin/DMA Controller/DMA0_CURR_X_COUNT = 0x0000
/sys/kernel/debug/blackfin/DMA Controller/DMA0_IRQ_STATUS = 0x0008
How is it possible to have the DMA_RUN bit set while current x count is equal to 0 ?
Here are the interrupt-related registers in this situation:
/sys/kernel/debug/blackfin/Interrupt Controller/ILAT = 0x00000000
/sys/kernel/debug/blackfin/Interrupt Controller/IMASK = 0x0000ffff
/sys/kernel/debug/blackfin/Interrupt Controller/IPEND = 0x00008000 (CAN Rx interrupt pending?)
/sys/kernel/debug/blackfin/Interrupt Controller/IPRIO = 0x00000000
and
/sys/kernel/debug/blackfin/System Interrupt Controller Register File/SIC_IAR0 = 0x22215000 <<< DMA0 to IVG8
/sys/kernel/debug/blackfin/System Interrupt Controller Register File/SIC_IMASK = 0x0802181c <<< DMA error masked (?), DMA0 enabled
/sys/kernel/debug/blackfin/System Interrupt Controller Register File/SIC_ISR = 0x00040000 <<< DMA0 interrupt deasserted
/sys/kernel/debug/blackfin/System Interrupt Controller Register File/SIC_RVECT = 0x2000
/sys/kernel/debug/blackfin/System Interrupt Controller Register File/SWRST = 0x0000
/sys/kernel/debug/blackfin/System Interrupt Controller Register File/SYSCR = 0x0000
Do you have any idea on what the problem could be?
2009-04-06 11:04:17 Re: Different behaviour of the same system image on CM-BF537E
Mike Frysinger (UNITED STATES)
Message: 72261
our gpio/ppi expert is on vacation atm ...
did you register the ppi error irq ? also, the DMA_RUN bit is a little wonky ... look at anomaly 05000119 to see what i mean.
2009-04-06 11:19:11 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72262
Yes, I did register the PPI error IRQ, and I never detect one. But for the DMA error, I can see that the interrupt is masked... Is that normal?
In my PPI driver, in the end-of-transfer handler, I check the DMA_ERR bit and increment a counter if it is set. But I never detect any either.
2009-04-06 11:24:14 Re: Different behaviour of the same system image on CM-BF537E
Mike Frysinger (UNITED STATES)
Message: 72263
if the irq isnt requested, then it's not surprising it'd be masked ...
you can see /proc/interrupts for all registered irqs
2009-04-06 12:14:23 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72266
Yes, that's a good reason...
Well, I've registered and handled it, but when the DMA transfer hangs, it doesn't raise a DMA error interrupt... To be continued...
2009-04-07 15:05:17 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72346
Here are the results of the last tests I've done:
1°) One of my co-workers just lent me a CM-BF537E with a rev 0.2 part on it. The application made 3 million DMA transfers before locking. The two rev 0.2 CM-BF537E on which the application works (for more than 24 hours) have an Analog Devices serial number of 958108.1; the one which was lent to me has a serial of 1346662.1. To be precise, on the rev 0.3 parts where the application locks, I have seen that many successful transfers only once; generally it locks after somewhere between a few hundred and a few thousand transfers.
2°) I've tried to isolate the problem by no longer sending the data over Ethernet, and by continuously polling the PF7 level instead of using it as an interrupt pin, without any success.
3°) I'm sure that, when my driver is waiting for the DMA transfer to end, CURR_X_COUNT is zero, but the IPEND bit corresponding to the IVG I routed the PPI DMA channel to (I've tried IVG7 too) is not asserted. I'm trying to figure out how the Linux interrupt layer works, but can you tell me whether, for any reason, the kernel could deassert the IPEND bit for me, or whether it is my IRQ handler that does it when called?
Before I try to write the same application under VDK (which will take me a long time as I have never used it), do you have any ideas, or a workaround I could use?
I'm posting my PPI driver, if you have the time to take a look. I added a gptimers function to arch/blackfin/kernel/gptimers.c because of anomaly #05000254, but if I look here: blackfin.uclinux.org/gf/project/uclinux-dist/tracker/?action=TrackerItemEdit&tracker_id=144&tracker_item_id=4711
it sounds like you (Mike) do not agree with my solution. Could you please explain where I am wrong? And do you think it could be related to the problem I'm facing?
ppi_test.c
bfin_ppi.c
bfin_ppi.h
2009-04-07 15:06:36 Re: Different behaviour of the same system image on CM-BF537E
Jean-Francois Argentino (FRANCE)
Message: 72348
Sorry, I just forgot the gptimers.c diff file:
gptimers.c.diff