[#5430] SPI kernel lockup
Submitted By: Nathan Whittington
Open Date: 2009-08-12 13:59:12
Close Date: 2009-08-27 02:16:10
Priority: Medium
Assignee: Robin Getz
Status: Closed
Fixed In Release: N/A
Found In Release: N/A
Release:
Category: Kernel Functions
Board: STAMP
Processor: ALL
Silicon Revision:
Is this bug repeatable?: Yes
Resolution: Fixed
Uboot version or rev.:
Toolchain version or rev.: 2009
App binary format: N/A
Summary: SPI kernel lockup
Details:
In the kernel configuration enable Device Drivers > Character devices > Blackfin ADSP SPI ADC support (BFIN_SPI_ADC).
I've attached a simple program to demonstrate the bug. The board will fail randomly within a few seconds to a minute. The kernel locks up hard: the serial console is dead and the board does not respond to pings.
Changing the size of the 'address' buffer to 1 or 2 allows it to work so I suspect there's a problem with buffer boundaries.
This is using the 2009R1 kernel (commit 3348dc6bda184c7f79d2d54dbb53838adf07cd8c) and the 2009 toolchain on a BF537.
I've been running this code without a connection to a SPI device (write-only).
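For readers without the attachment, the kind of test described above can be sketched roughly as follows. This is not the attached spi_lockup.c; the function name spi_write_loop, the fill pattern, and the idea of passing the device node as a parameter are assumptions made for illustration, and the actual device path for the BFIN_SPI_ADC driver is not stated in this report.

```c
/* Hypothetical reproducer sketch; not the attached spi_lockup.c. */
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/*
 * Open `path` write-only and write a `len`-byte buffer `iterations` times.
 * Returns the number of writes that transferred all `len` bytes, or -1 if
 * the device could not be opened or `len` is out of range.
 */
long spi_write_loop(const char *path, size_t len, long iterations)
{
    unsigned char address[1024];   /* stands in for the 'address' buffer */
    long ok = 0;
    int fd;

    if (len == 0 || len > sizeof(address))
        return -1;
    memset(address, 0xA5, len);    /* arbitrary pattern; nothing is read back */

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    for (long i = 0; i < iterations; i++) {
        if (write(fd, address, len) == (ssize_t)len)
            ok++;
    }
    close(fd);
    return ok;
}
```

Calling it with a length of 3 and a large iteration count would correspond to the failing case described above; lengths of 1 or 2 correspond to the cases that appeared to work.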
Follow-ups
--- Mike Frysinger 2009-08-12 19:09:09
if you only need to write a few bytes at a time, why don't you use the spidev
driver?
--- Nathan Whittington 2009-08-13 11:51:33
This test was distilled down from a lcd driver I wrote for the 2008 release
(which was very similar to the bfin adc driver). Our app requires writing a 3
byte address followed by a 1k framebuffer.
This is a regression; it worked fine with the 2008 kernel but is broken with
the latest 2009 stuff.
I'll look into modifying it to use spidev.
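As a rough illustration of what the spidev route could look like for this access pattern (a 3 byte address followed by a framebuffer), the sketch below builds one SPI message from two transfers and sends it with the SPI_IOC_MESSAGE ioctl from linux/spi/spidev.h. The helper names lcd_fill_transfers and lcd_send are hypothetical, and bus speed and mode are assumed to be configured elsewhere.

```c
/* Sketch only: assumes a spidev node and Linux kernel headers are available. */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/spi/spidev.h>

/*
 * Describe one SPI message made of two transfers: the address bytes,
 * then the framebuffer. Both are write-only, so rx_buf stays 0.
 */
void lcd_fill_transfers(struct spi_ioc_transfer xfer[2],
                        const uint8_t *addr, uint32_t addr_len,
                        const uint8_t *fb, uint32_t fb_len)
{
    memset(xfer, 0, 2 * sizeof(xfer[0]));
    xfer[0].tx_buf = (uintptr_t)addr;
    xfer[0].len = addr_len;
    xfer[1].tx_buf = (uintptr_t)fb;
    xfer[1].len = fb_len;
}

/*
 * Send both transfers with one ioctl; `fd` is an open spidev descriptor.
 * Returns the total byte count on success, or -1 with errno set.
 */
int lcd_send(int fd, const uint8_t *addr, const uint8_t *fb, uint32_t fb_len)
{
    struct spi_ioc_transfer xfer[2];

    lcd_fill_transfers(xfer, addr, 3, fb, fb_len);
    return ioctl(fd, SPI_IOC_MESSAGE(2), xfer);
}
```

Putting both transfers in a single message keeps chip select asserted across the address and the data unless cs_change is set, which is usually what an address-then-payload protocol wants.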
--- Mike Frysinger 2009-08-13 14:27:38
i'm not saying this isn't a bug, i'm just suggesting you move to a common driver
that is already in mainline rather than use our simple one
--- Nathan Whittington 2009-08-20 12:03:59
Thanks for the suggestion Mike, I got the spidev driver working and it seems
more stable.
Before making the switch I did some more testing:
The problem isn't related to DMA; I was able to get a crash both when using the
adc driver method and a simple kernel spi_write() call.
I was mistaken in saying it ran ok when writing a 2 byte buffer. Writing a 2
byte buffer fails after a long time (1 hour+). Writing a 3 byte buffer using
the attached code typically fails in a few seconds. Writing a 4 byte, 512 byte
or 1024 byte buffer fails after varying periods of time.
--- Cameron Barfield 2009-08-24 11:56:20
It seemed to be more stable while using spidev, but I think it actually just
became more random.
Friday afternoon I ran a bunch of quick tests with soft reboots and power
cycles and everything seemed fine. I set up a long test over the weekend and the
kernel locked up, same as before.
--- Nathan Whittington 2009-08-24 14:20:04
After switching my app to use the spidev driver it ran for 2 full days without
problems. I can't remember it ever running more than 2 hours previously. I
called the problem fixed but when we started to package up the product, it
failed again and then I saw two more failures after warm and cold boots.
This morning I saw one failure with the spidev driver after my first warm
restart and I've since warm and cold booted the board ~40 times over the course
of 6 hours with no failures.
Changing the speed of the bus doesn't help.
We've seen the issue on at least 4 different pieces of hardware.
Enabling the SPI debug option in the kernel config doesn't give any more
information.
I've gone back and verified that the 2008 release works - it ran all weekend.
We unfortunately can't just go back and use the 2008 kernel because of SPORT bus
fixes introduced in 2009.
I tried building several intermediate commits but the four different versions I
tried all failed to build or to run for a variety of reasons.
This is a blocking issue for getting our product out the door but at this point
I'm out of ideas of stuff to try.
--- Yi Li 2009-08-24 22:05:36
Is it possible to reproduce this bug on BF537-STAMP? If so, could you attach the
kernel configuration file here?
--- Yi Li 2009-08-24 22:29:04
OK - I can reproduce on BF537-STAMP (chip 0.2) now, running the test for about 1
minute.
--- Yi Li 2009-08-24 22:55:04
Found the cause. Here is a workaround (not a fix): you can turn off
"CONFIG_EXACT_HWERR". I ran the test for half an hour and it works OK.
(This option (and similar options for debugging) is useful for development. But
in final products, they need to be turned off for better performance).
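For reference, with the option turned off the generated kernel .config should contain the standard Kconfig "not set" line for this symbol. Where the option sits in the configuration menus is not stated here, so only the resulting line is shown:

```
# CONFIG_EXACT_HWERR is not set
```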
Using JTAG it would be easy to locate where kernel hangs:
arch/blackfin/include/asm/entry.h:
#ifndef CONFIG_EXACT_HWERR
#define TIMER_INTERRUPT_ENTRY(N) \
[--sp] = SYSCFG; \
[--sp] = P0; /*orig_p0*/ \
[--sp] = R0; /*orig_r0*/ \
[--sp] = (R7:0,P5:0); \
p0.l = lo(IPEND); \
p0.h = hi(IPEND); \
r1 = [p0]; \
R0 = (N); \
jump __common_int_entry;
#else /* CONFIG_EXACT_HWERR is defined */
#define TIMER_INTERRUPT_ENTRY(N) \
[--sp] = SYSCFG; \
[--sp] = P0; /*orig_p0*/ \
[--sp] = R0; /*orig_r0*/ \
[--sp] = (R7:0,P5:0); \
R1 = ASTAT; \
P0.L = LO(ILAT); \
P0.H = HI(ILAT); \
SSYNC; /* <-- kernel hangs here */ \
R0 = [P0]; \
CC = BITTST(R0, EVT_IVHW_P); \
IF CC JUMP 1f; \
ASTAT = R1; \
p0.l = lo(IPEND); \
p0.h = hi(IPEND); \
r1 = [p0]; \
R0 = (N); \
jump __common_int_entry; \
1: ASTAT = R1; \
RAISE N; \
(R7:0, P5:0) = [SP++]; \
SP += 0x8; \
SYSCFG = [SP++]; \
RTI;
#endif /* CONFIG_EXACT_HWERR */
By default CONFIG_EXACT_HWERR is turned on, and the kernel hangs at "SSYNC"
in TIMER_INTERRUPT_ENTRY(). This seems to be an anomaly, and we need to debug
it further. But for now, you can disable CONFIG_EXACT_HWERR.
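To make the extra logic in the CONFIG_EXACT_HWERR variant easier to follow, here is a small C model of the decision the macro makes after the SSYNC: it reads ILAT, tests the hardware-error bit, and either continues into the common interrupt entry or re-raises the interrupt and returns so the hardware error is taken first. This is only an illustrative model, not kernel code; the function and enum names are hypothetical, and the value 5 for EVT_IVHW_P is an assumption based on the usual Blackfin event numbering.

```c
#include <stdint.h>

/* Assumption: bit position of the hardware-error event (IVHW) in ILAT. */
#define EVT_IVHW_P 5

enum entry_action {
    ENTER_HANDLER,   /* jump to __common_int_entry */
    RERAISE_AND_RTI  /* RAISE N, restore registers, RTI */
};

/* Model of the BITTST / IF CC JUMP 1f pair in the macro above. */
enum entry_action exact_hwerr_decision(uint32_t ilat)
{
    return (ilat & (1u << EVT_IVHW_P)) ? RERAISE_AND_RTI : ENTER_HANDLER;
}
```

The hang reported above occurs before this decision is reached, at the SSYNC that precedes the ILAT read.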
What is the chip revision for your BF537?
--- Cameron Barfield 2009-08-25 12:49:44
Yi --
I'll try the work around.
One board we're using has a BF536 revision 0.2. Two other boards have a BF537
revision 0.2.
--- Robin Getz 2009-08-26 13:56:18
Hmm -- this smells like 283 -- I will add a workaround into the necessary places
(if you can test it out).
Looks like I will need to fix both INTERRUPT_ENTRY and TIMER_INTERRUPT_ENTRY
-Robin
--- Cameron Barfield 2009-08-26 14:08:36
283?
We'll do some testing for you.
I put in the workaround Yi came up with, and that seems solid.
--- Robin Getz 2009-08-26 15:36:04
283 refers to the anomaly number.
Like Yi said, for production testing you should turn off all the debug
options that we added. This release has more of them than normal, partly
because we got tired of the same questions on the forums and partly to help
track down some hard-to-find issues.
I'll commit something to trunk and the branch soon -- just testing now.
-Robin
--- Robin Getz 2009-08-26 16:55:13
Fix committed to branch and trunk.
If Yi can test it out - it would be great - I ran things here, and they worked,
but I could not re-create the issue.
-Robin
--- Yi Li 2009-08-27 00:37:41
I tested the fix using the "spi_lock.c" test case. It ran well for an
hour, so I think this issue is fixed.
Attached my configuration to reproduce this issue.
--- Robin Getz 2009-08-27 07:17:01
Thanks for testing.
Unless Nathan or Cameron have anything else -- I'll mark as closed.
--- Cameron Barfield 2009-08-27 12:15:06
Robin, give it a couple of days please. We ran a test case for 48 hours once
without failure. The next test we ran failed in 5 minutes.
--- Cameron Barfield 2009-08-27 12:23:44
One more thing, Robin. According to the git logs, you only changed 1 file in the
2009 branch, but 3 in the trunk. Is that correct?
--- Robin Getz 2009-08-27 14:48:59
That is because I only fixed things on the branch, (and left some duplicated
code).
On trunk I removed duplicated code to use the new macro I added.
-Robin
Files
File Name          File Type           File Size  Posted By
spi_lockup.c       text/x-csrc         434        Nathan Whittington
bugreport.tar.gz   application/x-gzip  18707      Yi Li