[#5430] SPI kernel lockup

Document created by Aaronwu Employee on Sep 4, 2013

Submitted By: Nathan Whittington

Open Date: 2009-08-12 13:59:12
Close Date: 2009-08-27 02:16:10
Priority: Medium
Assignee: Robin Getz
Status: Closed
Fixed In Release: N/A
Found In Release: N/A
Release:
Category: Kernel Functions
Board: STAMP
Processor: ALL
Silicon Revision:
Is this bug repeatable?: Yes
Resolution: Fixed
Uboot version or rev.:
Toolchain version or rev.: 2009
App binary format: N/A

Summary: SPI kernel lockup

Details:

 

In the kernel configuration enable Device Drivers > Character devices > Blackfin ADSP SPI ADC support (BFIN_SPI_ADC).

 

I've attached a simple program to demonstrate the bug.  The board fails at a random point, anywhere from a few seconds to a minute after starting.  The kernel locks up hard: the serial console is dead and the board does not respond to pings.

 

Changing the size of the 'address' buffer to 1 or 2 allows it to work so I suspect there's a problem with buffer boundaries. 

 

This is using the 2009R1 kernel (commit 3348dc6bda184c7f79d2d54dbb53838adf07cd8c) and the 2009 toolchain on a BF537.

 

I've been running this code without a connection to a SPI device (write-only).
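
The attached spi_lockup.c is not reproduced in this report. As a rough sketch of the kind of test described above, assuming the driver exposes a character device that accepts plain write() calls, a loop like the following repeatedly writes a 3-byte 'address' buffer. The function name write_loop and the caller-supplied file descriptor are illustrative assumptions, not taken from the attachment:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical reproducer helper (NOT the attached spi_lockup.c).
 * Writes `len` bytes from `buf` to `fd`, `count` times in a row.
 * Stops at the first short or failed write and returns the number
 * of writes that completed in full.
 */
static long write_loop(int fd, const unsigned char *buf, size_t len, long count)
{
    long done = 0;

    for (long i = 0; i < count; i++) {
        if (write(fd, buf, len) != (ssize_t)len)
            break;              /* error or short write: give up */
        done++;
    }
    return done;
}
```

A caller would open the driver's device node (its path is not given in this report) with O_WRONLY, declare `unsigned char address[3]`, and call `write_loop(fd, address, sizeof address, n)` repeatedly until the board locks up.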

 

 

Follow-ups

 

--- Mike Frysinger                                           2009-08-12 19:09:09

If you only need to write a few bytes at a time, why don't you use the spidev driver?

 

--- Nathan Whittington                                       2009-08-13 11:51:33

This test was distilled down from a lcd driver I wrote for the 2008 release

(which was very similar to the bfin adc driver).  Our app requires writing a 3

byte address followed by a 1k framebuffer.

 

This is a regression; it worked fine with the 2008 kernel but is broken with

the latest 2009 stuff.

 

I'll look into modifying it to use spidev.
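
For reference, a minimal sketch of how the transfer described above (a 3-byte address followed by a 1k framebuffer) could be issued through spidev as a single message. It uses the standard SPI_IOC_MESSAGE ioctl from <linux/spi/spidev.h>; the helper names, buffer sizes as macros, and the write-only setup (no receive buffer) are assumptions for illustration, not the code actually used in this report:

```c
#include <linux/spi/spidev.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

#define ADDR_LEN 3      /* 3-byte address, as described in the report */
#define FB_LEN   1024   /* 1k framebuffer */

/* Describe two back-to-back transfers: the address, then the framebuffer. */
static void fill_transfers(struct spi_ioc_transfer xfer[2],
                           const uint8_t *addr, const uint8_t *fb)
{
    memset(xfer, 0, 2 * sizeof(*xfer));     /* rx_buf = 0: write-only */
    xfer[0].tx_buf = (uintptr_t)addr;
    xfer[0].len    = ADDR_LEN;
    xfer[1].tx_buf = (uintptr_t)fb;
    xfer[1].len    = FB_LEN;
}

/* Send both transfers as one SPI message; returns -1 on error. */
static int send_frame(int fd, const uint8_t *addr, const uint8_t *fb)
{
    struct spi_ioc_transfer xfer[2];

    fill_transfers(xfer, addr, fb);
    return ioctl(fd, SPI_IOC_MESSAGE(2), xfer);
}
```

Here `fd` would be an open spidev node; the exact node name depends on the bus and chip-select numbers in the board file, which this report does not state.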

 

 

--- Mike Frysinger                                           2009-08-13 14:27:38

I'm not saying this isn't a bug; I'm just suggesting you move to a common driver that is already in mainline rather than use our simple one.

 

--- Nathan Whittington                                       2009-08-20 12:03:59

Thanks for the suggestion Mike, I got the spidev driver working and it seems

more stable.

 

Before making the switch I did some more testing:

 

The problem isn't related to DMA; I was able to get a crash both when using the

adc driver method and a simple kernel spi_write() call. 

 

I was mistaken in saying it ran ok when writing a 2 byte buffer.  Writing a 2

byte buffer fails after a long time (1 hour+).  Writing a 3 byte buffer using

the attached code typically fails in a few seconds.  Writing a 4 byte, 512 byte

or 1024 byte buffer fails after varying periods of time. 

 

--- Cameron Barfield                                         2009-08-24 11:56:20

It seemed to be more stable while using spidev, but I think it actually just

became more random.

 

Friday afternoon I ran a bunch of quick tests with soft reboots and power cycles and everything seemed fine. I set up a long test over the weekend and the kernel locked up, same as before.

 

--- Nathan Whittington                                       2009-08-24 14:20:04

After switching my app to use the spidev driver it ran for 2 full days without

problems.  I can't remember it ever running more than 2 hours previously.  I

called the problem fixed but when we started to package up the product, it

failed again and then I saw two more failures after warm and cold boots. 

 

This morning I saw one failure with the spidev driver after my first warm

restart and I've since warm and cold booted the board ~40 times over the course

of 6 hours with no failures. 

 

Changing the speed of the bus doesn't help.

 

We've seen the issue on at least 4 different pieces of hardware.

 

Enabling the SPI debug option in the kernel config doesn't give any more

information. 

 

I've gone back and verified that the 2008 release works - it ran all weekend.

We unfortunately can't just go back and use the 2008 kernel because of SPORT bus

fixes introduced in 2009.

 

I tried building several intermediate commits but the four different versions I

tried all failed to build or to run for a variety of reasons.

 

This is a blocking issue for getting our product out the door but at this point

I'm out of ideas of stuff to try. 

 

 

--- Yi Li                                                    2009-08-24 22:05:36

Is it possible to reproduce this bug on BF537-STAMP? If so, could you attach the

kernel configuration file here?

 

--- Yi Li                                                    2009-08-24 22:29:04

OK - I can now reproduce it on BF537-STAMP (chip revision 0.2) after running the test for about 1 minute.

 

--- Yi Li                                                    2009-08-24 22:55:04

Found the cause. Here is a workaround (not a fix): turn off "CONFIG_EXACT_HWERR". I ran the test for half an hour and it worked OK.

(This option, like similar debugging options, is useful for development, but in final products they should be turned off for better performance.)

 

Using JTAG, it is easy to locate where the kernel hangs:

 

arch/blackfin/include/asm/entry.h:

 

#ifndef CONFIG_EXACT_HWERR

#define TIMER_INTERRUPT_ENTRY(N)                                        \
    [--sp] = SYSCFG;                                                    \
    [--sp] = P0;        /*orig_p0*/                                     \
    [--sp] = R0;        /*orig_r0*/                                     \
    [--sp] = (R7:0,P5:0);                                               \
    p0.l = lo(IPEND);                                                   \
    p0.h = hi(IPEND);                                                   \
    r1 = [p0];                                                          \
    R0 = (N);                                                           \
    jump __common_int_entry;

#else /* CONFIG_EXACT_HWERR is defined */

#define TIMER_INTERRUPT_ENTRY(N)                                        \
    [--sp] = SYSCFG;                                                    \
    [--sp] = P0;        /*orig_p0*/                                     \
    [--sp] = R0;        /*orig_r0*/                                     \
    [--sp] = (R7:0,P5:0);                                               \
    R1 = ASTAT;                                                         \
    P0.L = LO(ILAT);                                                    \
    P0.H = HI(ILAT);                                                    \
    SSYNC;              /* <-- kernel hangs here */                     \
    R0 = [P0];                                                          \
    CC = BITTST(R0, EVT_IVHW_P);                                        \
    IF CC JUMP 1f;                                                      \
    ASTAT = R1;                                                         \
    p0.l = lo(IPEND);                                                   \
    p0.h = hi(IPEND);                                                   \
    r1 = [p0];                                                          \
    R0 = (N);                                                           \
    jump __common_int_entry;                                            \
1:  ASTAT = R1;                                                         \
    RAISE N;                                                            \
    (R7:0, P5:0) = [SP++];                                              \
    SP += 0x8;                                                          \
    SYSCFG = [SP++];                                                    \
    RTI;

#endif  /* CONFIG_EXACT_HWERR */

 

CONFIG_EXACT_HWERR is turned on by default, and with it enabled the kernel hangs at "SSYNC" in TIMER_INTERRUPT_ENTRY(). This seems to be an anomaly, which we need to debug further. For now, you can disable CONFIG_EXACT_HWERR.
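
As a small illustration of the workaround: in a kernel .config file, a disabled boolean option is recorded as a comment line, so with the option turned off the configuration would be expected to contain the line below. This is only a sketch of the resulting entry; where the option sits in the configuration menus is not stated in this thread.

```
# CONFIG_EXACT_HWERR is not set
```

The kernel has to be rebuilt after the change for it to take effect.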

 

What is the chip revision for your BF537?

 

--- Cameron Barfield                                         2009-08-25 12:49:44

Yi --

 

I'll try the workaround.

 

One board we're using has a BF536 revision 0.2. Two other boards have a BF537

revision 0.2.

 

--- Robin Getz                                               2009-08-26 13:56:18

Hmm -- this smells like 283 -- I will add a workaround into the necessary places

(if you can test it out).

 

Looks like I will need to fix both INTERRUPT_ENTRY and TIMER_INTERRUPT_ENTRY

 

-Robin

 

--- Cameron Barfield                                         2009-08-26 14:08:36

283?

 

We'll do some testing for you.

 

I put in the workaround Yi came up with, and that seems solid.

 

--- Robin Getz                                               2009-08-26 15:36:04

283 refers to the anomaly number.

 

Like Yi said, for production testing you should turn off all the debug options that we added. (There were more than normal in this release, a combination of getting tired of the same questions on the forums and tracking down some hard-to-find issues.)

 

I'll commit something to trunk and the branch soon -- just testing now.

 

-Robin

 

--- Robin Getz                                               2009-08-26 16:55:13

Fix committed to branch and trunk.

 

It would be great if Yi could test it out. I ran things here and they worked, but I could not re-create the issue.

 

-Robin

 

--- Yi Li                                                    2009-08-27 00:37:41

I tested the fix using the "spi_lockup.c" test case. It ran well for an hour, so I think this issue is fixed.

 

Attached my configuration to reproduce this issue.

 

--- Robin Getz                                               2009-08-27 07:17:01

Thanks for testing.

 

Unless Nathan or Cameron have anything else -- I'll mark as closed.

 

--- Cameron Barfield                                         2009-08-27 12:15:06

Robin, give it a couple of days please. We ran a test case for 48 hours once

without failure. The next test we ran failed in 5 minutes.

 

--- Cameron Barfield                                         2009-08-27 12:23:44

One more thing, Robin. According to the git logs, you only changed 1 file in the

2009 branch, but 3 in the trunk. Is that correct?

 

--- Robin Getz                                               2009-08-27 14:48:59

That is because on the branch I only fixed things (and left some duplicated code).

 

On trunk I removed duplicated code to use the new macro I added.

 

-Robin

 

 

 

    Files


 

File Name          File Type            File Size   Posted By
spi_lockup.c       text/x-csrc          434         Nathan Whittington
bugreport.tar.gz   application/x-gzip   18707       Yi Li
