2008-08-05 11:16:29     MPU protection maturity

Document created by Aaronwu Employee on Aug 7, 2013

2008-08-05 11:16:29     MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 59944   

 

Hi there,

 

I checked out the trunk kernel from SVN, and built with MPU protection enabled to see if it would catch anything.  So far I've found a couple of problems, but they look like maybe the MPU protection has malfunctioned?

 

For example, in one report 'sort' hit the CPLB for an illegal instruction fetch on what looks like the return from a system call.  After this, one of our applications also faulted on return from a system call after handling an interrupt from a device.  Oddly, DCPLB_FAULT_ADDR = <0xff8016a4> for both of the apps that hit the CPLB.  The traces for sort and for the second app to fail, l1l2, are attached.

 

What is the most likely cause of the problem, or what can I do to track it down?  Is MPU protection in SVN ready for prime time?

 

Regards,

 

Mike

 

 

 

cplb-sort.txt

cplb-l1l2.txt


 

 

2008-08-05 11:27:21     Re: MPU protection maturity

Mike Frysinger (UNITED STATES)

Message: 59949   

 

MPU in the 2008R1 release is not usable.  however, it should be stable in trunk (and maybe the 2008R1 branch).  if trunk is crashing, we should def fix it.

 

i dont suppose you have a reduced test case that doesnt involve a custom driver so that we can debug things on our side ?

 

i'm assuming of course that your hardware is actually recent enough and you arent running on older silicon with unusable anomalies ...


 

 

2008-08-05 12:24:49     Re: MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 59952   

 

Hi,

 

Cool - I took the SVN code since the 2008R1.5 release notes suggested it may be stable.

 

> i dont suppose you have a reduced test case that doesnt

> involve a custom driver so that we can debug things on our side ?

 

I've been trying to reduce the system as much as possible, but I think it is unlikely I can get to a 100 line or so example.  I know this makes it difficult for you guys, but I'm struggling too!

 

That said, the fpgairq driver listed in the l1l2 trace is based on ADI GPL'd code and is small (~300 lines).  It has an interrupt handler thus:

 

static irqreturn_t fpgairq_handler(int irq, void *p)
{
    /* Disable this interrupt */
    disable_irq_nosync(irq);

    /* Enable any blocked task to run */
    fpgairq_waitcond = 1;
    wake_up_interruptible(&fpgairq_waitq);

    return IRQ_HANDLED;
}

 

 

And a read function that basically blocks for the interrupt but returns no data and disables the source (it leaves it up to the user space app to do some work and to then cause the external interrupt source to be cleared via GPIOs):

 

static ssize_t fpgairq_read(struct file *filp, char *buf, size_t size, loff_t * offp)
{
    const int irq = gpio_to_irq(PIN_IRQ);

    if(MINOR(filp->f_dentry->d_inode->i_rdev) != 0)
        return -ENODEV;

    if(size <= 0)
        return -EMSGSIZE;

    if(!access_ok(VERIFY_WRITE, buf, size))
        return -EFAULT;

    /* Check if the interrupt line is already asserted.
     *  If it is, don't sleep and wait for the next interrupt;
     *  simply return without any rescheduling or switching.
     */
    if(get_gpio_data(PIN_IRQ) == 0)
    {
        /* Clear the wait condition */
        fpgairq_waitcond = 0;

        /* Enable the interrupt source */
        enable_irq(irq);

        /* Now wait for an interrupt */
        wait_event_interruptible(fpgairq_waitq, fpgairq_waitcond != 0);

        /* The interrupt handler will have disabled the interrupt again */
    }

    return size;
}

 

The last part is the open function which sets up the interrupt as a level sensitive source.  It also juggles the local interrupts to ensure the enable and disable counts stay balanced even if the interrupt source is already asserted when starting things up:

 

static int fpgairq_open(struct inode *inode, struct file *filp)
{
    unsigned long irqsaveflags;    /* local_irq_save() expects an unsigned long */
    int irq;

    if(MINOR(inode->i_rdev) != 0)
    {
        printk(KERN_NOTICE "fpgairq: no dev\n");
        return -ENODEV;
    }

    /* Get the interrupt number for the GPIO */
    irq = gpio_to_irq(PIN_IRQ);
    if(irq < 0)
    {
        printk(KERN_NOTICE "fpgairq: gpio_to_irq() failed\n");
        return -EIO;
    }

    /* Disable all interrupts to prevent the target interrupt
     *  occurring before configuration is complete.
     */
    local_irq_save(irqsaveflags);

    /* Request required interrupt, which also allocates the GPIO */
    if(request_irq(irq, fpgairq_handler, IRQF_TRIGGER_HIGH, "fpgairq", NULL) < 0)
    {
        local_irq_restore(irqsaveflags);
        printk(KERN_NOTICE "fpgairq: request_irq() failed\n");
        return -EIO;
    }

    /* Disable the interrupt source until read() is called.
     *  This ensures the disables and enables balance.
     */
    disable_irq(irq);

    /* Restore interrupt state */
    local_irq_restore(irqsaveflags);

    /* Done */
    return 0;
}
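
The enable/disable bookkeeping across open(), read() and the handler can be sketched as a small userspace model. This is an illustration only: it assumes the usual Linux behaviour that disable_irq() and enable_irq() nest through a per-line depth counter, that the line is live only at depth zero, and that request_irq() leaves the line enabled. The struct irq_model type and all model_* names are hypothetical and appear in neither the driver nor the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of one interrupt line (not kernel code). */
struct irq_model {
    int depth;        /* nested disable count; line is live only at 0 */
    bool unbalanced;  /* set when enable is called with depth already 0 */
};

/* Models request_irq(): assumed to leave the line enabled (depth 0). */
static void model_request(struct irq_model *m)
{
    m->depth = 0;
    m->unbalanced = false;
}

/* Models disable_irq() / disable_irq_nosync(). */
static void model_disable(struct irq_model *m)
{
    m->depth++;
}

/* Models enable_irq(); an enable at depth 0 is recorded as unbalanced. */
static void model_enable(struct irq_model *m)
{
    if (m->depth == 0) {
        m->unbalanced = true;
        return;
    }
    m->depth--;
}

/* One read() call: line_asserted mirrors the GPIO check in fpgairq_read(). */
static void model_read(struct irq_model *m, bool line_asserted)
{
    if (line_asserted)
        return;           /* no sleep; the interrupt stays disabled */
    model_enable(m);      /* read(): enable_irq() */
    /* ... the task would sleep here until the handler fires ... */
    model_disable(m);     /* handler: disable_irq_nosync() */
}

/* open() then `reads` read() calls; returns final depth, or -1 if unbalanced. */
static int model_open_then_read(int reads, bool line_asserted)
{
    struct irq_model m;

    assert(reads >= 0);
    model_request(&m);    /* open(): request_irq() */
    model_disable(&m);    /* open(): disable_irq() */
    for (int i = 0; i < reads; i++)
        model_read(&m, line_asserted);
    return m.unbalanced ? -1 : m.depth;
}
```

In this model the depth is 1 after open() and returns to 1 after every read(), whether or not the line was already asserted, which is the balance the driver description aims for.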

 

I will post the complete driver source on request, but these three parts are the only things that get repeatedly executed, so I think other parts are irrelevant.

 

For silicon revision I'm using BF533 ver 0.5 which still has loads of errata, but hopefully is recent enough?  I think the hardware platform is okay as we didn't see such problems under 2006R1, although notably the MPU protection is new.  The apps do however run happily with mudflap or Valgrind.

 

Regards,

 

Mike


 

 

2008-08-05 17:39:10     Re: MPU protection maturity

Robin Getz (UNITED STATES)

Message: 59955   

 

Mike:

 

Hmm - your trace looks pretty weird - is this a stock kernel from here? or is there a kgdb or ADEOS patch (or anything else) on top?

 

Your trace says that the RETI goes off (which is fine) but that goes to -

 

_ex_dcplb_miss + 0x76

 

which should not happen. when processing cplb_miss - interrupts should be off - there isn't much that can be going on, except a stack corruption...

 

Did you try stack checking?

 

https://docs.blackfin.uclinux.org/doku.php?id=debuging_applications#stack_checking

 

-Robin

 

BTW - I fixed a mistake in the trace printout - can you svn up, and try again?


 

 

2008-08-05 23:02:13     Re: MPU protection maturity

Mike Frysinger (UNITED STATES)

Message: 59961   

 

ok, i looked through the current 2008R1 branch and it seems to have all the critical fixes, so it should be usable.  i just backported the few additional pieces from trunk, but none of those should affect you.


 

 

2008-08-06 04:36:40     Re: MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 59973   

 

> Hmm - your trace looks pretty weird - is this a stock kernel from here?

> or is there a kgdb or ADEOS patch (or anything else) on top?

 

It's from your SVN trunk with no patches applied.  I've just pulled in my local .config and added a GPIO IRQ driver.

 

> Your trace says that the RETI goes off (which is fine) but that goes to -

> _ex_dcplb_miss + 0x76

> which should not happen.

 

Right, particularly it's an illegal instruction fetch, but the address is decoded as being "sort + 0x227a", within the executable attempting to run.  Does the MPU split code and data in FLAT binaries?  If not, then this should be impossible, unless there is a bug in the CPLB handling?  I'll check on the next crash that the instruction address isn't in a data region.

 

> Did you try stack checking?

 

Yes, but it has not found any problems for me.  I've also set up the stacks to be quite large.

 

> BTW - I fixed a mistake in the trace printout - can you svn up, and try again?

 

Cool - I'm at SVN revision 5110 now.  I'll give it a go.

 

Regards,

 

Mike


 

 

2008-08-06 04:47:53     Re: MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 59975   

 

Excellent!  2008R1 is where I want to eventually be, so having the features on that is great.

 

For the moment I'll keep on SVN trunk to see if it can help diagnose/debug the problems I'm seeing.

 

Many Thanks,

 

Mike

 

 


 

 

2008-08-06 04:53:56     Re: MPU protection maturity

Mike Frysinger (UNITED STATES)

Message: 59977   

 

the MPU does not know anything about FLAT or FDPIC or any file format.  higher layers simply describe memory regions with permissions (read/write/exec) and the MPU respects those.

 

you can see the mapping perms in /proc/maps
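
A minimal sketch of that model, for illustration: a table of address ranges with permission bits and a lookup that ignores file formats entirely. It is not how the Blackfin CPLB code is written; the struct region type, the PERM_* flags, access_allowed() and the addresses in the usage note are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission bits for a mapped region. */
enum { PERM_R = 1, PERM_W = 2, PERM_X = 4 };

/* Hypothetical region descriptor covering [start, end). */
struct region {
    uint32_t start;
    uint32_t end;
    unsigned int perms;
};

/*
 * Returns true only if addr falls inside a described region and that
 * region grants every permission bit in `want`. Addresses outside all
 * regions are refused.
 */
static bool access_allowed(const struct region *map, size_t n,
                           uint32_t addr, unsigned int want)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (addr >= map[i].start && addr < map[i].end)
            return (map[i].perms & want) == want;
    }
    return false;
}
```

With a made-up map such as { 0x1000-0x2000, read+exec } and { 0x2000-0x3000, read+write }, an instruction fetch at 0x1800 is allowed, while a fetch at 0x2800 is refused because that region lacks the exec bit.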


 

 

2008-08-06 05:23:04     Re: MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 60026   

 

Hi,

 

So I got the latest from SVN and got some more traces.  The instructions are now in there, great!

 

Attached are some more traces.  The apps that fail are generally different on each run, but all fail after returning via __common_int_entry + 0xd8 } RTI.  In the case of one app, it is a really simple script, which I also attached (get-proc-addr).  Another observation is that often 2 or 3 apps fail at the same time while others are fine.

 

In the case of the l1l2 trace (a C program), I checked the fault address:

 

$ bfin-uclinux-addr2line -e l1l2.gdb 0x7828a

libc/sysdeps/linux/common/read.c:15

 

This fits with a return from a read system call after the GPIO IRQ has been received.  Unfortunately the ICPLB faults this instruction fetch.

 

Could this be a bug in the memory protections?  Would it be possible to dump the protection maps for each application after such a fault too (assuming they are not corrupted)?

 

Regards,

 

Mike

 

get-proc-addr

cplb-l1l2.txt

cplb-init.txt

cplb-get-proc-addr.txt


 

 

2008-08-06 05:40:14     Re: MPU protection maturity

Mike Frysinger (UNITED STATES)

Message: 60028   

 

hmm, i dont suppose your system has any buttons or such ?  i'm wondering if we can transition from "fpga signaling gpio irq" to "push button signaling gpio irq" as that way we should be able to test on a BF533-EZKIT or BF537-EZKIT board ...

 

all maps are available at /proc/maps

 

about how many IRQs do you see before things crash ?


 

 

2008-08-06 06:02:28     Re: MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 60029   

 

> hmm, i dont suppose your system has any buttons or such ?

 

Not really, but I can solder some onto test points if needed.

 

> about how many IRQs do you see before things crash ?

 

There's an interrupt every 10ms, and it takes somewhere around 2-5 minutes to fail, so quite a lot.
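
To put a rough number on "quite a lot", the arithmetic can be written out as below. The helper name irqs_in() is hypothetical, and the figures assume the interrupt really does arrive at a steady 10ms period for the whole run.

```c
#include <assert.h>

/* Interrupts delivered in `seconds` when one arrives every `period_ms`. */
static long irqs_in(long seconds, long period_ms)
{
    assert(period_ms > 0);
    return seconds * 1000L / period_ms;
}
```

Under that assumption, irqs_in(120, 10) gives 12000 and irqs_in(300, 10) gives 30000, so a failure after 2-5 minutes corresponds to somewhere between about 12,000 and 30,000 interrupts.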

 

What are you thinking here - are you looking for a setup that you could reproduce?  If so, a signal generator on a GPIO with the driver may be one way to do it.  I have a BF537-EZKIT board here, so trying to produce a demonstration of the problem on that probably looks like a step in the right direction?

 

> all maps are available at /proc/maps

 

Ah - it's also summarised in the dump:

 

CPLB protection violation

- Illegal instruction fetch access (memory protection violation).

Deferred Exception context

CURRENT PROCESS:

COMM=l1l2 PID=286

TEXT = 0x05000040-0x050c53a0        DATA = 0x050c53a4-0x05106014

BSS = 0x05106014-0x05237ea4  USER-STACK = 0x05337f6c

 

return address: [0x050782ca]; contents of:

...

DCPLB_FAULT_ADDR: <0xff8016a4> /* kernel dynamic memory */

ICPLB_FAULT_ADDR: <0x050782ca> [ l1l2 + 0x7828a ]

 

So it is in the text section for sure - but why doesn't the ICPLB accept that?
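
The check behind that conclusion can be written out as a short sketch using the numbers from the dump above. The struct text_range type and the two helpers are hypothetical names, and treating the end of the TEXT range as exclusive is an assumption (it makes no difference for this address).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical holder for the TEXT range printed in the fault dump. */
struct text_range {
    uint32_t start;
    uint32_t end;   /* assumed exclusive */
};

/* True if addr lies inside the text range. */
static bool in_text(struct text_range t, uint32_t addr)
{
    return addr >= t.start && addr < t.end;
}

/* Offset of addr from the start of text, as shown in "[ l1l2 + 0x... ]". */
static uint32_t text_offset(struct text_range t, uint32_t addr)
{
    assert(in_text(t, addr));
    return addr - t.start;
}
```

With TEXT = 0x05000040-0x050c53a0 and ICPLB_FAULT_ADDR = 0x050782ca, in_text() is true and text_offset() is 0x7828a, which matches both the "l1l2 + 0x7828a" annotation in the dump and the address passed to bfin-uclinux-addr2line earlier in the thread.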

 

Regards,

 

Mike


 

 

2008-08-06 06:15:59     Re: MPU protection maturity

Mike Frysinger (UNITED STATES)

Message: 60031   

 

yes, a signal generator on a bf537-ezkit would be great as we can easily reproduce that setup


 

 

2008-08-06 09:26:39     Re: MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 60036   

 

Hi,

 

I've managed to reproduce this on a BF537-EZLITE board.  Here are the steps:

 

- Take the 2008R1-rc8 release, replace kernel with SVN trunk version 5110.

- Add in devfpgairq.c to build into the kernel as a module.

- Kernel config as attached, but I'm not sure it matters.

- Build a uImage

- Build the attached test program.

- Boot the uImage, transfer the test program to the target (ftp'd it on)

- Run test program with /dev/fpgairq0 as the only argument:

 

- Now set up the signal generator to generate a square pulse high for around 500us, at about 220Hz - see the attached screen shot from a scope monitoring the signal.

- The signal should be applied to PF5, which can be found on pin 13 of the SPI header if SW5 is set so that push-button 4 is disabled.  Pin 20 is a ground on the same connector.

 

- Note: You need at least a 0.3 BF537 chip on your board as an anomaly on the 0.2 prevents MPU protection being used.
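
The attached testirq.c is not reproduced in this archive. As a rough idea of what such a test program does, here is a minimal sketch of a blocking read loop that prints one dot per completed read. It is a stand-in and not the original attachment; run_read_loop() and its parameters are hypothetical names.

```c
#define _DEFAULT_SOURCE   /* for usleep() under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the attached testirq.c (not the original).
 * Opens `path`, issues up to `count` blocking reads, prints a dot for
 * each read that returns, and sleeps `delay_us` microseconds between
 * reads (0 disables the sleep).
 * Returns the number of completed reads, or -1 if open() fails.
 */
static int run_read_loop(const char *path, int count, unsigned int delay_us)
{
    char buf[1];
    int done = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    while (done < count) {
        if (read(fd, buf, sizeof buf) < 0) {
            perror("read");
            break;
        }
        putchar('.');
        fflush(stdout);
        done++;
        if (delay_us)
            usleep(delay_us);
    }

    close(fd);
    return done;
}
```

On the target, a main() would pass its only argument (for example /dev/fpgairq0) as path. A delay_us of 0 corresponds to running the loop without any usleep() between reads.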

 

Now is the tricky part.  The test program should be displaying a number of dots across the screen, and things are generally okay in this mode for a while.  I tend to then telnet in and run "watch cat /proc/interrupts", and it is when this command is starting or exiting (with Ctrl+C) that I find apps sometimes suddenly hit a CPLB miss and die, as in the attached trace.  I think these look extremely similar to those seen on the BF533, although the hardware trace options are different (I'm going to try a rebuild with the same config to see if it is the same, but I think you'll agree from the traces attached that it is very similar).

 

I'm also going to try and tweak the driver and test program to see if I can make it happen more instantly (I think more interrupts probably help catch the case), but please let me know if there is anything else I can be doing to help this investigation or if there is anything I should try.

 

Regards,

 

Mike

 

Edit 1: Running the siggen at ~20KHz and removing the usleep() from testirq.c makes it fail within seconds.

Edit 2: The "watch /proc/interrupts" command seems significant.  Without this the system does not fail.  Watching /proc/maps also causes a failure so the problem is not limited to this file.

Edit 3: Got a failure with extended backtrace, and it definitely looks the same as on the BF533 board.  See the extended-trace.txt attachment.  Note that during the tracing the target also reset - possibly a double fault?

Edit 4: Problem also occurs with a command such as "watch ls -l", so I think it's possibly an interrupt being handled during any system call that triggers the problem.

 

 

 

 

 

 

 

siggen.png

testirq.c

consolelog.txt

extended-trace.txt

config

devfpgairq.c


 

 

2008-08-06 12:26:05     Re: MPU protection maturity

Robin Getz (UNITED STATES)

Message: 60043   

 

Michael:

 

While someone here sets things up with the EZKits, can you capture a longer trace? What I would like to see (if possible) is the entry point into common_int_entry().

 

-Robin


 

 

2008-08-06 12:39:03     Re: MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 60046   

 

Sure - let me know any settings or patches you think may help and I can try them.  I'm just rebuilding with a 16k trace buffer.

 

BTW, I've tried disabling the MPU protection and my test runs fine.  I'm pretty sure there's a bug in the MPU protection regarding interrupts at just the wrong time.  And while 20KHz sustained is an absurd rate for interrupts, it does show this problem quite rapidly.  (Note our application which originally showed this problem has only a 100Hz timer.)


 

 

2008-08-06 13:32:16     Re: MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 60051   

 

Attached are a couple of long traces using a 16k trace buffer and single level loop compression.

 

The first trace only had a.out fail, which is the testirq app that just reads from /dev/fpgairq0 repeatedly.  The second trace was one that took out other unrelated apps too, in this case a.out and klogd died.

 

From the first trace the entry point looks to be here:

 

840 Target : <0x00013d04> { _irq_enter + 0x0 }
     Source : <0xffa002f8> { _asm_do_IRQ + 0x24 } CALL pcrel
841 Target : <0xffa002d4> { _asm_do_IRQ + 0x0 }
     Source : <0xffa00ea0> { _do_irq + 0x74 } JUMP.L
842 Target : <0xffa00e9c> { _do_irq + 0x70 }
     Source : <0xffa00e70> { _do_irq + 0x44 } IF !CC JUMP
843 Target : <0xffa00e6b> { _do_irq + 0x3f }
     Source : <0xffa00e76> { _do_irq + 0x4a } IF CC JUMP
844 Target : <0xffa00e72> { _do_irq + 0x46 }
     Source : <0xffa00e66> { _do_irq + 0x3a } IF CC JUMP
845 Target : <0xffa00e2c> { _do_irq + 0x0 }
     Source : <0xffa00c06> { __common_int_entry + 0x5e } CALL pcrel
846 Target : <0xffa00ba8> { __common_int_entry + 0x0 }
     Source : <0xffa00db6> { _evt_evt12 + 0xa } JUMP.S
847 Target : <0xffa00dac> { _evt_evt12 + 0x0 }
     Source : <0xffa004b6> { _bfin_return_from_exception + 0x6 } RTX
848 Target : <0x0008c608> { _rb_erase + 0x1c8 }
     Source : <0x0008c5f8> { _rb_erase + 0x1b8 } IF !CC JUMP
849 Target : <0x0008c5e8> { _rb_erase + 0x1a8 }
     Source : <0x0008c450> { _rb_erase + 0x10 } IF !CC JUMP

 

Regards,

 

Mike

 

cplb-largetrace1.txt.gz

cplb-largetrace0.txt.gz


 

 

2008-08-07 14:21:02     Re: MPU protection maturity

Robin Getz (UNITED STATES)

Message: 60120   

 

Mike:

 

I think we should be able to replicate things on our side tomorrow.

 

Thanks for your help.

 

-robin


 

 

2008-08-08 18:18:55     Re: MPU protection maturity

Mike Frysinger (UNITED STATES)

Message: 60192   

 

ok, with all your info, we've been able to reproduce things over here ... we'll bang on it and get back to you


 

 

2008-08-12 05:59:09     Re: MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 60285   

 

Any news on progress yet?

 

Mike

 

 


 

 

2008-08-20 14:35:48     Re: MPU protection maturity

Robin Getz (UNITED STATES)

Message: 60774   

 

Michael:

 

Mike opened a bug - this issue is tracked there.

 

https://blackfin.uclinux.org/gf/project/uclinux-dist/tracker/?action=TrackerItemEdit&tracker_item_id=4328

 

No updates, means no progress - it doesn't mean no one is working on it.


 

 

2008-08-28 18:50:22     Re: MPU protection maturity

Mike Frysinger (UNITED STATES)

Message: 61285   

 

the issue should be fixed in latest trunk/branch if you want to svn up


 

 

2008-08-29 11:26:38     Re: MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 61381   

 

Cool - I'm checking it out now.  Unfortunately the weekend means I won't have access to the hardware I need, so I probably won't be able to confirm the fix until Monday.

 

Regards,

 

Mike


 

 

2008-09-01 08:47:45     Re: MPU protection maturity

Michael McTernan (UNITED KINGDOM)

Message: 61478   

 

I've got my setup together again and verified it shows the problem prior to the fix.  I've tested this for a couple of hours now with the fix and seen no problems, so it looks to be very definitely fixed.

 

Many Thanks,

 

Mike
