2008-08-05 11:16:29 MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 59944
Hi there,
I checked out the trunk kernel from SVN, and built with MPU protection enabled to see if it would catch anything. So far I've found a couple of problems, but they look like maybe the MPU protection has malfunctioned?
For example, in one report 'sort' hit the CPLB for an illegal instruction fetch on what looks like the return from a system call. After this one of our applications also faults on return from a system call after handling an interrupt from a device. Oddly DCPLB_FAULT_ADDR = <0xff8016a4> for both the apps that hit the CPLB. The trace for sort and then the second app to fail, l1l2 are attached.
What is the most likely cause of the problem, or what can I do to track it down? Is MPU protection in SVN ready for prime time?
Regards,
Mike
cplb-sort.txt
cplb-l1l2.txt
2008-08-05 11:27:21 Re: MPU protection maturity
Mike Frysinger (UNITED STATES)
Message: 59949
MPU in the 2008R1 release is not usable. however, it should be stable in trunk (and maybe the 2008R1 branch). if trunk is crashing, we should def fix it.
i dont suppose you have a reduced test case that doesnt involve a custom driver so that we can debug things on our side ?
i'm assuming of course that your hardware is actually recent enough and you arent running on older silicon with unusable anomalies ...
2008-08-05 12:24:49 Re: MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 59952
Hi,
Cool - I took the SVN code since the 2008R1.5 release note suggested it may be stable.
> i dont suppose you have a reduced test case that doesnt
> involve a custom driver so that we can debug things on our side ?
I've been trying to reduce the system as much as possible, but I think it is unlikely I can get to a 100 line or so example. I know this makes it difficult for you guys, but I'm struggling too!
That said, the fpgairq driver listed in the l1l2 trace is based on ADI GPL'd code and is small (~300 lines). It has an interrupt handler thus:
static irqreturn_t fpgairq_handler(int irq, void *p)
{
    /* Disable this interrupt */
    disable_irq_nosync(irq);

    /* Enable any blocked task to run */
    fpgairq_waitcond = 1;
    wake_up_interruptible(&fpgairq_waitq);

    return IRQ_HANDLED;
}
And a read function that basically blocks for the interrupt but returns no data and disables the source (it leaves it up to the user space app to do some work and to then cause the external interrupt source to be cleared via GPIOs):
static ssize_t fpgairq_read(struct file *filp, char *buf, size_t size, loff_t * offp)
{
    const int irq = gpio_to_irq(PIN_IRQ);

    if(MINOR(filp->f_dentry->d_inode->i_rdev) != 0)
        return -ENODEV;

    if(size <= 0)
        return -EMSGSIZE;

    if(!access_ok(VERIFY_WRITE, buf, size))
        return -EFAULT;

    /* Check if the interrupt line is already asserted.
     * If it is, don't sleep and wait for the next interrupt;
     * simply return without any rescheduling or switching.
     */
    if(get_gpio_data(PIN_IRQ) == 0)
    {
        /* Clear the wait condition */
        fpgairq_waitcond = 0;

        /* Enable the interrupt source */
        enable_irq(irq);

        /* Now wait for an interrupt */
        wait_event_interruptible(fpgairq_waitq, fpgairq_waitcond != 0);

        /* The interrupt handler will have disabled the interrupt again */
    }

    return size;
}
The last part is the open function which sets up the interrupt as a level sensitive source. It also juggles the local interrupts to ensure the enable and disable counts stay balanced even if the interrupt source is already asserted when starting things up:
static int fpgairq_open(struct inode *inode, struct file *filp)
{
    unsigned long irqsaveflags;
    int irq;

    if(MINOR(inode->i_rdev) != 0)
    {
        printk(KERN_NOTICE "fpgairq: no dev\n");
        return -ENODEV;
    }

    /* Get the interrupt number for the GPIO */
    irq = gpio_to_irq(PIN_IRQ);
    if(irq < 0)
    {
        printk(KERN_NOTICE "fpgairq: gpio_to_irq() failed\n");
        return -EIO;
    }

    /* Disable all interrupts to prevent the target interrupt
     * occurring before configuration is complete.
     */
    local_irq_save(irqsaveflags);

    /* Request required interrupt, which also allocates the GPIO */
    if(request_irq(irq, fpgairq_handler, IRQF_TRIGGER_HIGH, "fpgairq", NULL) < 0)
    {
        local_irq_restore(irqsaveflags);
        printk(KERN_NOTICE "fpgairq: request_irq() failed\n");
        return -EIO;
    }

    /* Disable the interrupt source until read() is called.
     * This ensures the disables and enables balance.
     */
    disable_irq(irq);

    /* Restore interrupt state */
    local_irq_restore(irqsaveflags);

    /* Done */
    return 0;
}
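To make the "balance" reasoning easier to follow, here is a toy user-space model of the enable/disable bookkeeping across the three functions. It is only a sketch: `struct irq_model` and the `model_*` helpers are hypothetical names, not kernel APIs, and it assumes the usual Linux behaviour that disables nest and each one needs a matching enable before the line is unmasked again.

```c
#include <stdbool.h>

/* Toy model of nested IRQ disabling (hypothetical names, not kernel code).
 * Assumption: every disable bumps a depth counter, and the line is only
 * live again once the counter is back at zero. */
struct irq_model {
    int depth;       /* outstanding disables */
    int unbalanced;  /* enables seen while depth was already zero */
};

void model_disable(struct irq_model *m)
{
    m->depth++;
}

void model_enable(struct irq_model *m)
{
    if (m->depth == 0)
        m->unbalanced++;  /* a real kernel would complain at this point */
    else
        m->depth--;
}

bool model_is_enabled(const struct irq_model *m)
{
    return m->depth == 0;
}

/* Replays the driver's sequence: one open(), then `reads` rounds of
 * read() followed by the interrupt handler. Returns the final depth. */
int model_run(struct irq_model *m, int reads)
{
    m->depth = 0;          /* assumed state right after request_irq() */
    m->unbalanced = 0;
    model_disable(m);      /* open(): disable_irq() */
    for (int i = 0; i < reads; i++) {
        model_enable(m);   /* read(): enable_irq() before sleeping */
        model_disable(m);  /* handler: disable_irq_nosync() */
    }
    return m->depth;
}
```

Under those assumptions the depth sits at 1 whenever no read() is pending and never goes negative, which is the invariant the open function is trying to establish.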
I will post the complete driver source on request, but these three parts are the only things that get repeatedly executed, so I think other parts are irrelevant.
For silicon revision I'm using BF533 ver 0.5 which still has loads of errata, but hopefully is recent enough? I think the hardware platform is okay as we didn't see such problems under 2006R1, although notably the MPU protection is new. The apps do however run happily with mudflap or Valgrind.
Regards,
Mike
2008-08-05 17:39:10 Re: MPU protection maturity
Robin Getz (UNITED STATES)
Message: 59955
Mike:
Hmm - your trace looks pretty weird - is this a stock kernel from here? or is there a kgdb or ADEOS patch (or anything else) on top?
Your trace says that the RETI goes off (which is fine) but that goes to -
_ex_dcplb_miss + 0x76
which should not happen. when processing cplb_miss - interrupts should be off - there isn't much that can be going on, except a stack corruption...
Did you try stack checking?
https://docs.blackfin.uclinux.org/doku.php?id=debuging_applications#stack_checking
-Robin
BTW - I fixed a mistake in the trace printout - can you svn up, and try again?
2008-08-05 23:02:13 Re: MPU protection maturity
Mike Frysinger (UNITED STATES)
Message: 59961
ok, i looked through the current 2008R1 branch and it seems to have all the critical fixes, so it should be usable. i just backported the few additional pieces from trunk, but none of those should affect you.
2008-08-06 04:36:40 Re: MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 59973
> Hmm - your trace looks pretty weird - is this a stock kernel from here?
> or is there a kgdb or ADEOS patch (or anything else) on top?
It's from your SVN trunk with no patches applied. I've just pulled in my local .config and added a GPIO IRQ driver.
> Your trace says that the RETI goes off (which is fine) but that goes to -
> _ex_dcplb_miss + 0x76
> which should not happen.
Right, particularly it's an illegal instruction fetch, but the address is decoded as being "sort + 0x227a", within the executable attempting to run. Does the MPU split code and data in FLAT binaries? If not, then this should be impossible, unless there is a bug in the CPLB handling? I'll check on the next crash that the instruction address isn't in a data region.
> Did you try stack checking?
Yes, but it has not found any problems for me. I've also set up the stacks to be quite large.
> BTW - I fixed a mistake in the trace printout - can you svn up, and try again?
Cool - I'm at SVN revision 5110 now. I'll give it a go.
Regards,
Mike
2008-08-06 04:47:53 Re: MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 59975
Excellent! 2008R1 is where I want to eventually be, so having the features on that is great.
For the moment I'll keep on SVN trunk to see if it can help diagnose/debug the problems I'm seeing.
Many Thanks,
Mike
2008-08-06 04:53:56 Re: MPU protection maturity
Mike Frysinger (UNITED STATES)
Message: 59977
the MPU does not know anything about FLAT or FDPIC or any file format. higher layers simply describe memory regions with permissions (read/write/exec) and the MPU respects those.
you can see the mapping perms in /proc/maps
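As a purely conceptual illustration of that model (regions plus read/write/exec permissions, with no knowledge of file formats), a lookup could be sketched as below. The types, the `PERM_*` flags and `access_allowed()` are hypothetical names made up for this example; this is not the Blackfin kernel's CPLB code, and the region end is assumed to be exclusive.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Conceptual sketch only: hypothetical types, not the kernel's CPLB tables. */
enum { PERM_R = 1, PERM_W = 2, PERM_X = 4 };

struct region {
    uint32_t start;  /* first address covered */
    uint32_t end;    /* one past the last address covered (assumed exclusive) */
    unsigned perms;  /* bitwise OR of PERM_* flags */
};

/* Returns true if a region covers addr and grants every bit in `want`. */
bool access_allowed(const struct region *map, size_t n,
                    uint32_t addr, unsigned want)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= map[i].start && addr < map[i].end)
            return (map[i].perms & want) == want;
    }
    return false;  /* no region covers the address */
}
```

In this picture an instruction fetch is just a lookup with `PERM_X`: it succeeds for an address inside a region marked executable and fails anywhere else, whatever binary format produced the mapping.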
2008-08-06 05:23:04 Re: MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 60026
Hi,
So I got the latest from SVN and got some more traces. The instructions are now in there, great!
Attached are some more traces. The apps that fail are generally different on each run, but all fail after returning via __common_int_entry + 0xd8 } RTI. In the case of one app, it is a really simple script, which I also attached (get-proc-addr). Another observation is that often 2 or 3 apps fail at the same time while others are fine.
In the case of the l1l2 trace (a C program), I checked the fault address:
$ bfin-uclinux-addr2line -e l1l2.gdb 0x7828a
libc/sysdeps/linux/common/read.c:15
This fits with a return from a read system call after the GPIO IRQ has been received. Unfortunately the ICPLB faults this instruction fetch.
Could this be a bug in the memory protections? Would it be possible to dump the protection maps for each application after such a fault too (assuming they are not corrupted)?
Regards,
Mike
get-proc-addr
cplb-l1l2.txt
cplb-init.txt
cplb-get-proc-addr.txt
2008-08-06 05:40:14 Re: MPU protection maturity
Mike Frysinger (UNITED STATES)
Message: 60028
hmm, i dont suppose your system has any buttons or such ? i'm wondering if we can transition from "fpga signaling gpio irq" to "push button signaling gpio irq" as that way we should be able to test on a BF533-EZKIT or BF537-EZKIT board ...
all maps are available at /proc/maps
about how many IRQs do you see before things crash ?
2008-08-06 06:02:28 Re: MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 60029
> hmm, i dont suppose your system has any buttons or such ?
Not really, but I can solder some onto test points if needed.
> about how many IRQs do you see before things crash ?
There's an interrupt every 10ms, and it takes somewhere around 2-5 minutes to fail, so quite a lot.
What are you thinking here - are you looking for a setup that you could reproduce? If so, a signal generator on a GPIO with the driver may be one way to do it. I have a BF537-EZKIT board here, so trying to produce a demonstration of the problem on that probably looks like a step in the right direction?
> all maps are available at /proc/maps
Ah - it's also summarised in the dump:
CPLB protection violation
- Illegal instruction fetch access (memory protection violation).
Deferred Exception context
CURRENT PROCESS:
COMM=l1l2 PID=286
TEXT = 0x05000040-0x050c53a0 DATA = 0x050c53a4-0x05106014
BSS = 0x05106014-0x05237ea4 USER-STACK = 0x05337f6c
return address: [0x050782ca]; contents of:
...
DCPLB_FAULT_ADDR: <0xff8016a4> /* kernel dynamic memory */
ICPLB_FAULT_ADDR: <0x050782ca> [ l1l2 + 0x7828a ]
So it is in the text section for sure - but why doesn't the ICPLB accept that?
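The figures in the dump bear this out: subtracting the TEXT start from ICPLB_FAULT_ADDR gives 0x050782ca - 0x05000040 = 0x7828a, the offset shown in brackets. A small sketch of that check follows; `text_offset()` is a hypothetical helper written for illustration, and it assumes the TEXT end address is exclusive.

```c
#include <stdint.h>

/* Returns the offset of addr within [text_start, text_end),
 * or -1 if addr lies outside that range.
 * Assumption: text_end is one past the last text byte. */
int64_t text_offset(uint32_t addr, uint32_t text_start, uint32_t text_end)
{
    if (addr < text_start || addr >= text_end)
        return -1;
    return (int64_t)(addr - text_start);
}
```

With the values above, `text_offset(0x050782ca, 0x05000040, 0x050c53a0)` yields 0x7828a, while the DCPLB_FAULT_ADDR value 0xff8016a4 falls outside the range and yields -1.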
Regards,
Mike
2008-08-06 06:15:59 Re: MPU protection maturity
Mike Frysinger (UNITED STATES)
Message: 60031
yes, a signal generator on a bf537-ezkit would be great as we can easily reproduce that setup
2008-08-06 09:26:39 Re: MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 60036
Hi,
I've managed to reproduce this on a BF537-EZLITE board. Here are the steps:
- Take the 2008R1-rc8 release, replace kernel with SVN trunk version 5110.
- Add in devfpgairq.c to build into the kernel as a module.
- Kernel config as attached, but I'm not sure it matters.
- Build a uImage
- Build the attached test program.
- Boot the uImage, transfer the test program to the target (ftp'd it on)
- Run test program with /dev/fpgairq0 as the only argument:
- Now setup the signal generator to generate a square pulse high for around 500us, at about 220Hz - see the attached screen shot from a scope monitoring the signal.
- The signal should be applied to PF5, which can be found on pin 13 of the SPI header if SW5 is set so that push-button 4 is disabled. Pin 20 is a ground on the same connector.
- Note: You need at least a 0.3 BF537 chip on your board as an anomaly on the 0.2 prevents MPU protection being used.
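For readers without the attachments, the core of a test program like the one in the steps above might look roughly like this. It is a hypothetical stand-in, not the attached testirq.c: `poll_device()` is an invented name, and the only things taken from the thread are that the program reads repeatedly from the device given as its argument and prints dots.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for testirq.c (the real attachment is not shown here).
 * Opens `path`, performs up to `count` blocking one-byte reads and prints a
 * dot for each successful read. Returns the number of successful reads,
 * or -1 if the device cannot be opened. */
int poll_device(const char *path, int count)
{
    char byte;
    int ok = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    for (int i = 0; i < count; i++) {
        /* With the fpgairq driver, read() blocks until the GPIO interrupt fires. */
        if (read(fd, &byte, sizeof byte) < 0)
            break;
        ok++;
        putchar('.');
        fflush(stdout);
    }

    close(fd);
    return ok;
}
```

A caller would pass the device path from the command line, for example `poll_device(argv[1], 1000)`. The real program apparently also called usleep() between reads, which this sketch leaves out.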
Now is the tricky part. The test program should be displaying a number of dots across the screen, and things are generally okay in this mode for a while. I tend to then telnet in and run "watch cat /proc/interrupts", and it is when this command is starting or exiting (with Ctrl+C) that I find apps sometimes suddenly hit a CPLB miss and die, as in the attached trace. I think these look extremely similar to those seen on the BF533, although the hardware trace options are different (I'm going to try a rebuild with the same config to see if it is the same, but I think you'll agree from the traces attached that it is very similar).
I'm also going to try and tweak the driver and test program to see if I can make it happen more instantly (I think more interrupts probably help catch the case), but please let me know if there is anything else I can be doing to help this investigation or if there is anything I should try.
Regards,
Mike
Edit 1: Running the siggen at ~20KHz and removing the usleep() from testirq.c makes it fail within seconds.
Edit 2: The "watch /proc/interrupts" command seems significant. Without this the system does not fail. Watching /proc/maps also causes a failure so the problem is not limited to this file.
Edit 3: Got a failure with extended backtrace, and it definitely looks the same as on the BF533 board. See the extended-trace.txt attachment. Note that during the tracing the target also reset - possibly a double fault?
Edit 4: Problem also occurs with a command such as "watch ls -l", so I think it's possibly an interrupt being handled during any system call that triggers the problem.
siggen.png
testirq.c
consolelog.txt
extended-trace.txt
config
devfpgairq.c
2008-08-06 12:26:05 Re: MPU protection maturity
Robin Getz (UNITED STATES)
Message: 60043
Michael:
While someone here sets things up with the EZKits, can you capture a longer trace? What I would like to see (if possible) is the entry point into common_int_entry()
-Robin
2008-08-06 12:39:03 Re: MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 60046
Sure - let me know any settings or patches you think may help and I can try them. I'm just rebuilding with a 16k trace buffer.
BTW, I've tried disabling the MPU protection and my test runs fine. I'm pretty sure there's a bug in the MPU protection regarding interrupts at just the wrong time. And while 20KHz sustained is an absurd rate for interrupts, it does show this problem quite rapidly. (Note our application which originally showed this problem has only a 100Hz timer.)
2008-08-06 13:32:16 Re: MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 60051
Attached are a couple of long traces using a 16k trace buffer and single level loop compression.
The first trace only had a.out fail, which is the testirq app that just reads from /dev/fpgairq0 repeatedly. The second trace was one that took out other unrelated apps too, in this case a.out and klogd died.
From the first trace the entry point looks to be here:
840 Target : <0x00013d04> { _irq_enter + 0x0 }
    Source : <0xffa002f8> { _asm_do_IRQ + 0x24 } CALL pcrel
841 Target : <0xffa002d4> { _asm_do_IRQ + 0x0 }
    Source : <0xffa00ea0> { _do_irq + 0x74 } JUMP.L
842 Target : <0xffa00e9c> { _do_irq + 0x70 }
    Source : <0xffa00e70> { _do_irq + 0x44 } IF !CC JUMP
843 Target : <0xffa00e6b> { _do_irq + 0x3f }
    Source : <0xffa00e76> { _do_irq + 0x4a } IF CC JUMP
844 Target : <0xffa00e72> { _do_irq + 0x46 }
    Source : <0xffa00e66> { _do_irq + 0x3a } IF CC JUMP
845 Target : <0xffa00e2c> { _do_irq + 0x0 }
    Source : <0xffa00c06> { __common_int_entry + 0x5e } CALL pcrel
846 Target : <0xffa00ba8> { __common_int_entry + 0x0 }
    Source : <0xffa00db6> { _evt_evt12 + 0xa } JUMP.S
847 Target : <0xffa00dac> { _evt_evt12 + 0x0 }
    Source : <0xffa004b6> { _bfin_return_from_exception + 0x6 } RTX
848 Target : <0x0008c608> { _rb_erase + 0x1c8 }
    Source : <0x0008c5f8> { _rb_erase + 0x1b8 } IF !CC JUMP
849 Target : <0x0008c5e8> { _rb_erase + 0x1a8 }
    Source : <0x0008c450> { _rb_erase + 0x10 } IF !CC JUMP
Regards,
Mike
cplb-largetrace1.txt.gz
cplb-largetrace0.txt.gz
2008-08-07 14:21:02 Re: MPU protection maturity
Robin Getz (UNITED STATES)
Message: 60120
Mike:
I think we should be able to replicate things on our side tomorrow.
Thanks for your help.
-robin
2008-08-08 18:18:55 Re: MPU protection maturity
Mike Frysinger (UNITED STATES)
Message: 60192
ok, with all your info, we've been able to reproduce things over here ... we'll bang on it and get back to you
2008-08-12 05:59:09 Re: MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 60285
Any news on progress yet?
Mike
2008-08-20 14:35:48 Re: MPU protection maturity
Robin Getz (UNITED STATES)
Message: 60774
Michael:
Mike opened a bug - this issue is tracked there.
No updates, means no progress - it doesn't mean no one is working on it.
2008-08-28 18:50:22 Re: MPU protection maturity
Mike Frysinger (UNITED STATES)
Message: 61285
the issue should be fixed in latest trunk/branch if you want to svn up
2008-08-29 11:26:38 Re: MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 61381
Cool - I'm checking it out now. Unfortunately the weekend means I won't have access to the hardware I need, so I probably won't be able to confirm the fix until Monday.
Regards,
Mike
2008-09-01 08:47:45 Re: MPU protection maturity
Michael McTernan (UNITED KINGDOM)
Message: 61478
I've got my setup together again and verified it shows the problem prior to the fix. I've tested this for a couple of hours now with the fix and seen no problems, so it looks to be very definitely fixed.
Many Thanks,
Mike
