2008-07-24 04:53:01 Indefinate stall at .Lcplb_error: call _panic_cplb_error???
Michael McTernan (UNITED KINGDOM)
Message: 59307
Hi,
I'm using 2008R1rc8 and that toolchain on a BF533 (ver0.5) board which has previously run 2006R1 without problems.
From time to time everything hard locks under 2008R1, typically when running my application, which uses lots of DMA SPORT access, pflag level interrupts and pflag toggling. The kernel soft lockup detection finds nothing, and the magic SysRq is also unresponsive. I hacked in an NMI handler to go off when the watchdog expires, to printk() a message, dump the Blackfin trace buffer and then dump running tasks as though SysRq T were requested. In the failure case the NMI handler does not appear to run: there is no output from it (letting the dog expire deliberately when not locked does run the NMI dump).
Using an ADI HPUSB ICE, I've profiled the system while locked and found all the samples come back with the same address which corresponds to the following from cplbhdlr.S:
.Lcplb_error:
R1 = sp;
SP += -12;
call _panic_cplb_error; <--- Address is for this instruction
SP += 12;
JUMP.L _handle_bad_cplb;
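For anyone repeating this kind of profiling, here is a minimal sketch of mapping a sampled PC back to the nearest kernel symbol using System.map-style text. The addresses in `SAMPLE_MAP` are made up for illustration, and `load_symbols` and `resolve` are hypothetical helper names, not part of any ADI tool. Local labels such as .Lcplb_error are not normally listed in System.map, so a sample there would resolve to the enclosing global symbol plus an offset.

```python
import bisect

# Hypothetical excerpt in System.map format: "<hex address> <type> <symbol>".
# These addresses are invented; use the System.map from your own kernel build.
SAMPLE_MAP = """\
00001000 T _start
00002000 T _cplb_mgr
00002400 t _handle_bad_cplb
00003000 T _panic_cplb_error
"""

def load_symbols(text):
    """Parse System.map-style text into a list of (address, name), sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def resolve(symbols, pc):
    """Return (symbol name, offset) for the nearest symbol at or below pc, or None."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, pc) - 1
    if index < 0:
        return None  # pc is below the first known symbol
    addr, name = symbols[index]
    return name, pc - addr

if __name__ == "__main__":
    table = load_symbols(SAMPLE_MAP)
    hit = resolve(table, 0x2010)
    print("%s+0x%x" % hit)
```

If every sample resolves to the same symbol and offset, as described above, the core is either spinning on or stalled at a single instruction.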
Interestingly, if I halt and step or resume the processor when in this condition, it starts to handle the NMI and dumps the first printk() messages to the serial port, but then doesn't get much further, so it isn't helpful in debugging. The trace buffer also isn't displayed by the debugger (could just be a UI thing though).
So... I'm not sure if the processor could indefinitely stall on a single call instruction; checking the anomalies, I didn't see anything specific to call. I've also built the kernel for silicon rev 0.5 to match the chip, so should have all the anomaly workarounds too.
Has anyone seen anything similar or have any suggestions of things to try? I can post more information on request if I've missed anything.
Regards,
Mike
PS: Kernel config attached
2008-07-24 05:05:29 Re: Indefinate stall at .Lcplb_error: call _panic_cplb_error???
Mike Frysinger (UNITED STATES)
Message: 59313
i seem to recall that 2008R1 did not handle cplb misses terribly well. so you'd be better off not triggering those when possible ;).
you may also be triggering a double fault which means the system is hosed. try applying this patch and see if you get different behavior:
2008-07-24 07:53:12 Re: Indefinate stall at .Lcplb_error: call _panic_cplb_error???
Robin Getz (UNITED STATES)
Message: 59347
Mike:
I thought we fixed up all the CPLB issues before 2008R1 was cut.
Michael:
Just to check - can you compile user/blkfin-test/crash_test/traps_test.c and make sure that works on your platform?
Just something like:
bfin-linux-uclibc-gcc -O2 user/blkfin-test/crash_test/traps_test.c -o ./traps_test
and then on the target:
root:/> dmesg -n 2
root:/> ./traps_test -1
Running test 0 : Data access misaligned address violation
... PASS (test failed as expected by signal 7: Bus error)
Running test 1 : Data access CPLB miss
... PASS (test failed as expected by signal 7: Bus error)
Running test 2 : Data access multiple CPLB hits/Null Pointer
... PASS (test failed as expected by signal 11: Segmentation fault)
Running test 3 : Instruction fetch misaligned address violation
... PASS (test failed as expected by signal 7: Bus error)
Running test 4 : Instruction fetch CPLB miss
... PASS (test failed as expected by signal 7: Bus error)
Running test 5 : l1_instruction_access
...
Thanks
-Robin
2008-07-24 08:53:19 Re: Indefinate stall at .Lcplb_error: call _panic_cplb_error???
Michael McTernan (UNITED KINGDOM)
Message: 59348
I've tried patching this in, but the system still hangs in the same way without resetting.
Also, wouldn't the NMI still get serviced during a double-fault loop? It's also odd how breaking and resuming with the ICE allows the processor to advance to the NMI handler.
2008-07-24 08:58:36 Re: Indefinate stall at .Lcplb_error: call _panic_cplb_error???
Michael McTernan (UNITED KINGDOM)
Message: 59350
Cool - I didn't know about this test.
It compiled and runs with the following output:
root:/mnt> ./traps_test -1
Running test 0 : Data access misaligned address violation
... PASS (test failed as expected by signal 7: Bus error)
Running test 1 : Data access CPLB miss
... PASS (test failed as expected by signal 7: Bus error)
Running test 2 : Data access multiple CPLB hits/Null Pointer
... PASS (test failed as expected by signal 11: Segmentation fault)
Running test 3 : Instruction fetch misaligned address violation
... PASS (test failed as expected by signal 7: Bus error)
Running test 4 : Instruction fetch CPLB miss
... FAIL (test failed, but not with the right signal)
(We expected 7 'Bus error' but instead we got 4 'Illegal instruction')
Running test 5 : l1_instruction_access
... PASS (test failed as expected by signal 7: Bus error)
Running test 6 : Illegal use of supervisor resource - Instruction
... PASS (test failed as expected by signal 4: Illegal instruction)
Running test 7 : Illegal use of supervisor resource - MMR
... PASS (test failed as expected by signal 7: Bus error)
Running test 8 : Instruction fetch multiple CPLB hits - Jump to zero
... FAIL (test failed, but not with the right signal)
(We expected 11 'Segmentation fault' but instead we got 4 'Illegal instruction')
Running test 9 : RAISE 5 instruction
... PASS (test failed as expected by signal 4: Illegal instruction)
Running test 10 : Invalid Opcode
... PASS (test failed as expected by signal 4: Illegal instruction)
Looks fine - the fail in test 4 isn't a big concern for me, and in particular it didn't cause the system to lock up in the same way that I am seeing.
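When comparing traps_test runs across kernel versions, it can help to pull out just the mismatches. The following is a minimal sketch that assumes the output format shown in the run above; `failures` is a hypothetical helper name and is not part of traps_test itself.

```python
import re

# Abridged sample in the same format as the traps_test output quoted above.
SAMPLE = """\
Running test 3 : Instruction fetch misaligned address violation
... PASS (test failed as expected by signal 7: Bus error)
Running test 4 : Instruction fetch CPLB miss
... FAIL (test failed, but not with the right signal)
(We expected 7 'Bus error' but instead we got 4 'Illegal instruction')
"""

HEADER = re.compile(r"^Running test (\d+) : (.*)$")
EXPECT = re.compile(r"We expected (\d+) '[^']*' but instead we got (\d+) '[^']*'")

def failures(text):
    """Return a list of (test number, description, expected signal, actual signal)."""
    found = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            # Remember which test the following result lines belong to.
            current = (int(header.group(1)), header.group(2))
            continue
        mismatch = EXPECT.search(line)
        if mismatch and current:
            found.append((current[0], current[1],
                          int(mismatch.group(1)), int(mismatch.group(2))))
    return found

if __name__ == "__main__":
    for number, name, expected, actual in failures(SAMPLE):
        print("test %d (%s): expected signal %d, got %d" % (number, name, expected, actual))
```

Run against the full output above, this would list both mismatches in that run (tests 4 and 8), each of which reported signal 4 instead of the expected signal.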
Mike
2008-07-24 10:46:39 Re: Indefinate stall at .Lcplb_error: call _panic_cplb_error???
Robin Getz (UNITED STATES)
Message: 59352
Michael:
If the core double faults - it is an unrecoverable event - from what I have observed, no NMI, no anything will get you out.
Only reset, and watchdog.
When you said "From time to time" - is there anything specific about what you are doing?
2008-07-24 11:51:11 Re: Indefinate stall at .Lcplb_error: call _panic_cplb_error???
Michael McTernan (UNITED KINGDOM)
Message: 59356
Hi Robin,
> If the core double faults - it is an unrecoverable event - from what I have observed, no NMI, no anything will get you out.
Ok - you confirmed it. I checked the hardware reference and it agrees:
"If an exception is caused while executing code within the exception handler, the NMI handler, the reset vector, or in emulator mode:
...
• The generated exception is not taken."
So my attempt to use the NMI to dump info about where it was stuck was misguided.
> When you said "From time to time" - is there anything specific about what you are doing?
Not really. It's just running the OS with a bunch of apps and threads that talk to some peripherals via GPIOs and the SPORT and shuffle data through UDP sockets. "From time to time" is because I've seen the system hang in this way after being up for either a few minutes or on some occasions a couple of hours. Unfortunately we need stability that counts into days.
I can believe that one of these apps may be causing a problem, but really need to be able to debug it - the same apps build and run fine under 2006R1, and where possible run clean under Valgrind on i686.
One thing I plan to do is leave the board running overnight without our apps to see if it has hung in the morning or not. If that's good, I'll then try adding a few apps at a time to see if/when it starts failing. Unfortunately this could take a lot of time and still not point closer than the application causing the problem. Having built a large complex system on 2006R1, it's hard to strip back down to test 2008R1.
Regards,
Mike
2008-07-24 21:06:36 Re: Indefinate stall at .Lcplb_error: call _panic_cplb_error???
Robin Getz (UNITED STATES)
Message: 59367
Michael:
A few things to try:
- MPU (turn on the memory protection unit on the Blackfin)
- turn on mudflap. (see https://docs.blackfin.uclinux.org/doku.php?id=debuging_applications#mudflap - better than valgrind, but slower).
With the patch applied - things should reset if it double faults...
I'll poke at things tomorrow.
-Robin
2008-07-30 06:49:31 Re: Indefinate stall at .Lcplb_error: call _panic_cplb_error???
Michael McTernan (UNITED KINGDOM)
Message: 59599
Hi,
I tried the MPU and found this didn't like the NAND driver, faulting shortly after boot.
Building with mudflap found one benign problem, but correction of this problem hasn't stopped the hanging.
Some good news is that without all the apps, or with a different set of test apps loaded, the system is stable. So I think the problem is being caused by one of my apps, although I am struggling to determine which one is the problem, let alone where within the app the issue is. Further work with Valgrind is in order, but so far nothing has been found there.
I've started looking further into the mechanism of the hang. From the trace buffer, what appears to be happening is that the system ends up executing an infinite loop through _ex_trap_c, returning to _cplb_mgr, which immediately causes another exception, bouncing through _ex_trap_c again.
I'm not sure what came before that, but the failing part looks to be from _cplb_mgr here:
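One way to confirm this kind of loop from a trace dump is to check whether the recorded entries repeat with a fixed period. This is a minimal sketch that assumes the trace buffer has already been exported as a list of symbol names or addresses; `shortest_period` is a hypothetical helper name, and the sample list is illustrative rather than copied from the ICE.

```python
def shortest_period(entries):
    """Return the smallest p such that entries[i] == entries[i + p] for every i,
    or None if the sequence does not repeat within half its length."""
    count = len(entries)
    for period in range(1, count // 2 + 1):
        if all(entries[i] == entries[i + period] for i in range(count - period)):
            return period
    return None

if __name__ == "__main__":
    # Illustrative data only: the two symbols named in the post, alternating.
    trace = ["_ex_trap_c", "_cplb_mgr"] * 4
    print(shortest_period(trace))
```

A small period covering the whole buffer suggests the core is cycling through the same few branches, which also means the buffer no longer holds whatever led into the loop.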
.Lisearch:
R1 = [P0-0x100]; /* Address for this CPLB */
P0 contains a very small value, generally 4, and looks to be causing an exception on execution. This takes IVT3 straight to _ex_trap_c, which sets up and raises IVT5 with no effect. After the RAISE 5 at the end of _ex_trap_c, the ICE shows IMASK = 0x3F, ILAT=0x20, IPEND=0x19.
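To see where that load actually goes, the effective address can be worked out with 32-bit wraparound. This is a minimal sketch; the core MMR base of 0xFFE00000 is recalled from the Blackfin memory map and should be checked against the hardware reference, and `effective_address` is a hypothetical helper name.

```python
MASK32 = 0xFFFFFFFF

# Assumption: core MMR space starts at 0xFFE00000 on Blackfin parts.
# Verify against the hardware reference for your device.
CORE_MMR_BASE = 0xFFE00000

def effective_address(p0, offset):
    """Compute the address accessed by [P0 + offset] with 32-bit wraparound."""
    return (p0 + offset) & MASK32

if __name__ == "__main__":
    addr = effective_address(4, -0x100)  # P0 = 4, as observed with the ICE
    print(hex(addr))
    print("in core MMR range:", addr >= CORE_MMR_BASE)
```

With P0 = 4 the load targets 0xFFFFFF04. In normal operation P0 would presumably point into the CPLB data registers so that P0-0x100 lands on the matching CPLB address register, so a tiny P0 suggests the pointer was clobbered or never set up before reaching .Lisearch.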
I'm uncertain here, but the debugger seems to suggest that IPEND[4]=1 means interrupts are globally disabled. Stepping past the RTX at the end of _ex_trap_c, IPEND is updated to 0x11 while ILAT and IMASK remain unchanged; the PC returns back to the same exception generating _cplb_mgr instruction. IVT5 is never taken since interrupts are globally disabled, so we don't get to _exception_to_level5 which has been setup.
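The register values quoted above can be decoded bit by bit. This is a minimal sketch; the bit names follow the Blackfin core event numbering as recalled (EMU, RST, NMI, EVX, then IVHW at bit 5 and so on) and are an assumption to verify against the programming reference. Bit 4 is treated as the global interrupt disable only for IPEND; `decode` is a hypothetical helper name.

```python
# Assumed core event bit names; check against the Blackfin programming reference.
EVENT_BITS = {0: "EMU", 1: "RST", 2: "NMI", 3: "EVX", 5: "IVHW", 6: "IVTMR"}
EVENT_BITS.update({n: "IVG%d" % n for n in range(7, 16)})

def decode(value, register):
    """Return the names of the bits set in an IMASK, ILAT or IPEND value."""
    names = []
    for bit in range(16):
        if not value & (1 << bit):
            continue
        if bit == 4:
            # Bit 4 has a defined meaning only in IPEND (assumption, see lead-in).
            names.append("GLOBAL_INT_DISABLE" if register == "IPEND" else "BIT4")
        else:
            names.append(EVENT_BITS[bit])
    return names

if __name__ == "__main__":
    for register, value in (("IMASK", 0x3F), ("ILAT", 0x20), ("IPEND", 0x19), ("IPEND", 0x11)):
        print(register, hex(value), decode(value, register))
```

Under these assumptions, ILAT=0x20 shows the level 5 event latched but not yet taken, IPEND=0x19 shows the exception active with bit 4 set, and IPEND=0x11 after the RTX shows the exception cleared with bit 4 still set, which is consistent with the reading above.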
So, I think there is possibly a problem in an app which managed to cause _cplb_mgr to raise an exception while interrupts are globally disabled. Unfortunately this looks to be preventing IVT3 from deferring the handling to IVT5, leaving infinite exceptions being generated in sequence.
I'm wondering if deferring the exception handling to IVT5 could be skipped in such cases so that something meaningful can be output?
Mike