2010-10-04 03:54:06 crash busybox process?
Rob Maris (GERMANY)
Message: 94161
I'm encountering a crash (at least two have been captured, and they show comparable traces). I'm not sure whether it is due to the busybox upgrade to 1.17.2; I'm considering stepping back to 1.16.2. Please advise, especially on whether the crash log points to another possible cause.
NULL pointer access
Kernel OOPS in progress
Deferred Exception context
CURRENT PROCESS:
COMM=busybox PID=1374 CPU=0
TEXT = 0x01a80000-0x01ac8de0 DATA = 0x00840de0-0x008443fc
BSS = 0x008443fc-0x01ba0000 USER-STACK = 0x01bbfd30
return address: [0x000c44a2]; contents of:
...
ADSP-BF537-0.2(Detected 0.3) 400(MHz CCLK) 100(MHz SCLK) (mpu off)
Linux version 2.6.34.7-ADI-2010R1-vanbreda (rob@rob-desktop) (gcc version 4.3.5 (ADI-trunk/git-08d8861) ) #889 PREEMPT Wed Sep 29 21:03:52 CEST 2010
SEQUENCER STATUS: Not tainted
SEQSTAT: 00000027 IPEND: 8008 IMASK: ffff SYSCFG: 0006
EXCAUSE : 0x27
physical IVG3 asserted : <0xffa007a0> { _trap + 0x0 }
physical IVG15 asserted : <0xffa00f00> { _evt_system_call + 0x0 }
...
RETE: <0x008a7e20> /* kernel dynamic memory (maybe user-space) */
RETN: <0x00000000> /* Maybe null pointer? */
RETX: <0x00000480> /* Maybe fixed code section */
RETS: <0x000c4710> { _prio_tree_insert + 0x168 }
PC : <0x000c44a2> { _prio_tree_replace + 0x1a }
DCPLB_FAULT_ADDR: <0x00000000> /* Maybe null pointer? */
ICPLB_FAULT_ADDR: <0x000c44a2> { _prio_tree_replace + 0x1a }
The detailed log is attached.
crash2
2010-10-04 12:09:03 Re: crash busybox process?
Mike Frysinger (UNITED STATES)
Message: 94171
that doesnt really look like a crash in busybox. looks more like the kernel crashing. a crash in userspace should not break the kernel.
you could enable the MPU to see if that catches something.
2010-10-05 05:22:46 Re: crash busybox process?
Rob Maris (GERMANY)
Message: 94188
The MPU is described as primarily intended for debugging; more information is not available (not in the docs resources either). If a null pointer exception would then deliver more information than the current exception log, that would help. But I have another problem:
CC arch/blackfin/mach-common/arch_checks.o
arch/blackfin/mach-common/arch_checks.c:51:3: error: #error the MPU will not function safely while Anomaly 05000263 applies
make[1]: *** [arch/blackfin/mach-common/arch_checks.o] Error 1
I get crashes with e.g. the following characteristics (regardless of whether busybox 1.16 or 1.17 is used):
COMM=busybox PID=1374 CPU=0
DCPLB_FAULT_ADDR: <0x00000000> /* Maybe null pointer? */
ICPLB_FAULT_ADDR: <0x000c44a2> { _prio_tree_replace + 0x1a }
6 Target : <0xffa007a0> { _trap + 0x0 }
FAULT : <0x000c44a2> { _prio_tree_replace + 0x1a } P2 = [P1]
Source : <0xffa00566> { _bfin_return_from_exception + 0xe } RTX
COMM=ping PID=4867 CPU=0
DCPLB_FAULT_ADDR: <0x00000004> /* Maybe null pointer? */
ICPLB_FAULT_ADDR: <0x00045d3e> { _vma_prio_tree_add + 0x5a }
or COMM=sh
Illegal use of supervisor resource
COMM=cron1.sh
DCPLB_FAULT_ADDR: <0x0067fe6c> /* kernel dynamic memory (maybe user-space) */
ICPLB_FAULT_ADDR: <0x00045c72> { _vma_prio_tree_add + 0x5a }
(instead of COMM=busybox as with 1.17.2)
I'm aware that this can also be "random". The hardware traces of these crashes look similar (as before).
Nevertheless, if it is a kernel crash, it looks like the crash occurs when shell commands are executed from cron scripts (which indeed use e.g. ping) that do some network queries every minute (no other app is active). Seen from this perspective, the shell isn't to blame for the exceptions. Or does it make sense to try reinstating msh? (BTW: busybox msh can be selected, but actually calls hush.)
Regarding the ICPLB addresses: what confuses me is that they refer to function symbols, not to source file names as in the example on the docs page "Analyzing Traces".
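For comparing several crash logs, the symbol and offset that the kernel prints next to each fault address can be pulled out with a small filter. This is only a sketch: fault_symbols is a hypothetical helper name, and the pattern assumes the exact line layout shown in the traces above.

```shell
#!/bin/sh
# fault_symbols (hypothetical helper): read an oops log on stdin and print
# "REGISTER SYMBOL OFFSET" for every *FAULT_ADDR line that carries a
# resolved symbol. Lines with only a comment, such as
# "/* Maybe null pointer? */", are skipped.
fault_symbols() {
    sed -n 's/^\([A-Z_]*FAULT_ADDR\): *<0x[0-9a-f]*> *{ *\([A-Za-z0-9_]*\) *+ *\(0x[0-9a-f]*\) *}.*$/\1 \2 \3/p'
}

# Example with two lines taken from the first crash in this thread.
fault_symbols <<'EOF'
DCPLB_FAULT_ADDR: <0x00000000> /* Maybe null pointer? */
ICPLB_FAULT_ADDR: <0x000c44a2> { _prio_tree_replace + 0x1a }
EOF
```

To get from such an address to a source file and line, addr2line from binutils can usually be pointed at the kernel's vmlinux, assuming the kernel was built with debug info.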
2010-10-05 14:40:22 Re: crash busybox process?
Mike Frysinger (UNITED STATES)
Message: 94197
considering you're debugging a crash, sounds like enabling the MPU makes perfect sense. do you not have a newer version of silicon to test things on ?
msh is dead. it isnt coming back. if busybox 1.16 is crashing too, then i dont think busybox is the problem. but based on your other posts, you're apparently enabling a lot of busybox utils no one else tests.
how are you talking to the board ? UART console ? over ethernet/telnet ?
2010-10-05 15:19:17 Re: crash busybox process?
Rob Maris (GERMANY)
Message: 94198
msh: no problem, it was just an idea. I think the other point you mention applies anyway. The crash (once every couple of hours) probably does arise because I'm working with some busybox utils no one tests. However, not that many busybox utils are actually enabled. I'll attach my user .config for you.
MPU: I currently have 0.3 silicon. With it, the MPU build completes without errors.
Booting is OK; the CPU load of mpg321 streaming now goes up from approx. 17 % to 23 %.
What should I expect when MPU detects violations?
_config
2010-10-05 15:33:45 Re: crash busybox process?
Rob Maris (GERMANY)
Message: 94199
oh, not to forget: the userland configuration is about the same as before, i.e. as in the production system based upon 2.6.28.9.
QuoteReplyEditDelete
2010-10-05 16:36:42 Re: crash busybox process?
Rob Maris (GERMANY)
Message: 94201
With the MPU enabled, I got the following twice:
Kernel panic - not syncing: Attempted to kill init!
(and the system was dead)
2010-10-06 03:16:05 Re: crash busybox process?
Mike Frysinger (UNITED STATES)
Message: 94223
yes, you'll see loss of performance when the MPU is enabled due to handling of all the 4KiB page misses
when a violation occurs, you'll get the same behavior as under Linux with a MMU -- a segfault when an app tries to access memory that it has no privileges for. you'll probably also get a normal cplb miss/protection violation dump. but it'll be from the misbehaving app rather than some random corrupted location.
2010-10-06 03:17:19 Re: crash busybox process?
Mike Frysinger (UNITED STATES)
Message: 94231
you need to provide some real details ...
are you using the busybox init ? do you see this at boot ? or during runtime ? what is in the kernel log buffer ?
the system being dead is expected behavior when process #1 exits.
2010-10-06 04:52:00 Re: crash busybox process?
Rob Maris (GERMANY)
Message: 94252
No, I don't use the busybox init; it didn't work. The kernel log buffer (boot) is attached.
At this time I have reverted to the system for silicon 0.2 with the MPU off again. The toolchain changed from a 4.3.5 trunk build (from 3 months ago) to 2010R1-RC3. I'm also getting more "conservative" by switching from SLOB to SLAB (SLOB was the first choice, also in the production system, but it yields a compiler warning: __alignof__ seems to deliver a pointer type).
At this time I'm investigating a new problem:
rob@rob-desktop:~$ telnet tcm-bf537
Trying 192.168.1.72...
Connected to tcm-bf537.fritz.box.
Escape character is '^]'.
telnetd: All network ports in use.
Connection closed by foreign host.
Curious. I have to find out which change caused this new trouble.
log2634-7
2010-10-06 05:16:40 Re: crash busybox process?
Mike Frysinger (UNITED STATES)
Message: 94254
usually that means you've incorrectly disabled UNIX98_PTYS in the kernel or you didnt mount devpts on /dev/pts
or you're not using the default telnetd everyone else uses
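The devpts part of that checklist can be verified on the target with a few lines of shell. This is a minimal sketch: has_devpts is a hypothetical helper name, and it assumes a mounts table in the usual /proc/mounts layout (device, mount point, filesystem type, options).

```shell
#!/bin/sh
# has_devpts (hypothetical helper): succeed if the given mounts table
# lists a devpts filesystem mounted on /dev/pts.
# Defaults to /proc/mounts when no file is given.
has_devpts() {
    mounts_file="${1:-/proc/mounts}"
    awk '$2 == "/dev/pts" && $3 == "devpts" { found = 1 } END { exit !found }' "$mounts_file"
}

if has_devpts; then
    echo "devpts is mounted on /dev/pts"
else
    echo "devpts is not mounted; try: mount -t devpts devpts /dev/pts"
fi
```

The kernel side (UNIX98_PTYS) is not covered by this check; it has to be looked up in the kernel .config used for the build.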
2010-10-06 06:04:36 Re: crash busybox process?
Rob Maris (GERMANY)
Message: 94259
Is this a subtle hint that you've read my userspace .config? Okay, I recently tried busybox telnetd (which is the alternative choice), but this doesn't generate an entry in /etc/inetd.conf. But: until yesterday telnetd responded correctly, without any changes in the relevant .config details. That's why it's curious. What I meant to say is that this now has priority over the crash issue.
2010-10-06 06:24:29 Re: crash busybox process?
Rob Maris (GERMANY)
Message: 94263
I have now restored the git commit that corresponds to a correctly operating telnetd, and the error is still there. Thus a make clean must have done it. This also means that the crash problems may be related to some inconsistent stuff left in the build prior to the clean.
2010-10-06 06:31:02 Re: crash busybox process?
Rob Maris (GERMANY)
Message: 94264
.... and of course, I restored the trunk compiler. To be precise: gcc version 4.3.5 (ADI-trunk/git-08d8861). BTW: it yields a 0.1 MB smaller image than gcc version 4.3.5 (ADI-2010R1-RC3).
2010-10-06 08:52:11 Re: crash busybox process?
Rob Maris (GERMANY)
Message: 94271
telnet problem solved:
CONFIG_LEGACY_PTYS ("Legacy (BSD) PTY support") must be active.
Still curious: it worked before with this option disabled. I tried to reproduce that in order to find out whether there is an error in the set of git-versioned files (I have versioned .configs). Proof: set PTY active, make (telnet OK), then git reset --hard, make (without entering the config menu again). And the telnet error is present again. Hence, no SCM-related problem.
Mike: "you're not using the default telnetd everyone else uses". Who is everyone else? But more to the point: what do you consider the default telnetd? The busybox one?
Regarding the crash problem: I think it makes sense that I go for toolchain 2010R1-RC3 again and return here if the system still crashes. BTW, I can't find toolchain git commit 08d8861. Which SVN revision does it correspond to?
2010-10-06 10:00:46 Re: crash busybox process?
Mike Frysinger (UNITED STATES)
Message: 94276
legacy pty support is the exact thing you dont want enabled. your userspace config wrongly has CONFIG_USER_TELNETD_DOES_NOT_USE_OPENPTY enabled.
the vast majority of people are not creative. they use whatever ADI uses in their configuration files.
2010-10-06 10:53:06 Re: crash busybox process?
Rob Maris (GERMANY)
Message: 94284
Hm, politicians also tend to be generic in their statements. I can't be expected to be creative in this matter, since many details of such a big construction are a black box, so one appreciates having documentation, which is something ADI generally does a very good job of.
And, er... once again: what about the preferred telnetd: busybox or no busybox?
Back to the facts: No. Deactivating userspace USER.....OPENPTY while also deactivating the kernel legacy PTY does not work. Remember, legacy PTY support had been off before. And the following comment is given with the userspace setting:
Force telnetd to use its own internal method of opening a pty,
rather than relying upon libc's openpty(). This is included as
a work-around to file permission issues when using uClibc and
ROMFS filesystems.
I'm sorry to pollute this thread with an issue not related to the original one. I had considered putting it elsewhere, namely in the only thread in this space where "All network ports in use" is reported. But that thread has been closed because of "old issue". The truth is that that thread was polluted too, and another guy was going to add something to it that fitted the title, so his motivation was fully OK!
2010-10-06 11:06:37 Re: crash busybox process?
Mike Frysinger (UNITED STATES)
Message: 94288
the default ADI configs show what telnetd we use and a simple grep tells you the answer:
$ grep -h TELNETD.*=y vendors/AnalogDevices/*/* | sort -u
CONFIG_USER_TELNETD_TELNETD=y
you need to rebuild telnetd after changing these options ... it wont rebuild itself
and as the comment you already quoted says -- this option is a workaround hack that shouldnt be enabled in an otherwise correctly configured system. every ADI board gets by just fine without it.
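The same kind of grep can be pointed at one's own userspace config to see whether the workaround option is switched on. This is a rough sketch: option_enabled is a hypothetical helper name, and the default path config/.config as well as the CFG variable are assumptions about the local tree, not something stated in this thread.

```shell
#!/bin/sh
# option_enabled FILE OPTION (hypothetical helper): succeed if FILE
# contains a line of the exact form "OPTION=y".
option_enabled() {
    grep -q "^$2=y\$" "$1"
}

# Assumed location of the userspace config; override with CFG=...
cfg="${CFG:-config/.config}"

if [ -f "$cfg" ] && option_enabled "$cfg" CONFIG_USER_TELNETD_DOES_NOT_USE_OPENPTY; then
    echo "workaround option is enabled: disable it, then rebuild telnetd"
else
    echo "workaround option not enabled (or $cfg not found)"
fi
```

Anchoring the pattern at both ends keeps lines such as "# CONFIG_... is not set" from matching by accident.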
2010-10-06 11:39:14 Re: crash busybox process?
Rob Maris (GERMANY)
Message: 94290
Yes, a make user/telnetd_clean followed by make did it. Now I can also fully explain the intermediate trouble with telnetd. It must have been the make clean at the start of my working day that rebuilt telnetd such that it no longer operated correctly. A few weeks ago I had some trouble with ftpd, and at that time I considered using more busybox features (telnetd as well). This did not succeed, and apparently, when I restored the previous state, one module was not cleaned, which resulted in a build that did not exactly reflect the config state.
As a consequence I realize that not every change in .config leads to a proper build result under all circumstances. Until now, my experience has been that some changes trigger an automatic full rebuild, while most trigger appropriate rebuilds of whatever depends on the changed parameters. Is there any general policy on the build steps required after changes to .config?
2010-10-06 13:20:53 Re: crash busybox process?
Mike Frysinger (UNITED STATES)
Message: 94293
the upstream policy is that you need to `make clean` whenever you change config options. a lot of work has been done in the Blackfin fork to have things rebuild as necessary, but a lot of packages still retain the upstream behavior. things are improved over time, but there are no plans to audit the whole tree and proactively fix things.
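One way to follow that policy without having to remember it is to keep a checksum of the config next to the tree and compare it before each build. This is only a sketch under assumptions: config_changed and record_config are hypothetical helper names, and the paths config/.config and .config.cksum (overridable through CFG and STAMP) are illustrative rather than part of the build system.

```shell
#!/bin/sh
# config_changed CONFIG STAMP (hypothetical helper): succeed if CONFIG
# differs from the checksum stored in STAMP, or if STAMP does not exist yet.
config_changed() {
    [ -f "$2" ] || return 0
    new=$(cksum < "$1")
    old=$(cat "$2")
    [ "$new" != "$old" ]
}

# record_config CONFIG STAMP (hypothetical helper): store the current checksum.
record_config() {
    cksum < "$1" > "$2"
}

cfg="${CFG:-config/.config}"       # assumed config location
stamp="${STAMP:-.config.cksum}"    # assumed stamp file name

if [ -f "$cfg" ]; then
    if config_changed "$cfg" "$stamp"; then
        echo "config changed: run 'make clean' before the next build"
        record_config "$cfg" "$stamp"
    else
        echo "config unchanged since the last check"
    fi
fi
```

A full make clean is the blunt option; where the affected package is known, a per-package clean such as the make user/telnetd_clean mentioned earlier in the thread is cheaper.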