2007-08-28 17:06:28 Kernel panic while exercising compact flash.
Greg Ferrari (UNITED STATES)
Message: 43516 First some background. I have a BF537 with a compact flash. It is connected to a network and is mounting a disk from another machine. The compact flash has three slices: hda1 contains the kernel image from which I boot, hda2 contains an ext2 filesystem and is mounted read/write, hda3 also contains an ext2 filesystem that I use during various tests. I am running with the 2007R1.1-RC3 kernel and toolchain.
I have a test script that exercises the compact flash to make sure it is working properly (there have been issues in the past). The script places a filesystem on hda3, mounts it to /mnt, copies a bunch of files of various sizes one at a time from the network drive to /mnt, unmounts /mnt, performs an MD5 checksum on hda1 (the slice containing my kernel image), and remounts hda3 to /mnt. At various points during the file copies a sync is performed. It does this in a loop forever or until I hit ctrl-C.
In previous versions of the kernel the MD5 checksum would change with each iteration even though hda1 was never modified. This issues appears to have been fixed recently in an SVN version of the kernel and the latest 2007R1.1-RC3 kernel. But I am now seeing kernel panics while running my script. Sometimes the kernel panics in the first iteration of my script, other times it makes it through several iterations before panicing. When it panics I've seen the dump show kswapd and cp as the process that caused problems. There are probably others. Here is some output from my latest test that didn't show a process at all but did indicate a "kernel BUG"...
Hardware Trace:
0 Target : <0x00004618> { _dump_stack + 0x0 }
Source : <0x00031b7a> { _free_pages_bulk + 0x1c6 }
1 Target : <0x00031b7a> { _free_pages_bulk + 0x1c6 }
Source : <0x00031a8c> { _free_pages_bulk + 0xd8 }
2 Target : <0x000319dc> { _free_pages_bulk + 0x28 }
Source : <0x00031b20> { _free_pages_bulk + 0x16c }
3 Target : <0x00031ae4> { _free_pages_bulk + 0x130 }
Source : <0x00031bac> { _free_pages_bulk + 0x1f8 }
4 Target : <0x00031ba8> { _free_pages_bulk + 0x1f4 }
Source : <0x00031a5a> { _free_pages_bulk + 0xa6 }
5 Target : <0x000319dc> { _free_pages_bulk + 0x28 }
Source : <0x00031b20> { _free_pages_bulk + 0x16c }
6 Target : <0x00031ae4> { _free_pages_bulk + 0x130 }
Source : <0x00031bac> { _free_pages_bulk + 0x1f8 }
7 Target : <0x00031ba8> { _free_pages_bulk + 0x1f4 }
Source : <0x00031a5a> { _free_pages_bulk + 0xa6 }
8 Target : <0x000319dc> { _free_pages_bulk + 0x28 }
Source : <0x00031b20> { _free_pages_bulk + 0x16c }
9 Target : <0x00031ae4> { _free_pages_bulk + 0x130 }
Source : <0x00031bac> { _free_pages_bulk + 0x1f8 }
10 Target : <0x00031ba8> { _free_pages_bulk + 0x1f4 }
Source : <0x00031a5a> { _free_pages_bulk + 0xa6 }
11 Target : <0x000319dc> { _free_pages_bulk + 0x28 }
Source : <0x00031b20> { _free_pages_bulk + 0x16c }
12 Target : <0x00031ae4> { _free_pages_bulk + 0x130 }
Source : <0x00031bac> { _free_pages_bulk + 0x1f8 }
13 Target : <0x00031ba8> { _free_pages_bulk + 0x1f4 }
Source : <0x00031a5a> { _free_pages_bulk + 0xa6 }
14 Target : <0x000319b4> { _free_pages_bulk + 0x0 }
Source : <0x00031c8a> { _free_hot_cold_page + 0xce }
15 Target : <0x00031c7e> { _free_hot_cold_page + 0xc2 }
Source : <0x00031c62> { _free_hot_cold_page + 0xa6 }
Stack from 00f73c44:
ffa010da 00031b7e 001b3dd0 00000001 002fb960 0000001b 0002cb84 00000004
00178f24 00031c8e 00145a34 0027d770 00f73d88 0000ffff 00145a18 00180c8c
00f73d5c 0008b01e 00000007 00145a40 00000000 00031cca 00f73cf8 00f73cd8
00000002 00000009 002ffee0 002f6c10 00000000 0002cc92 0027d798 00f73d18
00000009 00000009 00000000 00000000 00000000 00000009 00000000 001ad278
002851a0 002851c8 002a5cc0 002a5ce8 001ad930 001ad958 0027d770 0027d798
Call Trace:
[<0005037c>] _try_to_free_buffers+0x44/0xa0
[<0002d0c0>] ___pagevec_release+0x18/0x24
[<0002d834>] _truncate_inode_pages_range+0x1d4/0x2b8
[<0002d92a>] _truncate_inode_pages+0x12/0x1c
[<00046034>] _dispose_list+0x8c/0xa8
[<0004628a>] _invalidate_inodes+0x96/0xac
[<0003730c>] _generic_shutdown_super+0x5c/0x114
[<000373de>] _kill_block_super+0x1a/0x2c
[<000479c0>] _mntput_no_expire+0x34/0x84
[<0003747e>] _deactivate_super+0x42/0x5c
[<00037484>] _deactivate_super+0x48/0x5c
[<00048b86>] _sys_umount+0x112/0x1a4
[<00048aa2>] _sys_umount+0x2e/0x1a4
[<0004b07c>] _seq_release+0x18/0x24
[<00002000>] _arch_ptrace+0x8c/0x460
[<00034114>] _filp_close+0x40/0x68
[<00034114>] _filp_close+0x40/0x68
[<0003874c>] _sys_readlink+0x1c/0x24
[<00048a74>] _sys_umount+0x0/0x1a4
[<00134380>] _init_currently_empty_zone+0xc/0x130
[<00134380>] _init_currently_empty_zone+0xc/0x130
[<00008000>] _irq_panic+0x4c/0x144
[<00001000>] __start+0x0/0x164
kernel BUG at mm/page_alloc.c:362!
Kernel panic - not syncing: BUG!
Here's another panic that occurred in the sixth iteration of my loop. In this case it shows cp as being the culprit and the output continuously displays. It looks like the kernel got stuck in an infinite loop displaying the panic.
NULL pointer access (probably)
CURRENT PROCESS:
COMM=cp PID=836
TEXT = 0x00500040-0x0054a3a0 DATA = 0x0054a3a4-0x0055dd04
BSS = 0x0055dd04-0x00563ed4 USER-STACK = 0x0057fd30
return address: 0x00028730; contents of [PC-16...PC+8]:
RETE: 00000000 RETN: 0601fc40 RETX: 00028730 RETS: 00028690
IPEND: 8030 SYSCFG: 0036
SEQSTAT: 00060027 SP: 0601fb64
R0: 00000000 R1: 0000ffff R2: 00000000 R3: 0000003f
R4: 00001000 R5: 0601fc8c R6: 0093f1d0 R7: 0000000d
P0: 063f1468 P1: 00000020 P2: 00000000 P3: 00145a18
P4: 063f1450 P5: 063f1470 FP: 00178f24
A0.w: 00000000 A0.x: 00000000 A1.w: 00000000 A1.x: 00000000
LB0: ffa01e88 LT0: ffa01e88 LC0: 00000000
LB1: 00050b30 LT1: 00050b2c LC1: 00000000
B0: cccccccd L0: 00000000 M0: 00000001 I0: 0601fab8
B1: 009ff9b8 L1: 00000000 M1: 00000000 I1: 00000000
B2: 00000001 L2: 00000000 M2: 00180c68 I2: 00000000
B3: 0704bbb0 L3: 00000000 M3: 00000001 I3: 00000001
USP: 0057ea50 ASTAT: 02003024
DCPLB_FAULT_ADDR=00000004
ICPLB_FAULT_ADDR=00028730
Hardware Trace:
0 Target : <0x000046d4> { _trap_c + 0x0 }
Source : <0xffa00b3c> { _exception_to_level5 + 0xb4 }
1 Target : <0xffa00a88> { _exception_to_level5 + 0x0 }
Source : <0xffa00a86> { _ex_trap_c + 0x4e }
2 Target : <0xffa00a38> { _ex_trap_c + 0x0 }
Source : <0xffa00bdc> { _trap + 0x28 }
3 Target : <0xffa00bb4> { _trap + 0x0 }
Source : <0xffa01142> { __common_int_entry + 0xd8 }
4 Target : <0xffa010e0> { __common_int_entry + 0x76 }
Source : <0xffa00ea8> { _return_from_int + 0x58 }
5 Target : <0xffa00ea8> { _return_from_int + 0x58 }
Source : <0xffa00e7e> { _return_from_int + 0x2e }
6 Target : <0xffa00e50> { _return_from_int + 0x0 }
Source : <0xffa010dc> { __common_int_entry + 0x72 }
7 Target : <0xffa010da> { __common_int_entry + 0x70 }
Source : <0xffa0048a> { _asm_do_IRQ + 0x76 }
8 Target : <0xffa00482> { _asm_do_IRQ + 0x6e }
Source : <0x00010d2e> { __local_bh_enable + 0x3e }
9 Target : <0x00010cf0> { __local_bh_enable + 0x0 }
Source : <0x00010fa8> { ___do_softirq + 0x94 }
10 Target : <0x00010fa0> { ___do_softirq + 0x8c }
Source : <0x00010f80> { ___do_softirq + 0x6c }
11 Target : <0x00010f72> { ___do_softirq + 0x5e }
Source : <0x000be4e4> { _net_rx_action + 0xa0 }
12 Target : <0x000be4b4> { _net_rx_action + 0x70 }
Source : <0x000be440> { _process_backlog + 0x120 }
13 Target : <0x000be432> { _process_backlog + 0x112 }
Source : <0x000be42c> { _process_backlog + 0x10c }
14 Target : <0x000be3e8> { _process_backlog + 0xc8 }
Source : <0x000be36e> { _process_backlog + 0x4e }
15 Target : <0x000be368> { _process_backlog + 0x48 }
Source : <0x000be3c8> { _process_backlog + 0xa8 }
Stack from 0601fb44:
00000008 ffa00b40 0014270c 0014270c 0014270c 0087a858 00050490 061528e0
00028730 00008030 00060027 00000000 0601fc40 00028730 00028730 00028690
00000000 02003024 00050b30 ffa01e88 00050b2c ffa01e88 00000000 00000000
00000000 00000000 00000000 00000000 0704bbb0 00000001 009ff9b8 cccccccd
00000000 00000000 00000000 00000000 00000001 00180c68 00000000 00000001
00000001 00000000 00000000 0601fab8 0057ea50 00178f24 063f1470 063f1450
Call Trace:
[<00001000>] __start+0x0/0x164
[<0000ffff>] _do_setitimer+0x25b/0x520
[<00051708>] ___block_prepare_write+0x270/0x420
[<00029922>] _generic_file_buffered_write+0x3b2/0x52c
[<00001000>] __start+0x0/0x164
[<000283ec>] _sync_page+0x2c/0x34
[<00001000>] __start+0x0/0x164
[<00001000>] __start+0x0/0x164
[<00001000>] __start+0x0/0x164
[<00103c24>] _ext2_aops+0x0/0x38
[<00001000>] __start+0x0/0x164
[<00028d5c>] _do_generic_mapping_read+0x334/0x444
[<00010bdc>] _current_fs_time+0x38/0x40
[<00001000>] __start+0x0/0x164
[<00029c72>] ___generic_file_aio_write_nolock+0x1d6/0x400
[<00001000>] __start+0x0/0x164
[<00001000>] __start+0x0/0x164
[<00001000>] __start+0x0/0x164
[<0002a59e>] _generic_file_aio_read+0xa6/0x18c
[<00029ef0>] _generic_file_aio_write+0x54/0xe0
[<000357c4>] _do_sync_write+0xb0/0xe8
[<00001000>] __start+0x0/0x164
[<0001d530>] _autoremove_wake_function+0x0/0x30
[<0001d530>] _autoremove_wake_function+0x0/0x30
[<00001000>] __start+0x0/0x164
[<00001000>] __start+0x0/0x164
[<00001000>] __start+0x0/0x164
[<00035fc6>] _vfs_write+0x82/0x14c
[<00001000>] __start+0x0/0x164
[<00036532>] _sys_write+0x32/0x64
[<00001000>] __start+0x0/0x164
[<00036500>] _sys_write+0x0/0x64
[<00001000>] __start+0x0/0x164
[<00008000>] _irq_panic+0x4c/0x144
[<00001000>] __start+0x0/0x164
[<00001000>] __start+0x0/0x164
[<00001000>] __start+0x0/0x164
I suspect that my kernel panics are not really related to exercising the compact flash, but are really caused by some memory management problems in the kernel. Copying all these files around is just a fast way of running into the bug. I also see similar kernel panics with some of my applications that consume relatively large amounts of memory. These applications run fine on my desktop machine and don't cause valgrind any heartburn. I was not running any of my own applications during the tests mentioned above however.
Does anyone out there have any idea what could be going wrong? Any advise on how to better diagnose the problem?
QuoteReplyEditDelete
2007-08-28 20:03:23 Re: Kernel panic while exercising compact flash.
Robin Getz (UNITED STATES)
Message: 43521 Greg:
The only major difference that I can think of in the new R1.1 is the page cache limiter, Can you try various values (100, 50, 10) in /proc/sys/vm/pagecache_ratio to check?
Thanks
-Robin
QuoteReplyEditDelete
2007-08-28 20:04:55 Re: Kernel panic while exercising compact flash.
Robin Getz (UNITED STATES)
Message: 43522 Greg:
If you could detail a little more of your tests, we can try to replicate the failure & ensure we don't run into the same things in the future.
-Robin
QuoteReplyEditDelete
2007-08-29 09:53:58 Re: Kernel panic while exercising compact flash.
Greg Ferrari (UNITED STATES)
Message: 43570 Robin:
It's funny that you mention the page cache limiter. I did indeed turn that on in this release because I was seeing my applications running out of memory. After investigating the cause I discovered that the kernel was holding on to memory way to aggressively and starving my applications. So I turned on the limiter, set the total page cache limit to 10MB, and set the page cache limit for every file to 32 pages. That solved my problem.
I tried the three page cache ratio values you suggested by editing /proc/sys/vm/pagecache_ratio file directly (I hope that was right). I still saw kernel panics.
I have attached two files with this reply. The first is the script I use to test the CF and cause kernel panics. The second is a text file with details of my environment.
Please let me know if you are successful at recreating the problem. I will continue to try and isolate the problem to a very small and specific test case. If there is anything else you would like me to try, please let me know.
Thanks for your help.
Greg
cfTest.txt
cfTest.adi
QuoteReplyEditDelete
2007-08-29 11:26:32 Re: Kernel panic while exercising compact flash.
Greg Ferrari (UNITED STATES)
Message: 43578 Robin:
Here's a simpler version of my script that is still causing kernel panics. In this version I have removed all of the file copies from my NFS disk and replaced it with a loop that copies all the files in /bin to my CF. This eliminates all the NFS traffic and focuses on writing to CF. I got the same "kernel BUG" mentioned earlier.
Greg
cfTest.noNFS
QuoteReplyEditDelete
2007-08-29 14:57:59 Re: Kernel panic while exercising compact flash.
Greg Ferrari (UNITED STATES)
Message: 43582 Robin:
Here's even more information. I have attached the kernel and user configurations from 'make menuconfig' as well as the boot sequence of my development board. Note that I am running on a 128MB development board (part #ADZS-BF537-EZLITE) with silicon rev 0.3 and an ADI compact flash daughter card. Could the fact that I have 128MB instead of the more common 64MB have anything to do with it?
Greg
bootsequence.txt
user.config
kernel.config
QuoteReplyEditDelete
2007-09-28 11:44:22 Re: Kernel panic while exercising compact flash.
Michael Hennerich (GERMANY)
Message: 44692 Greg,
[*] Limit page caches
(10) Total page cache limit in MB unit
(32) Page cache limit for every file in page unit
You should NOT set this option.
Its buggy and causing the issues that you are seeing.
Simply do.
root:~> echo 10 > /proc/sys/vm/pagecache_ratio
I also noticed that you are running SCLK @ 100MHz.
This can also cause issues with some CF cards.
I also checked some patches for IDE driver into svn.
(svn trunk and branch 2007R1)
Your test script runs now since hours on two boards without any faults.
Please update your kernel sources.
-Michael
QuoteReplyEditDelete
2008-01-30 21:32:51 Re: Kernel panic while exercising compact flash.
Wang Xiang (CHINA)
Message: 50408
Hello,Michael
1) Could you tell me how to switch off the set below?
[*] Limit page caches
(10) Total page cache limit in MB unit
(32) Page cache limit for every file in page unit
I do not know where the set is in menuconfig?
2) I can not find the pagecache_ratio in /proc/sys/vm; Which should be set in menuconfig?
Regards
Wang Xiang
QuoteReplyEditDelete
2008-01-31 03:13:00 Re: Kernel panic while exercising compact flash.
Yi Li (CHINA)
Message: 50417 > 1) Could you tell me how to switch off the set below?
These options does not exist in newer kernel (e.g, linux-2.6.22 and later).
> 2) I can not find the pagecache_ratio in /proc/sys/vm; Which should be set in menuconfig?
Please use newer release- e.g, 2008R1 is using linux-2.6.22.
Here is mine:
root:/> ls /proc/sys/vm/
block_dump nr_pdflush_threads
dirty_background_ratio overcommit_memory
dirty_expire_centisecs overcommit_ratio
dirty_ratio page-cluster
dirty_writeback_centisecs pagecache_ratio
drop_caches panic_on_oom
laptop_mode percpu_pagelist_fraction
lowmem_reserve_ratio swappiness
min_free_kbytes vfs_cache_pressure