User processes in D-state leads to a watchdog reset using Linux 2.6.24 and arm processor

Question

Most of the user space processes are ending up in D-state after the unit runs for around 3-4 days, the unit is running on ARM processor. From the top o/p we can see that processes that are in D-state are waiting on system calls "page_fault" and "squashfs_readpage". Utimately this leads to a watchdog reset. The processes that go into D-sate would take unusually long time to recover.

Following is the top o/p when the system ends up in trouble:

top - 12:00:11 up 3 days,  2:40,  3 users,  load average: 2.77, 1.90, 1.72
Tasks: 250 total,   3 running, 238 sleeping,   0 stopped,   9 zombie
Cpu(s): 10.0% us, 75.5% sy,  0.0% ni,  0.0% id, 10.3% wa,  0.0% hi,  4.2% si
Mem:    191324k total,   188896k used,     2428k free,     2548k buffers
Swap:        0k total,        0k used,        0k free,    87920k cached



 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

 1003 root      20   0  225m  31m 4044 S 15.2 16.7   0:21.91 user_process_1             
 3745 root      20   0 80776 9476 3196 **D**  9.0  5.0   1:31.79 user_process_2                
  129 root      15  -5     0    0    0 S  7.4  0.0   0:27.65 **mtdblockd**          
 4624 root      20   0  3640  256  160 **D**  6.5  0.1   0:00.20 GetCounters_cus    
    3 root      15  -5     0    0    0 S  3.2  0.0  43:38.73 ksoftirqd/0        
31363 root      20   0  2356 1176  792 R  2.6  0.6  40:09.58 top                
  347 root      30  10     0    0    0 S  1.9  0.0  28:56.04 **jffs2_gcd_mtd3**     
 1169 root      20   0  225m  31m 4044 S  1.9 16.7  39:31.36 user_process_1             
  604 root      20   0     0    0    0 S  1.6  0.0  27:22.76 user_process_3           
 1069 root     -23   0  225m  31m 4044 S  1.3 16.7  20:45.39 user_process_1             
 4545 root      20   0  3640  564  468 S  1.0  0.3   0:00.08 GetCounters_cus    
   64 root      15  -5     0    0    0 **D**  0.3  0.0   0:00.83 **kswapd0**            
  969 root      20   0 20780 1856 1376 S  0.3  1.0  14:18.89 user_process_4             
  973 root      20   0  225m  31m 4044 S  0.3 16.7   3:35.74 user_process_1             
 1070 root     -23   0  225m  31m 4044 S  0.3 16.7  16:41.04 user_process_1             
 1151 root     -81   0  225m  31m 4044 S  0.3 16.7  23:13.05 user_process_1             
 1152 root     -99   0  225m  31m 4044 S  0.3 16.7   8:48.47 user_process_1

One more interesting observation is that when the system lands up in this problem, we can consistently see "mtdblockd" process running in the top o/p. We have swap disabled on this unit. there is no apparent memory leak in the unit.

Any idea what could be the possible reasons, the processes are stuck in D-sates?

Phillip Lougher · Accepted Answer

D-state means the processes are stuck in the kernel in a TASK_UNINTERRUPTIBLE sleep, this is unlikely to be bugs in the Squashfs error handling code because if a process exited Squashfs holding a mutex, the system would quickly grind to a halt as other processes entered Squashfs and slept forever waiting for the mutex. You would also see a low load average/system time as most processes would be sleeping. Furthermore there is no evidence Squashfs has hit any I/O errors.

Load average (2.77) and system time (75.5%) is extremely high, coupled with the fact a lot of processes are in Squashfs_readpage (which is completing but slow), indicates the system is thrashing. There is too little memory and the system is spending all it's time constantly (re-)demand paging pages from disk. This will account for the fact a lot of processes are in Squashfs_readpage, system time is extremely high because the system is spending most of its time in Squashfs in the CPU intensive task of decompression. The other processes are stuck in Squashfs waiting on the decompressor mutex (only one process can be decompressing at a time because the decompressor state is shared).

User processes in D-state leads to a watchdog reset using Linux 2.6.24 and arm processor

Tags:

caching

process

linux-kernel

block-device

jffs2

user2500217

1 Answers

Phillip Lougher

Recent Activity

Donate For Us

User processes in D-state leads to a watchdog reset using Linux 2.6.24 and arm processor

Tags:

caching

process

linux-kernel

block-device

jffs2

user2500217

1 Answers

Phillip Lougher

Related questions

Recent Activity

Donate For Us