Symptom:
The host CPU load gradually ramps up.
The load average is not very high: around 20, on a host with 72 CPUs.
The top command does not show any obvious top CPU consumers.
Eventually the OS crashes and reboots itself.
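A rising load average with no visible CPU consumer is consistent with tasks stuck waiting on I/O, because on Linux the load average also counts tasks in uninterruptible sleep (state D). The following is a minimal sketch, not part of the original investigation, that normalises the load by CPU count and counts D-state processes from /proc. The function names are illustrative, and it only inspects processes, not individual threads.

```python
import glob
import os


def load_per_cpu(load_avg, cpu_count):
    """Return the load average normalised by the number of CPUs."""
    return load_avg / cpu_count


def task_state(stat_line):
    """Extract the state letter from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')'.
    """
    after_comm = stat_line[stat_line.rfind(")") + 1:]
    return after_comm.split()[0]


def count_blocked_tasks(proc_root="/proc"):
    """Count processes currently in uninterruptible sleep (state 'D')."""
    blocked = 0
    for path in glob.glob(os.path.join(proc_root, "[0-9]*", "stat")):
        try:
            with open(path) as handle:
                if task_state(handle.read()) == "D":
                    blocked += 1
        except (OSError, IndexError):
            # The process exited between listing and reading; skip it.
            continue
    return blocked


if __name__ == "__main__":
    one_minute, _, _ = os.getloadavg()
    print("load per CPU:", round(load_per_cpu(one_minute, os.cpu_count()), 2))
    print("processes in D state:", count_blocked_tasks())
```

With the figures from this host, load_per_cpu(20, 72) is roughly 0.28, so the CPUs themselves are far from saturated.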
Diagnosis:
The OS logs report a hung dm device, followed by the MegaRAID SAS controller dying and being reset.
The OS messages contain errors like:
Apr 23 02:32:00 host1 kernel: [160668.787368] INFO: task jbd2/dm-0-8:1513 blocked for more than 120 seconds.
Apr 23 02:32:00 host1 kernel: [160668.795203] Tainted: P OE 4.1.12-94.7.8.el6uek.x86_64 #2
Apr 23 02:32:00 host1 kernel: [160668.803039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this messag
Apr 23 02:32:43 host1 kernel: [160711.104126] sd 0:2:0:0: [sda] tag#138 megasas: target reset FAILED!!
Apr 23 02:32:43 host1 kernel: [160711.104138] megaraid_sas 0000:23:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware
Apr 23 02:32:43 host1 kernel: [160711.493900] megaraid_sas 0000:23:00.0: Number of host crash buffers allocated: 512
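To check whether other hosts show the same sequence, the messages above can be searched for mechanically. This is a rough sketch: the substrings are taken from the log excerpt quoted here, while the labels, the function name, and the log path in the usage note are assumptions.

```python
import re

# Substrings taken from the messages quoted above; the labels are illustrative.
PATTERNS = {
    "hung task": re.compile(r"blocked for more than \d+ seconds"),
    "target reset failed": re.compile(r"megasas: target reset FAILED"),
    "firmware fault": re.compile(r"forcibly FAULT Firmware"),
}


def classify(lines):
    """Count how many lines match each storage-related pattern."""
    counts = {label: 0 for label in PATTERNS}
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

A typical use would be classify(open("/var/log/messages", errors="replace")), assuming the kernel messages are written to that file on the host in question.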
We found an LVM snapshot on the DB home (/u01):
LV          VG       Attr        LSize   Pool  Origin    Data%
u01_backup  VGExaDb  swi-a-s---  13.65g        LVDbOra1  14.54
Data% is 14.54. Because /u01 receives most of the local disk writes, the snapshot slows overall performance.
Eventually this results in the MegaRAID firmware fault.
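The snapshot check can be automated along these lines. This is a sketch under assumptions: it expects comma-separated output from lvs run with the --noheadings, --separator and -o options (fields lv_name, vg_name, lv_attr, origin, data_percent), and the function names are hypothetical. The first character of the Attr column identifies the volume type, with s for a snapshot and S for a merging snapshot.

```python
import subprocess

# Assumed invocation; requires root and the LVM tools on the host.
LVS_COMMAND = ["lvs", "--noheadings", "--separator", ",",
               "-o", "lv_name,vg_name,lv_attr,origin,data_percent"]


def find_snapshots(lvs_text):
    """Return (snapshot, vg, origin, data_percent) for each snapshot LV."""
    snapshots = []
    for raw in lvs_text.splitlines():
        fields = [field.strip() for field in raw.split(",")]
        if len(fields) != 5:
            continue
        name, vg, attr, origin, data = fields
        # First attr character: 's' is a snapshot, 'S' a merging snapshot.
        if attr[:1] in ("s", "S"):
            snapshots.append((name, vg, origin, float(data) if data else None))
    return snapshots


def list_snapshots():
    """Run lvs and report the snapshot volumes it lists."""
    output = subprocess.run(LVS_COMMAND, check=True, capture_output=True,
                            text=True).stdout
    return find_snapshots(output)
```

For the volume shown above, find_snapshots would report u01_backup in VGExaDb with origin LVDbOra1 and a Data% of 14.54.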