Our report on the incident of 05/04/2012 is as follows.
Issue
sms-sagat unresponsive
Underlying cause
Memory page fault caused a kernel panic
Symptoms
Complete loss of service on sms-sagat
Resolution
- After the server was detected as down, the machine’s serial console output was reviewed and showed a kernel panic.
- The system was powered down, the memory was re-seated, and the system was booted into a rescue environment to run memtest+.
- Memtest completed one pass without error.
- The server was then booted back into its normal run level.
Continual memory tests are running on the system and have so far shown no errors. The fault is therefore assumed to be software rather than hardware.
The RAID array is also degraded and is being rebuilt, so performance is currently limited.
–
Follow Up
A SMART test was run on all drives, and one drive reported bad sectors. That drive has been removed and replaced, and the RAID array is rebuilding. An off-line snapshot of the system has been taken whilst the RAID array is degraded.
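
For anyone tracking a rebuild like this one, the following is a minimal sketch of how rebuild progress could be read programmatically. It assumes the array is Linux software RAID (md) and that its status is exposed in the usual /proc/mdstat text format; this report does not say which RAID implementation sms-sagat uses, so treat that as an assumption. The function name parse_mdstat is hypothetical and exists only for illustration.

```python
import re


def parse_mdstat(text):
    """Parse /proc/mdstat-style text (assumed Linux md RAID).

    Returns a dict mapping each array name to
    {"degraded": bool, "progress": float or None}.
    """
    arrays = {}
    current = None
    for line in text.splitlines():
        # An array section starts with a line such as "md0 : active raid1 ..."
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            arrays[current] = {"degraded": False, "progress": None}
            continue
        if current is None:
            continue
        # "[2/1] [U_]" means 2 devices expected, 1 currently active.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status:
            expected, active = int(status.group(1)), int(status.group(2))
            arrays[current]["degraded"] = active < expected
        # "recovery = 3.2%" (or "resync = ...") appears while rebuilding.
        progress = re.search(r"(?:recovery|resync)\s*=\s*([\d.]+)%", line)
        if progress:
            arrays[current]["progress"] = float(progress.group(1))
    return arrays
```

On a live system the input would come from reading /proc/mdstat; hardware RAID controllers report rebuild status through their own vendor tools and would need a different approach.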