Saturday 5th May 2012

Network 05/04/2012 downtime explanation

Our report from the incident on 05/04/2012 is as follows.
 
Issue

sms-sagat unresponsive

Underlying cause

Memory page fault caused a kernel panic

Symptoms

Complete loss of service on sms-sagat

Resolution

  1. After the server was detected as down, the machine’s serial console output was reviewed and showed a kernel panic.
  2. The system was powered down, the memory was re-seated, and the system was powered up into a rescue environment to run memtest+.
  3. Memtest completed 1 pass without error.
  4. The server was powered back on into its normal run level.

Continual memory tests are running on the system and have so far completed without error. We therefore assume the panic was caused by a software fault rather than a hardware fault.

The RAID array is also degraded and is being rebuilt, so performance is limited until the rebuild completes.
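
For readers who want to track a rebuild like this themselves, here is a minimal sketch. It assumes Linux software RAID (md), which this report does not confirm for sms-sagat; a hardware controller would need its own vendor tool. The function name and the sample text are hypothetical and only illustrate the /proc/mdstat layout.

```python
import re

# Illustrative sample in the /proc/mdstat layout (not taken from sms-sagat).
SAMPLE_MDSTAT = """Personalities : [raid1]
md0 : active raid1 sdb1[2] sda1[0]
      104320 blocks [2/1] [U_]
      [==>..................]  recovery = 12.6% (13144/104320) finish=0.1min speed=13144K/sec

md1 : active raid1 sdb2[1] sda2[0]
      292945152 blocks [2/2] [UU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded md array to its rebuild percentage (None if not rebuilding)."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line such as "md0 : active raid1 ...".
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        if current is None:
            continue
        # The status line ends with e.g. "[2/1] [U_]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and "_" in status.group(3):
            result[current] = None
        # A rebuild in progress reports "recovery = NN.N%" (or "resync = ...").
        progress = re.search(r"(?:recovery|resync)\s*=\s*([\d.]+)%", line)
        if progress and current in result:
            result[current] = float(progress.group(1))
    return result


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live md system the same function could be fed the contents of /proc/mdstat instead of the sample text.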

Follow Up

A SMART test was run on all drives, and one drive reported bad sectors. As a result, this drive has been removed and replaced, and the RAID array is rebuilding. An off-line snapshot of the system has been taken whilst the RAID array is degraded.
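
As an illustration of the kind of check described above, the following sketch scans a SMART attribute table for non-zero bad-sector counters. It assumes the output was produced by smartmontools (`smartctl -A`) and saved as text; the report does not state which tool was used, and the function names and sample values are hypothetical.

```python
# Attributes commonly used to spot failing sectors in a smartctl attribute table.
SECTOR_ATTRIBUTES = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# Illustrative sample in the `smartctl -A` table layout (values are made up).
SAMPLE_SMART = """ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9123
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def bad_sector_counts(smartctl_output):
    """Return the raw value of each sector-related attribute found in the table."""
    counts = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry the raw value in column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in SECTOR_ATTRIBUTES:
            raw = fields[9]
            if raw.isdigit():
                counts[fields[1]] = int(raw)
    return counts


def has_bad_sectors(smartctl_output):
    """True if any sector-related counter is above zero."""
    return any(value > 0 for value in bad_sector_counts(smartctl_output).values())


if __name__ == "__main__":
    print(bad_sector_counts(SAMPLE_SMART), has_bad_sectors(SAMPLE_SMART))
```

A non-zero counter is only a prompt for closer inspection; the decision to replace a drive still rests with whoever reviews the full SMART report.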