Monday 12th May 2014

Network Issue with sms-sagat

sms-sagat has stopped responding; its root partition has been remounted read-only.

  • Update (09:30): The server encountered the same issue 10 days ago. The root partition has been remounted read-only. The machine has been rebooted and a filesystem check is under way, currently at 13%.
  • Update (09:54): The automatic fsck has failed, and a manual fsck has also failed. Once again, a drive is showing as failing; we are running a SMART test on the affected drive now (see the SMART check sketch after this list).
  • Update (11:04): We have taken the drives out of this chassis and installed them into a completely new chassis (to rule out backplane/motherboard failure). During a filesystem check, the drive indicated as failing is once again showing the same symptoms of failure, making the filesystem check almost impossible to complete. Updates to follow.
  • Update (12:00): We are attempting a manual fsck on the filesystem. Should this fail again, we intend to let the RAID array rebuild, then remove the failed disk, then attempt an fsck on the remaining healthy disk. Bad blocks on the failing disk that cannot be reallocated are causing the fsck to fail.
  • Update (12:06): The fsck is at 66.3%
  • Update (12:09): The fsck is at 80.2%
  • Update (12:13): The fsck completed without error and the server is rebooting.
  • Update (12:25): The server has booted; however, we need to let the RAID rebuild complete so that the failing disk can be removed as quickly as possible. Maintenance pages will be put up on customer sites to expedite the RAID rebuild process.
  • Update (13:48): The RAID rebuild is currently at 11%. ETA 155 mins.
  • Update (14:11): The RAID rebuild is currently at 25%. ETA 135 mins.
  • Update (14:47): The RAID rebuild is currently at 41%. ETA 114 mins.
  • Update (15:08): The RAID rebuild is currently at 47%. ETA 173 mins.
  • Update (15:24): The RAID rebuild is currently at 52%. ETA 185 mins.
  • Update (16:06): The RAID rebuild is currently at 73%. ETA 45 mins.
  • Update (18:31): The RAID rebuild is currently at 84%. ETA 19 mins.
  • Update (18:45): The RAID rebuild has failed. The issue stems from bad sectors on the failing disk, which cause the rebuild to reach around 95% before halting and dropping the good drive out of the array. We are therefore resorting to creating a new array and manually copying the data between the two; once complete, we will reboot the server to load the new root partition.
  • Update (19:21): The data copy is currently at 3%. ETA 1 hr 26 mins.
  • Update (19:51): The data copy is currently at 35%. ETA 1 hr 2 mins.
  • Update (20:31): The data copy is currently at 49%. ETA 1 hr 16 mins.
  • Update (21:13): The data copy is currently at 66%. ETA 56 mins.
  • Update (21:51): The data copy is currently at 73%. ETA 54 mins.
  • Update (23:19): The data copy is complete and the server is being rebooted; a final data sync will take place before services are brought up.
  • Update (03:51): The data copy completed without data loss. However, the primary MySQL data file (ibdata1) was stored on a section of the failing drive containing bad sectors, so the file could not be copied cleanly. We have been able to start the MySQL daemon in a separate environment and are dumping and importing the databases one by one (see the dump/import sketch after this list). We have yet to encounter an error, but it is possible that a bad sector may still be hit whilst retrieving data. Once all databases have been imported, we can remove the respective maintenance pages.
  • Update (05:04): Customer websites are now on-line.
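
For reference, the drive-health check described in the 09:54 update can be scripted along the following lines. This is an illustrative sketch only, assuming smartmontools is installed and using /dev/sdb as a placeholder for the affected drive; it starts an extended self-test and prints the SMART attributes that most commonly indicate bad sectors.

    #!/usr/bin/env python3
    # Illustrative sketch: start a SMART extended self-test and report the
    # attributes that usually betray bad sectors. Assumes smartmontools is
    # installed; /dev/sdb is a placeholder for the affected drive.
    import subprocess

    DRIVE = "/dev/sdb"  # hypothetical device name

    # Kick off the extended (long) self-test; it runs in the background on the drive.
    subprocess.run(["smartctl", "-t", "long", DRIVE])

    # Dump the full SMART report. smartctl's exit status is a bitmask that is
    # non-zero for an unhealthy drive, so it is not treated as a script error.
    report = subprocess.run(["smartctl", "-a", DRIVE],
                            capture_output=True, text=True).stdout
    for line in report.splitlines():
        if any(attr in line for attr in ("Reallocated_Sector_Ct",
                                         "Current_Pending_Sector",
                                         "Offline_Uncorrectable")):
            print(line.strip())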
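
Similarly, the per-database dump and import described in the 03:51 update boils down to a loop like the one below. This is a minimal sketch rather than the exact procedure used: the host names, credentials and temporary dump path are all placeholders.

    #!/usr/bin/env python3
    # Minimal sketch of dumping databases one by one from a recovery MySQL
    # instance and importing them into the rebuilt server. Host names and
    # credentials are placeholders, not the ones used during the incident.
    import subprocess

    SOURCE = ["-h", "recovery-host", "-u", "root", "-psecret"]  # hypothetical
    TARGET = ["-h", "new-host", "-u", "root", "-psecret"]       # hypothetical
    SKIP = {"information_schema", "performance_schema", "mysql"}

    # List the databases on the recovery instance.
    out = subprocess.run(["mysql", *SOURCE, "-N", "-e", "SHOW DATABASES"],
                         capture_output=True, text=True, check=True).stdout
    for db in out.split():
        if db in SKIP:
            continue
        print(f"Dumping and importing {db} ...")
        dump = f"/tmp/{db}.sql"
        # Dump one database; --single-transaction gives a consistent InnoDB snapshot.
        with open(dump, "w") as fh:
            subprocess.run(["mysqldump", *SOURCE, "--single-transaction", db],
                           stdout=fh, check=True)
        # Recreate the database on the new server and load the dump into it.
        subprocess.run(["mysql", *TARGET, "-e",
                        f"CREATE DATABASE IF NOT EXISTS `{db}`"], check=True)
        with open(dump) as fh:
            subprocess.run(["mysql", *TARGET, db], stdin=fh, check=True)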

Post-Mortem

Our report on the incident is as follows.

Issue

Total service outage on sms-sagat

Outage Length

23 hours and 3 minutes.

Underlying cause

An HDD developed bad sectors, which caused the RAID array to fail. Counter-intuitively, the healthy drive was forced out of the array because the faulty drive was unable to replicate its data to it: the rebuild halted on the bad sectors and the partially rebuilt healthy drive was dropped from the array.

Symptoms

Our internal and external monitoring probes did not report a fault. The machine was still responding to ICMP and HTTP tests, and some customer websites appeared to be browsable, but the read-only filesystem meant that they were effectively unusable.
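
One way to close this gap is a probe that checks whether the filesystem is actually writable, rather than relying on ICMP and HTTP responses alone. Below is a minimal sketch of such a check, assuming it runs locally on the server; the mount point and Nagios-style exit codes are illustrative, not part of our current monitoring configuration.

    #!/usr/bin/env python3
    # Minimal sketch of a probe that detects a read-only root filesystem,
    # which ICMP and HTTP checks alone will miss. The path and exit codes
    # are illustrative placeholders.
    import os
    import sys
    import tempfile

    MOUNT_POINT = "/"

    def is_writable(path: str) -> bool:
        # Check the mount flags first: ST_RDONLY is set when remounted read-only.
        if os.statvfs(path).f_flag & os.ST_RDONLY:
            return False
        # Also attempt a real write, to catch failures the flag does not reflect.
        try:
            with tempfile.NamedTemporaryFile(dir=path):
                pass
            return True
        except OSError:
            return False

    if __name__ == "__main__":
        if is_writable(MOUNT_POINT):
            print("OK: filesystem is writable")
            sys.exit(0)
        print("CRITICAL: filesystem is read-only")
        sys.exit(2)  # Nagios-style "critical" exit code

Run from cron or an existing monitoring agent, a failed write here would have flagged the read-only root partition even while the ICMP and HTTP checks were still passing.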

Resolution

We immediately removed the drives from the server and installed them into a brand new server, to rule out power supply, motherboard or backplane failure.

We attempted to rebuild the RAID array in order to remove the failed drive; however, this proved futile.

Instead, we were forced to create a new single-disk RAID array, then recover the data from the failing disk to it. We did encounter issues with the MySQL databases, which raised concern that there may have been data loss; however, all indications are that the data was retrieved without error. Retrieving the data took a considerable amount of time and was the primary cause of such an extended outage.
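
For illustration, the recovery path described above amounts to roughly the following: build a new, degraded RAID-1 array on the replacement disk, then copy the data across from the failing array. The device names, mount points and filesystem type below are placeholders; the exact commands run during the incident may have differed.

    #!/usr/bin/env python3
    # Illustrative sketch of the recovery path: create a new single-disk
    # (degraded) RAID-1 array on the replacement drive and copy the data over
    # from the failing array. All device names and mount points are placeholders.
    import subprocess

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # New RAID-1 with one member and one slot left "missing", so a second
    # disk can be added once the copy is done.
    run("mdadm", "--create", "/dev/md1", "--level=1", "--raid-devices=2",
        "/dev/sdc1", "missing")
    run("mkfs.ext4", "/dev/md1")

    # Mount the old (failing) array read-only and the new array read-write.
    run("mount", "-o", "ro", "/dev/md0", "/mnt/old-root")
    run("mount", "/dev/md1", "/mnt/new-root")

    # Copy everything, preserving ownership, permissions, hard links and xattrs.
    # rsync reports unreadable files but carries on with the rest, so its exit
    # status is inspected separately rather than aborting the whole script.
    result = subprocess.run(["rsync", "-aHAX", "--numeric-ids",
                             "/mnt/old-root/", "/mnt/new-root/"])
    print("rsync exit status:", result.returncode)

Building the new array with one slot "missing" is what allows a healthy second disk to be added and synced later without recreating the array.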

The likelihood of a brand new drive failing within 10 days of two other drives failing is extremely low. We have not completed testing the old chassis, but at this point we can only assume a power-related issue, such as overvolting of the drives, has been causing the drive failures.