Friday 2nd May 2014

Network Issue with sms-sagat

sms-sagat has stopped responding; initial indications point to an issue with the RAID controller. Updates to follow.

  • Update (09:54): The server was restarted and is currently running a disk consistency check to verify there is no damage to the file system. This could take up to an hour to complete. It appears that a single drive dropped out of the RAID array and (for reasons currently unknown) the file system mounted as read-only as a precautionary measure.
  • Update (10:05): The automatic file system check has failed, and a manual check is now being run.
  • Update (10:42): The file system check failed again and the root cause is a failing disk drive. The failed drive has been replaced and a disk check is running once again.
  • Update (11:07): The file system check is currently at 45% - estimates are that it will complete in 43 minutes (if successful).
  • Update (11:20): The automatic system check has failed again, which indicates a failure of both disk drives, a failure of the physical backplane the drives are attached to, or a RAM fault with the machine. The server will be relocated into another blade chassis with new RAM to rule out backplane and RAM failure.
  • Update (12:16): The remaining working drive has been relocated into another chassis and a file system check is running again. The other drive, removed earlier today, is confirmed as completely failed, with its data irrecoverable. Estimates at this point indicate at least another 60 minutes of downtime.
  • Update (13:06): The disk check has completed with a few minor errors. The disk is being cloned prior to a RAID rebuild (in case of a drive failure during the rebuild). Once the clone completes, the server will be powered on and the RAID rebuild can begin.
  • Update (14:03): Unfortunately, further errors have been thrown during the disk clone, so more time must be spent recovering the data. The current estimate is at least another 2 hours from this point.
  • Update (14:59): The second-to-last step of recovery is underway; progress is currently at 7%.
  • Update (15:08): Progress is currently at 12.5% - ETA for full fix is 4 hours.
  • Update (16:00): The recovery step has encountered a further error, with a stream of access errors coming from the disk drive itself. Data recovery is looking ever less likely at this stage.
  • Update (16:25): We have now managed to mount the (damaged) file system and are attempting to copy files from disk to disk, after which the server should be bootable (assuming there are no damaged files). Progress during the copy is relatively quick - we’ll calculate an ETA once we’ve got a few minutes of statistics. (A rough sketch of this kind of error-tolerant copy follows the list of updates.)
  • Update (19:15): Data recovery is continuing, but slowly. With 400GB of data to copy, the speed is jumping between 1MB/s and 80MB/s (due to the failing source drive). Completion time is unknown - but we are confident of a full restoration.
  • Update (21:12): Almost 100% of data has been recovered and copied onto a fresh pair of drives, to form a new RAID array. We anticipate the server being up within the next 2 hours - however, performance will be severely limited during the RAID rebuild.
  • Update (02:52): The server has booted successfully. MySQL checks are running on all databases to ensure integrity, followed by the restoration of the remaining files to the machine.
  • Update (08:31): All checks have passed and all customer data has been successfully recovered - the server is now back up. Customers may experience some performance issues whilst the RAID array continues to rebuild. As we cannot check each customer store, customers are encouraged to contact us if they experience any issues.
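
The exact tooling used for the recovery is not detailed above, but conceptually the copy described in the 16:25 and 19:15 updates works like the sketch below: walk the damaged (read-only) mount file by file, copy whatever is readable onto the new array, and log anything the failing drive refuses to return. This is an illustrative Python sketch only - the paths and log location are placeholders, not those used on sms-sagat.

    #!/usr/bin/env python3
    """Rough sketch of an error-tolerant file copy from a failing, read-only
    mount onto a freshly built array. All paths are placeholders."""

    import os
    import shutil

    SOURCE = "/mnt/failing"      # damaged file system, mounted read-only
    DEST = "/mnt/new-array"      # fresh RAID array
    FAILED_LOG = "/root/unreadable-files.log"

    def copy_tree(source: str, dest: str) -> None:
        """Copy everything that is readable, logging files the drive cannot return."""
        with open(FAILED_LOG, "w") as log:
            for dirpath, _dirs, filenames in os.walk(source):
                rel = os.path.relpath(dirpath, source)
                target_dir = os.path.join(dest, rel)
                os.makedirs(target_dir, exist_ok=True)
                for name in filenames:
                    src_file = os.path.join(dirpath, name)
                    dst_file = os.path.join(target_dir, name)
                    try:
                        # copy2 preserves timestamps and permissions where possible
                        shutil.copy2(src_file, dst_file)
                    except OSError as exc:
                        # A failing drive surfaces here as I/O errors; record the
                        # file and carry on rather than aborting the whole run.
                        log.write(f"{src_file}: {exc}\n")

    if __name__ == "__main__":
        copy_tree(SOURCE, DEST)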

Post-Mortem

Our report from the incident is as follows.

Issue

Total service outage on sms-sagat

Outage Length

27 hours and 30 minutes.

Underlying Cause

An HDD failed completely in the server, leaving the RAID array degraded. However, the remaining drive then immediately began showing signs that it too was failing, resulting in the server mounting the file system as read-only as a precautionary measure against data corruption.
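
On a typical Linux host this precautionary behaviour comes from the ext file system's on-error policy: with errors=remount-ro set (in the superblock or at mount time), the kernel switches the volume to read-only the moment it detects an I/O or metadata error, rather than risk writing through bad data. Assuming an ext3/ext4 volume - the device name below is only an example - the configured policy can be confirmed with tune2fs:

    #!/usr/bin/env python3
    """Sketch: report the on-error policy of an ext3/ext4 volume.
    The device name is an example; requires root and e2fsprogs (tune2fs)."""

    import subprocess

    DEVICE = "/dev/md0"  # placeholder - use the actual array or partition

    output = subprocess.run(
        ["tune2fs", "-l", DEVICE],
        capture_output=True,
        text=True,
        check=True,  # raises if the device cannot be queried
    ).stdout

    for line in output.splitlines():
        # "Errors behavior: Remount read-only" means the kernel will flip the
        # file system to read-only as soon as it detects an error.
        if line.startswith("Errors behavior"):
            print(line)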

Symptoms

Our internal and external monitoring probes did not report a fault. The machine was still responding to ICMP and HTTP tests, and some customer websites appeared to be browsable, but the read-only file system meant that they were effectively unusable.
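
ICMP and HTTP probes only confirm that the host and the web server are answering; neither touches the disk, so a read-only data volume passes both. A check along the lines of the sketch below (a hypothetical example with a placeholder mount point, not our production monitoring code) would have flagged the fault, by attempting a small write and by looking for an unexpected "ro" flag in /proc/mounts.

    #!/usr/bin/env python3
    """Sketch of a health check that detects a read-only file system.
    The mount point and probe path are placeholders."""

    import os
    import sys

    MOUNT_POINT = "/home"  # placeholder for the volume holding customer data
    PROBE_FILE = os.path.join(MOUNT_POINT, ".rw-probe")

    def mounted_read_only(mount_point: str) -> bool:
        """Return True if /proc/mounts lists the mount with the 'ro' option."""
        with open("/proc/mounts") as mounts:
            for line in mounts:
                _device, mpoint, _fstype, options = line.split()[:4]
                if mpoint == mount_point and "ro" in options.split(","):
                    return True
        return False

    def write_probe(path: str) -> bool:
        """Return True if a small file can be written and removed."""
        try:
            with open(path, "w") as probe:
                probe.write("ok\n")
            os.remove(path)
            return True
        except OSError:
            return False

    if __name__ == "__main__":
        if mounted_read_only(MOUNT_POINT) or not write_probe(PROBE_FILE):
            print(f"CRITICAL: {MOUNT_POINT} is not writable")
            sys.exit(2)  # Nagios-style critical exit code
        print(f"OK: {MOUNT_POINT} is writable")
        sys.exit(0)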

Resolution

We immediately removed the failed drive and replaced it. We made several attempts to check the file system's health and copy the data from the failing drive to the new RAID array, eventually resulting in 100% customer data retrieval.

Retrieving the data took a considerable amount of time and was the primary cause of such an extended outage. As we do not operate off-site backups for shared hosting, this recovery was absolutely necessary in order to preserve customer data.

A secondary reason for the length of the downtime was that our monitoring platform failed to identify the issue, as the machine was responding to healthcheck tests in a normal fashion.

The servers automatically run daily and weekly HDD self-tests, but no early warning signs had been issued, so we had no indication this was about to happen.
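
These self-tests are SMART tests, and their results are typically queried with smartctl from smartmontools. The sketch below shows the kind of query involved - the device names are examples and this is not the exact script the servers run.

    #!/usr/bin/env python3
    """Sketch of a SMART health query using smartmontools.
    Device names are examples; requires root and the smartctl binary."""

    import subprocess
    import sys

    DEVICES = ["/dev/sda", "/dev/sdb"]  # example names for the drives in the array

    def smart_healthy(device: str) -> bool:
        """Run 'smartctl -H' and treat a non-zero exit status as unhealthy."""
        result = subprocess.run(
            ["smartctl", "-H", device],
            capture_output=True,  # keep the check quiet; only the exit status matters
            text=True,
        )
        # smartctl sets bits in its exit status when the drive reports problems;
        # zero means the overall health assessment passed.
        return result.returncode == 0

    if __name__ == "__main__":
        failing = [dev for dev in DEVICES if not smart_healthy(dev)]
        if failing:
            print("WARNING: SMART health check failed for: " + ", ".join(failing))
            sys.exit(1)
        print("OK: all drives report healthy")
        sys.exit(0)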

As a result of this incident, we will:

  1. Provide off-site backups for shared hosting (see the sketch below)
  2. Add additional healthchecks for a read-only disk scenario
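
For the first item, the final mechanism has not yet been chosen, but at its simplest an off-site backup is a regular copy of customer data to a machine in another location. A minimal sketch using rsync over SSH follows; the remote host, user, and paths are placeholders only.

    #!/usr/bin/env python3
    """Minimal sketch of a nightly off-site backup using rsync over SSH.
    The remote host, SSH user, and paths are placeholders only."""

    import subprocess
    import sys
    from datetime import date

    LOCAL_DATA = "/home/"  # placeholder for the customer data on the shared host
    REMOTE = "backups@offsite.example.com:/backups/sms-sagat/"  # placeholder host

    def run_backup() -> int:
        """Push an incremental copy to the off-site host; returns rsync's exit code."""
        cmd = [
            "rsync",
            "-a",        # archive mode: preserve permissions, times, symlinks
            "--delete",  # keep the remote copy in step with the source
            LOCAL_DATA,
            REMOTE,
        ]
        return subprocess.run(cmd).returncode

    if __name__ == "__main__":
        code = run_backup()
        print(f"{date.today()}: off-site backup finished with exit code {code}")
        sys.exit(code)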