No incidents reported
Some ISPs are reporting connectivity issues.
Post-Mortem
Our report on the incident is as follows.
Issue
Minor network outage
Outage Length
3 seconds
Underlying cause
One of our transit providers (Cogent) experienced a router failure within their network. Increasing CPU usage on their core router caused packets to be progressively dropped.
Symptoms
Our external monitoring probes immediately reported the fault. Some customers (whose traffic was routed over Cogent) experienced an extremely brief window (<1 minute) of slow page load times or server inaccessibility.
Resolution
When the packet loss threshold was hit, our internal BGP latency and packet loss measurement system automatically de-preferenced Cogent among the available BGP routes. Once Cogent was removed, traffic continued to flow out over our remaining carriers as normal.
Convergence took <5 seconds, but propagation at other ISPs may have taken a couple of minutes, which is why some customers may have experienced a slightly longer outage.
Our automation and monitoring systems behaved exactly as designed for this disaster scenario and recovered from the carrier failure in less than 5 seconds.
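For illustration, the automation behind this boils down to a per-carrier packet loss check that lowers the BGP local-preference of any carrier breaching a threshold, so traffic shifts to the remaining carriers. The sketch below is a simplified example only; the threshold, carrier names and apply_local_preference() helper are hypothetical and do not reflect our production tooling.

    # Simplified sketch: de-preference any carrier whose measured packet loss
    # crosses a threshold, so traffic flows out over the healthy carriers.
    LOSS_THRESHOLD_PCT = 5.0        # hypothetical cut-off for de-preferencing
    DEPREFERENCED_LOCAL_PREF = 50   # below the common BGP default of 100

    def apply_local_preference(carrier, local_pref):
        # Placeholder: in practice this would push a routing-policy change
        # to the edge routers.
        print(f"set local-preference {local_pref} for routes via {carrier}")

    def check_carriers(loss_by_carrier):
        for carrier, loss_pct in loss_by_carrier.items():
            if loss_pct >= LOSS_THRESHOLD_PCT:
                apply_local_preference(carrier, DEPREFERENCED_LOCAL_PREF)

    # Example measurement snapshot: one lossy carrier, two healthy ones.
    check_carriers({"Cogent": 37.5, "CarrierB": 0.1, "CarrierC": 0.0})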
No incidents reported
sms-sagat has stopped responding; its root partition has been remounted read-only.
Post-Mortem
Our report on the incident is as follows.
Issue
Total service outage on sms-sagat
Outage Length
23 hours and 3 minutes.
Underlying cause
An HDD developed bad sectors, which caused the RAID array to fail. Counter-intuitively, it was the healthy drive that was forced out of the array, as the faulty drive was still able to replicate its data to it.
Symptoms
Our internal and external monitoring probes did not report a fault. The machine was still responding to ICMP and HTTP tests and some customer websites appeared to be browsable, but the read-only file system meant that they were effectively unusable.
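For illustration, a probe that attempts an actual write would have caught this condition even while ICMP and HTTP checks passed. The sketch below is an example only; the probe path is hypothetical rather than part of our monitoring configuration.

    # Simplified sketch: verify the filesystem is writable, catching hosts
    # that are "up" but have had their root partition remounted read-only.
    import os
    import tempfile

    def filesystem_is_writable(path="/var/tmp"):
        # Create, write and remove a small probe file; any OSError (such as
        # EROFS on a read-only filesystem) fails the check.
        try:
            fd, probe = tempfile.mkstemp(prefix="write-probe-", dir=path)
            with os.fdopen(fd, "w") as fh:
                fh.write("probe")
            os.remove(probe)
            return True
        except OSError:
            return False

    print("OK" if filesystem_is_writable() else "CRITICAL: filesystem not writable")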
Resolution
We immediately removed the drives from the server and installed them into a brand new server to rule out a power supply, motherboard or backplane failure.
We attempted to rebuild the RAID array in order to remove the failed drive; however, this proved futile.
Instead, we were forced to create a new single-disk RAID array and then recover the data from the failing disk onto it. We did encounter issues with the MySQL databases, which raised concerns that there may have been data loss; however, all indications are that the data was retrieved without error. Retrieving the data took a considerable amount of time and was the primary cause of such an extended outage.
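For readers unfamiliar with the technique: one common way to do this is to build a RAID-1 array with a single real member and a deliberately "missing" second slot, copy the data onto it, and add a healthy mirror afterwards. The sketch below illustrates that idea only; the device names are examples, the wrapper is hypothetical, and it is not the exact procedure we ran.

    # Simplified sketch: create a degraded RAID-1 with one member plus a
    # "missing" slot, then add a replacement mirror once data is copied over.
    # dry_run=True only prints the commands; the real ones need root and are
    # destructive to the target disk.
    import subprocess

    def run(cmd, dry_run=True):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

    def recover_onto_new_disk(new_disk="/dev/sdb1", md_device="/dev/md0",
                              replacement="/dev/sdc1"):
        # "missing" tells mdadm to build the mirror with only one member.
        run(["mdadm", "--create", md_device, "--level=1",
             "--raid-devices=2", new_disk, "missing"])
        # ...the data is then copied from the failing disk onto md_device...
        # Finally a healthy disk is added so the array can rebuild:
        run(["mdadm", "--manage", md_device, "--add", replacement])

    recover_onto_new_disk()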
A brand new drive failing within 10 days of two other drive failures is unprecedented and all but impossible as a coincidence. We have not completed testing the old chassis, but at this point we can only assume a power-related issue that may have been over-volting the drives and causing them to fail.
No incidents reported
No incidents reported
No incidents reported