No incidents reported
Some ISPs are reporting connectivity issues.
Post-Mortem
Our report on the incident is as follows.
Issue
Minor network outage
Outage Length
3 seconds
Underlying cause
One of our transit providers (Cogent) experienced a router failure within their network. Increasing CPU usage on their core router caused packets to be progressively dropped.
Symptoms
Our external monitoring probes immediately reported the fault. Some customers (whose traffic was routed over Cogent) experienced an extremely brief window (<1 minute) of slow page load times or server inaccessibility.
Resolution
When the packet loss threshold was hit, our internal BGP latency and packet loss measurement system automatically de-preferenced Cogent among the available BGP routes. Once Cogent was removed, traffic continued to flow out over our remaining carriers as normal.
Convergence took <5 seconds, but propagation at other ISPs may have taken a couple of minutes, which is why some customers may have experienced a slightly longer outage.
Our automation and monitoring systems behaved exactly as designed for this disaster scenario and recovered from the carrier failure in less than 5 seconds.
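For illustration, the automation behind this boils down to a per-carrier packet loss check that lowers the BGP local-preference of any carrier breaching a threshold, so traffic shifts to the remaining carriers. The sketch below is a simplified example only; the threshold, carrier names and apply_local_preference() helper are hypothetical and do not reflect our production tooling.

    # Simplified sketch: de-preference any carrier whose measured packet loss
    # crosses a threshold, so traffic flows out over the healthy carriers.
    LOSS_THRESHOLD_PCT = 5.0        # hypothetical cut-off for de-preferencing
    DEPREFERENCED_LOCAL_PREF = 50   # below the common BGP default of 100

    def apply_local_preference(carrier, local_pref):
        # Placeholder: in practice this would push a routing-policy change
        # to the edge routers.
        print(f"set local-preference {local_pref} for routes via {carrier}")

    def check_carriers(loss_by_carrier):
        for carrier, loss_pct in loss_by_carrier.items():
            if loss_pct >= LOSS_THRESHOLD_PCT:
                apply_local_preference(carrier, DEPREFERENCED_LOCAL_PREF)

    # Example measurement snapshot: one lossy carrier, two healthy ones.
    check_carriers({"Cogent": 37.5, "CarrierB": 0.1, "CarrierC": 0.0})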
No incidents reported
sms-sagat has stopped responding; its root partition has been remounted read-only.
Post-Mortem
Our report on the incident is as follows.
Issue
Total service outage on sms-sagat
Outage Length
23 hours and 3 minutes.
Underlying cause
An HDD developed bad sectors, which caused the RAID array to fail. Counter-intuitively, it was the healthy drive that was forced out of the array, as the faulty drive was still able to replicate its data to it.
Symptoms
Our internal and external monitoring probes did not report a fault. The machine was still responding to ICMP and HTTP tests and some customer websites appeared to be browsable, but the read-only file system meant that they were effectively unusable.
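For illustration, a probe that attempts an actual write would have caught this condition even while ICMP and HTTP checks passed. The sketch below is an example only; the probe path is hypothetical rather than part of our monitoring configuration.

    # Simplified sketch: verify the filesystem is writable, catching hosts
    # that are "up" but have had their root partition remounted read-only.
    import os
    import tempfile

    def filesystem_is_writable(path="/var/tmp"):
        # Create, write and remove a small probe file; any OSError (such as
        # EROFS on a read-only filesystem) fails the check.
        try:
            fd, probe = tempfile.mkstemp(prefix="write-probe-", dir=path)
            with os.fdopen(fd, "w") as fh:
                fh.write("probe")
            os.remove(probe)
            return True
        except OSError:
            return False

    print("OK" if filesystem_is_writable() else "CRITICAL: filesystem not writable")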
Resolution
We immediately removed the drives from the server and installed them into a brand new server to rule out a power supply, motherboard or backplane failure.
We attempted to rebuild the RAID array in order to remove the failed drive; however, this proved futile.
Instead, we were forced to create a new single-disk RAID array and then recover the data from the failing disk onto it. We did encounter issues with the MySQL databases, which raised concerns that there may have been data loss; however, all indications are that the data was retrieved without error. Retrieving the data took a considerable amount of time and was the primary cause of such an extended outage.
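For readers unfamiliar with the technique: one common way to do this is to build a RAID-1 array with a single real member and a deliberately "missing" second slot, copy the data onto it, and add a healthy mirror afterwards. The sketch below illustrates that idea only; the device names are examples, the wrapper is hypothetical, and it is not the exact procedure we ran.

    # Simplified sketch: create a degraded RAID-1 with one member plus a
    # "missing" slot, then add a replacement mirror once data is copied over.
    # dry_run=True only prints the commands; the real ones need root and are
    # destructive to the target disk.
    import subprocess

    def run(cmd, dry_run=True):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

    def recover_onto_new_disk(new_disk="/dev/sdb1", md_device="/dev/md0",
                              replacement="/dev/sdc1"):
        # "missing" tells mdadm to build the mirror with only one member.
        run(["mdadm", "--create", md_device, "--level=1",
             "--raid-devices=2", new_disk, "missing"])
        # ...the data is then copied from the failing disk onto md_device...
        # Finally a healthy disk is added so the array can rebuild:
        run(["mdadm", "--manage", md_device, "--add", replacement])

    recover_onto_new_disk()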
A brand new drive failing within 10 days of two other drive failures is unprecedented and all but impossible as a coincidence. We have not completed testing the old chassis, but at this point we can only assume a power-related issue that may have been over-volting the drives and causing them to fail.
No incidents reported
No incidents reported
No incidents reported