sms-sagat stopped responding after its root partition was remounted read-only. Our report on the incident follows.
Total service outage on sms-sagat: 23 hours and 3 minutes.
An HDD developed bad sectors, which caused the RAID array to fail. Counter-intuitively, it was the healthy drive that was forced out of the array: the faulty drive was unable to replicate its data to it, so the resync was aborted and the healthy drive was ejected.
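For context, this kind of partial failure is visible from the array's member map. Below is a minimal sketch of a check that flags degraded arrays; it assumes Linux software RAID (mdadm) exposing state via /proc/mdstat, which is our assumption for illustration rather than a statement about this machine's exact setup.

    import re

    def degraded_arrays(mdstat_path="/proc/mdstat"):
        """Return the names of md arrays running with a missing member."""
        with open(mdstat_path) as f:
            text = f.read()
        degraded = []
        # Each array stanza starts at column 0, e.g. "md0 : active raid1 ..."
        for name, body in re.findall(r"^(md\d+) : (.*?)(?=^md\d+ :|\Z)",
                                     text, re.M | re.S):
            # The member map reads "[UU]" when healthy and "[U_]" when a
            # drive has been kicked out of the array.
            status = re.search(r"\[([U_]+)\]", body)
            if status and "_" in status.group(1):
                degraded.append(name)
        return degraded

    if __name__ == "__main__":
        for name in degraded_arrays():
            print(f"CRITICAL: {name} is running degraded")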
Our internal and external monitoring probes did not report a fault. The machine was still responding to ICMP and HTTP tests, and some customer websites appeared to be browsable, but the read-only file system meant that they were effectively unusable.
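In hindsight, a probe that exercises the filesystem, rather than just the network stack, would have caught this. A minimal sketch of such a check; the canary path is an illustrative assumption:

    import os

    def root_is_writable(path="/"):
        """True if the filesystem holding `path` is mounted read-write."""
        return not (os.statvfs(path).f_flag & os.ST_RDONLY)

    def write_probe(canary="/var/tmp/monitoring-canary"):
        """Actually write and remove a file, catching EROFS and friends."""
        try:
            with open(canary, "w") as f:
                f.write("ok\n")
            os.unlink(canary)
            return True
        except OSError:
            return False

    if __name__ == "__main__":
        if not (root_is_writable() and write_probe()):
            print("CRITICAL: root filesystem is not writable")

Run from cron at a short interval, a check along these lines would have flagged the fault within minutes rather than leaving it to be discovered indirectly.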
We immediately removed the drives from the server and installed them in a brand-new server to rule out power supply, motherboard or backplane failure.
We attempted to rebuild the RAID array so that the failed drive could be removed; however, this proved futile.
Instead, we were forced to create a new single-disk RAID array and recover the data from the failing disk onto it. We encountered issues with the MySQL databases, which raised the concern that there may have been data loss; however, all indications are that the data was recovered without error. The recovery took a considerable amount of time and was the primary cause of such an extended outage.
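For reference, one way to gain confidence in recovered MySQL data is to script a pass of the standard mysqlcheck utility over every table. The sketch below is illustrative rather than a record of the exact checks we ran, and assumes client credentials are available (e.g. via ~/.my.cnf):

    import subprocess

    def suspect_tables():
        """Run CHECK TABLE across all databases and return non-OK lines."""
        result = subprocess.run(
            ["mysqlcheck", "--all-databases", "--check"],
            capture_output=True, text=True, check=True,
        )
        # mysqlcheck prints one "db.table  OK" line per healthy table;
        # anything else (corruption, warnings) deserves a closer look.
        return [line for line in result.stdout.splitlines()
                if line.strip() and not line.rstrip().endswith("OK")]

    if __name__ == "__main__":
        problems = suspect_tables()
        if problems:
            print("\n".join(problems))
        else:
            print("all tables OK")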
A brand-new drive failing within 10 days of two other drive failures is unprecedented in our experience, and vanishingly unlikely to be chance. We have not finished testing the old chassis, but at this point we can only assume a power-related issue, such as over-volting, has been stressing the drives and causing them to fail.
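Until that testing is complete, the practical mitigation is earlier warning of drive distress. The sketch below polls standard SMART attributes via smartctl (from smartmontools); the choice of attributes to watch and the zero-tolerance threshold are illustrative assumptions:

    import subprocess

    WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
               "Offline_Uncorrectable", "UDMA_CRC_Error_Count")

    def smart_warnings(device):
        """Return watched SMART attributes with a non-zero raw value."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        warnings = []
        for line in out.splitlines():
            fields = line.split()
            # Attribute rows have ten columns: the name is the second
            # field and the raw value is the last.
            if len(fields) >= 10 and fields[1] in WATCHED and fields[9] != "0":
                warnings.append((fields[1], fields[9]))
        return warnings

    if __name__ == "__main__":
        for attr, raw in smart_warnings("/dev/sda"):
            print(f"WARNING: {attr} raw value is {raw}")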