sms-sagat has stopped responding; initial indications point to an issue with the RAID controller. Updates to follow.
Our report on the incident is as follows.
Total service outage on sms-sagat lasting 27 hours and 30 minutes.
An HDD completely failed in the server, leaving the RAID array degraded. The remaining drive then immediately began showing signs of failure, causing the server to mount the filesystem read-only as a precaution against data corruption.
Our internal and external monitoring probes did not report a fault. The machine was still responding to ICMP and HTTP tests, and some customer websites appeared browsable, but the read-only filesystem meant they were effectively unusable.
We immediately removed the failed drive and replaced it. We made several attempts to check the filesystem health and copy the data from the failing drive to the new RAID array, eventually recovering 100% of customer data.
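For context, a block-level rescue copy from a failing drive is commonly performed with a tool such as GNU ddrescue, which records progress in a map file so interrupted runs can resume without re-reading good sectors. The commands below are an illustrative sketch only; the device names and map path are hypothetical, not the exact procedure we ran:

```shell
# Illustrative only: copy a failing drive onto a fresh array with
# GNU ddrescue. /dev/sdb, /dev/md0 and the map path are hypothetical.
ddrescue -f -n /dev/sdb /dev/md0 /root/rescue.map   # first pass: skip bad areas
ddrescue -f -r3 /dev/sdb /dev/md0 /root/rescue.map  # retry bad sectors up to 3 times
fsck -fy /dev/md0                                   # then repair the copied filesystem
```

The two-pass approach recovers the easy sectors first, which is why retrieval of the damaged areas dominates the overall time taken.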
Retrieving the data took a considerable amount of time and was the primary cause of such an extended outage. As we do not operate off-site backups for shared hosting, this recovery was the only way to restore customer data, and so was absolutely necessary.
A secondary reason for the downtime duration was that our monitoring platform failed to identify the issue, since the machine was responding normally to healthcheck tests.
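One way to close this monitoring gap is a probe that performs an actual write rather than relying on ICMP or HTTP reachability, since both can succeed against a read-only filesystem. A minimal sketch in shell (the function name and paths are illustrative, not our actual tooling):

```shell
# Hypothetical write-probe check: attempts a real write, because a
# read-only filesystem still answers pings and may still serve pages.
check_writable() {
    # Try to create and remove a throwaway file under the given path.
    probe="$1/.monitor-probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        return 0    # filesystem accepts writes
    fi
    return 1        # read-only (or otherwise failing) filesystem
}

# Example: probe a path and report; /tmp stands in for a hosting docroot.
check_writable /tmp && echo "OK: writable" || echo "CRITICAL: read-only"
```

The exit codes map naturally onto a monitoring check: 0 for healthy, non-zero to raise an alert.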
The servers automatically run daily and weekly HDD self-tests, but no early warning signs had been issued, so we had no indication this failure was imminent.
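Scheduled self-tests of this kind are typically configured through smartmontools. As a sketch, a smartd.conf entry such as the one below (device name and mail address are illustrative) runs a daily short test and a weekly long test, and additionally alerts on degrading SMART attributes rather than only on failed self-tests:

```
# /etc/smartd.conf sketch (assumes smartmontools; /dev/sda is illustrative)
# -a: monitor all SMART attributes, -o on: enable automatic offline testing,
# -s: short self-test daily at 02:00, long self-test Saturdays at 03:00,
# -m: mail the given address when a problem is detected.
/dev/sda -a -o on -s (S/../.././02|L/../../6/03) -m ops@example.com
```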
As a result of this incident, we will