Disruption
Post-Mortem
Our report on the incident is as follows.
Issue
Loss of connectivity, high load and periods of unavailability for a single rack of equipment.
Outage Length
The duration was up to 18 minutes.
Underlying cause
A Power Supply Unit (PSU) in another server within the same rack, connected to the B power feed, failed catastrophically. Unlike a typical PSU failure, this failure caused a surge in electrical activity, which tripped the Moulded Case Circuit Breaker (MCCB) for the B power feed and cut off power.
When power was restored to the B feed, the access switch booted into a “fail-safe” environment in which its configuration was not fully restored. The stack detected that the physical link was back up and began routing traffic over the B network; however, that traffic was not being routed correctly.
Symptoms
Our facilities monitoring and service monitoring probes reported the incident immediately. Customers would have experienced anything from slow page load times to a completely inaccessible site.
Resolution
Network engineers immediately restored the running configuration of the affected access switches, and services gradually returned to normal. The disruption to network flow and the resulting packet loss delayed service restoration in the multi-server stack, as each stack member was unable to communicate cleanly with its peers. This in turn caused load averages to increase, delaying service restoration even after network connectivity was fully operational.
Prevention
Monitoring, disaster recovery procedures and staff action all worked exactly as anticipated: the issue was rapidly identified and resolved, and downtime was kept to a minimum.
This was an exceptional situation, and we do not believe anything could have been done differently to reduce or avoid the downtime.