Sonassi

Disruption

Update (18:05): Suspected localised network failure affecting a small number of servers.
Update (18:15): The issue looks to be isolated to a single rack and is believed to be power related.
Update (18:30): A circuit breaker has tripped in a single rack, resulting in the fail-over of HA stacks and brief periods of unavailability for single-server stacks.
Update (18:37): Power has been restored, but on restoration multiple access switches on the B network appear to have booted back up in a "safe-default" state. The device configurations are being re-provisioned.
Update (18:45): Services are operating normally and being monitored.

Post-Mortem

Our report from the incident is as follows.

Issue

Loss of connectivity, high load and periods of unavailability for a single rack of equipment.

Outage Length

The duration was up to 18 minutes.

Underlying cause

A Power Supply Unit (PSU) in another server within the same rack, connected to the B power feed, failed catastrophically. Unlike a typical PSU failure, this caused an electrical spike/surge, which tripped the Moulded Case Circuit Breaker (MCCB) for the B power feed and cut off power.

On restoration of power to the B feed, the access switch booted into a “fail-safe” environment in which its configuration was not fully restored. The stack detected that the physical link was back up and began routing traffic over the B network; however, that traffic was not being routed correctly.
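The failure mode described above can be illustrated with a small toy model. It is only a sketch under stated assumptions, not a description of the actual stack or switch configuration: the names Path, link_up, config_loaded, selected_paths and blackholed_paths are hypothetical, and the model assumes a path is chosen purely on physical link state.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """Hypothetical model of one network path as seen by a server stack."""
    name: str
    link_up: bool        # physical link state reported to the stack
    config_loaded: bool  # whether the switch has its full configuration


def selected_paths(paths):
    """Paths that get used when only physical link state is considered."""
    return [p for p in paths if p.link_up]


def blackholed_paths(paths):
    """Names of selected paths whose switch cannot route traffic correctly."""
    return [p.name for p in selected_paths(paths) if not p.config_loaded]


# Assumed state just after power returned to the B feed: both links are up,
# but the B switch is running without its full configuration.
paths = [
    Path("A", link_up=True, config_loaded=True),
    Path("B", link_up=True, config_loaded=False),
]

print([p.name for p in selected_paths(paths)])  # ['A', 'B']
print(blackholed_paths(paths))                  # ['B']
```

In this model the B path is selected because its link is up, yet anything sent over it is not delivered correctly, which matches the behaviour described until the switch configuration was restored.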

Symptoms

Our facilities monitoring and service monitoring probes immediately reported the incident. Customers would have experienced anything from slow page load times to a completely inaccessible site.

Resolution

Network engineers immediately restored the running configuration of the affected access switches, and services gradually returned to normal. The disruption to network flow and the resulting packet loss delayed service restoration in the multi-server stacks, as each stack member was unable to communicate cleanly with its peers. This in turn caused load averages to increase, delaying service restoration even after network connectivity was fully operational.

Prevention

Monitoring, disaster recovery procedures and staff action all worked exactly as anticipated, rapidly identifying and resolving the issue and keeping downtime to a minimum.

This was an exceptional situation, and we do not believe anything could have been done differently to reduce or avoid the downtime.

Monday 9th July 2018