Experiencing packet loss within network core.
Post-Mortem
Our report from the incident is as follows.
Issue
Significant packet loss: over 50% of packets destined for a single rack of equipment were dropped, with a secondary effect of under 10% loss within the network core. This significantly affected servers at Joule House and caused a total loss of service to the servers connected to the affected access switch stack.
Outage Length
The duration was 63 minutes.
Underlying cause
Flooding within a single access switch, causing significant control plane CPU consumption within other network devices.
Symptoms
Our external monitoring probes reported the fault immediately. End users will have noticed the issue, as it affected overall service.
Resolution
The switch generating the traffic was observed to be consuming 100% CPU. It was initially power cycled in the hope that the device would become responsive again. Unfortunately, the issue propagated to the remaining 5 switches within the stack (in a single rack), causing further problems.
To avoid major network disruption for the entire location, all access switches were powered off simultaneously, then powered back up one at a time. This restored service and resolved the issue at hand.
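The power-off-then-staged-power-on sequence described above can be sketched roughly as follows. This is only an illustration, not the procedure or tooling actually used during the incident: the function `staged_restart` and the `power_off`, `power_on` and `is_healthy` callbacks are hypothetical names standing in for whatever power control and health checks are available on site.

```python
from typing import Callable, Iterable, List


def staged_restart(
    switches: Iterable[str],
    power_off: Callable[[str], None],    # hypothetical: cuts power to one switch
    power_on: Callable[[str], None],     # hypothetical: restores power to one switch
    is_healthy: Callable[[str], bool],   # hypothetical: True once the switch is stable
) -> List[str]:
    """Power off every switch first, then bring them back one at a time."""
    names = list(switches)

    # Phase 1: power everything off so no switch can keep flooding the others.
    for name in names:
        power_off(name)

    # Phase 2: power on one switch at a time, stopping at the first failure.
    restored: List[str] = []
    for name in names:
        power_on(name)
        if not is_healthy(name):
            raise RuntimeError(f"{name} did not recover; halting staged restart")
        restored.append(name)
    return restored
```

Stopping at the first unhealthy switch keeps a faulty unit from being joined by the rest of the stack before it has been checked.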
Technical reports will be submitted to the vendor for analysis. However, with access-layer networking upgrades upcoming, it is unlikely this will be pursued further.