Tuesday 17th February 2015

Network Interruption

Experiencing packet loss within the network core.

  • Update (20:54): The issue has been traced to an access switch flooding the upstream network with traffic. As it is non-responsive, it is being power cycled, which can take up to 5 minutes to complete.
  • Update (21:05): A power cycle hasn’t resolved the issue. An engineer is on site investigating now.
  • Update (21:57): The failed device has been removed from the network completely and replaced with a new unit. The device configuration has been applied and traffic appears to be flowing normally amongst other access switches in the stack.


Our report from the incident is as follows.


Significant packet loss caused over 50% of packets to be dropped to a single rack of equipment, with a secondary symptomatic effect of <10% loss within the network core. This had a significant effect on servers at Joule House, causing a total loss of service to the servers connected to the affected access switch stack.

Outage Length

The duration was 63 minutes.

Underlying cause

Traffic flooding from a single access switch, which caused significant control-plane CPU consumption on other network devices.


Our external monitoring probes immediately reported the fault. End users will have noticed the issue, as it affected overall service.
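A probe of this kind can be sketched as a simple loss-rate check. The sample counts and 50% alert threshold below are illustrative assumptions (the threshold mirrors the loss level seen in this incident), not our actual monitoring configuration, and no real ICMP traffic is sent.

```python
# Hypothetical sketch of an external packet-loss probe: given a batch of
# probes sent and replies received, compute the loss percentage and
# decide whether to raise an alert.

def loss_percentage(sent: int, received: int) -> float:
    """Return the percentage of probes that were lost."""
    if sent == 0:
        raise ValueError("no probes sent")
    return 100.0 * (sent - received) / sent

def should_alert(sent: int, received: int, threshold: float = 50.0) -> bool:
    """Alert when observed loss meets or exceeds the threshold."""
    return loss_percentage(sent, received) >= threshold

# Example: 100 probes sent, 45 replies received -> 55% loss, alert fires.
print(should_alert(100, 45))  # True
```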


The switch generating the traffic was observed to be consuming 100% CPU; it was initially power cycled in the hope that the device would become responsive again. Unfortunately, the issue propagated to the remaining five switches within the stack (in a single rack), generating further problems.

To avoid major network disruption across the entire location, all access switches were powered off simultaneously, then powered back up one at a time. This restored service and resolved the issue.
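The recovery procedure above can be sketched as follows. The `PDU` class, outlet names, and delay are all hypothetical stand-ins, not a real PDU API, and in practice each stack member takes minutes rather than seconds to rejoin.

```python
import time

# Hypothetical sketch of the staggered power-up procedure: cut power to
# every access switch in the stack at once to stop the flooding, then
# restore power one switch at a time so the stack can reform cleanly.

class PDU:
    """Stand-in for a managed power distribution unit (not a real API)."""
    def __init__(self, outlets):
        self.state = {outlet: "on" for outlet in outlets}

    def power_off(self, outlet):
        self.state[outlet] = "off"

    def power_on(self, outlet):
        self.state[outlet] = "on"

def staggered_restart(pdu, outlets, delay_seconds=0):
    # Power everything off simultaneously.
    for outlet in outlets:
        pdu.power_off(outlet)
    # Bring switches back one at a time, pausing between members.
    for outlet in outlets:
        pdu.power_on(outlet)
        time.sleep(delay_seconds)  # minutes in practice, not seconds

switches = ["sw1", "sw2", "sw3", "sw4", "sw5", "sw6"]
pdu = PDU(switches)
staggered_restart(pdu, switches)
print(all(state == "on" for state in pdu.state.values()))  # True
```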

Technical reports will be submitted to the vendor for analysis; however, given upcoming access-layer networking upgrades, it is unlikely this will be pursued further.