Disruption
Post-Mortem
Our report on the incident is as follows.
Issue
Loss of connectivity, high load and periods of unavailability for the entire MA3 facility and a single isolated network segment.
Outage Length
The duration was between 60 and 180 minutes.
Underlying cause
Unfortunately, this was a repeat incident, similar in nature to https://status.sonassi.com/incident/149/
We believe a malfunctioning aggregation switch within the backbone of the network core began sending out malformed/erroneous L2 packets, driving up CPU utilisation on other routers and switches. Control plane traffic was disrupted and multi-chassis aggregated links degraded, resulting in loss of downstream connectivity to rack pods and, subsequently, to customer stacks.
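For illustration only, the minimal sketch below shows one way an abnormal flood of L2 frames of this kind could be spotted from a Linux host attached to the affected segment, by counting raw frames per second on an interface. The interface name and alert threshold are hypothetical and this is not part of our actual tooling.

```python
#!/usr/bin/env python3
"""Minimal sketch: count raw L2 frames per second on an interface to spot
an abnormal flood such as a malformed-packet storm.
Requires Linux and root privileges; interface and threshold are hypothetical."""
import socket
import time

IFACE = "eth0"            # hypothetical interface attached to the affected segment
THRESHOLD_FPS = 50_000    # hypothetical alert threshold, frames per second

# An AF_PACKET raw socket receives every frame seen on the interface (ETH_P_ALL = 0x0003).
sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))
sock.bind((IFACE, 0))

window_start = time.monotonic()
frames = 0
while True:
    sock.recvfrom(65535)          # read one frame; contents discarded, we only count
    frames += 1
    elapsed = time.monotonic() - window_start
    if elapsed >= 1.0:
        rate = frames / elapsed
        if rate > THRESHOLD_FPS:
            print(f"ALERT: {rate:.0f} frames/s on {IFACE} exceeds {THRESHOLD_FPS}")
        frames = 0
        window_start = time.monotonic()
```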
Because the symptoms differed from those of the last incident, diagnosis was initially led down an incorrect path of resolution, resulting in extended resolution times and the isolation of an entire network segment (a single rack pod).
Symptoms
Our facilities monitoring and service monitoring probes immediately reported the incident. Customers would have experienced anything from slow page load times through to a completely inaccessible site.
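As an illustration of the kind of service probe involved, the sketch below measures page load time and flags slow or unreachable sites. The URLs and thresholds are hypothetical; this is not our actual monitoring configuration.

```python
#!/usr/bin/env python3
"""Minimal sketch of a service probe: measure page load time and flag slow
or unreachable sites. URLs and thresholds are hypothetical examples."""
import time
import urllib.error
import urllib.request

SITES = ["https://www.example.com/"]   # hypothetical list of monitored endpoints
SLOW_THRESHOLD = 2.0                   # seconds; hypothetical "slow page load" limit
TIMEOUT = 10.0                         # seconds before the site is treated as down

for url in SITES:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
            resp.read()
            elapsed = time.monotonic() - start
            if elapsed > SLOW_THRESHOLD:
                print(f"WARN {url}: slow page load ({elapsed:.2f}s)")
            else:
                print(f"OK   {url}: {elapsed:.2f}s (HTTP {resp.status})")
    except (urllib.error.URLError, OSError) as exc:
        print(f"DOWN {url}: {exc}")
```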
Resolution
The malfunctioning aggregation switch, a repeat of the previous incident, appeared to be the source of the increased CPU load throughout the network; permanently powering off the device (with a replacement to follow) resolved the underlying issue.
Prevention
The network architecture and equipment in use are more than adequate, offering extreme levels of availability, with multiple layers of redundancy designed into every tier of the network stack. However, a failing device caused significant network disruption despite never explicitly failing or demonstrating signs of error or malfunction, other than the fact that the network operated successfully once the device was removed.
Whilst rare, eventual degradation of the switch silicon is believed to be the cause, and the device (and its paired device) will be replaced with latest-generation hardware.