All systems are operational

Past Incidents

Saturday 29th June 2019

No incidents reported

Friday 28th June 2019

No incidents reported

Thursday 27th June 2019

Network Interruption

Disruption

  • Update (17:47): Some customers have reported connectivity issues with their stacks.
  • Update (18:25): The connectivity has now been restored on the majority of our network but some customers will still be affected.
  • Update (19:21): We are still suffering connectivity issues across our network, our engineers are investigating.
  • Update (19:34): The majority of our network has connectivity but some customers are still affected.
  • Update (20:09): Connectivity has been restored. Please get in touch with us if you are still experiencing issues.
  • Update (21:30): Services are confirmed fully operational and the underlying issue appears to have been tracked to a failing/malfunctioning core aggregation switch. This has been permanently powered off and will be replaced. A full RFO will be available within 24 hours.

Post-Mortem

Our report from the incident is as follows.

Issue

Loss of connectivity, high load and periods of unavailability for the entire MA3 facility and a single isolated network segment.

Outage Length

The duration was between 60 and 180 minutes.

Underlying cause

Unfortunately, this was a repeat incident, similar in nature to https://status.sonassi.com/incident/149/

We believe a malfunctioning aggregation switch in the backbone of the network core began sending out malformed/erroneous L2 packets, driving up CPU utilisation on other routers and switches. Control plane traffic was disrupted and multi-chassis aggregated links degraded, resulting in loss of downstream connectivity to rack pods and, in turn, customer stacks.

Because the symptoms differed from the previous incident, diagnosis was initially led down an incorrect path of resolution, extending resolution times and resulting in the isolation of an entire network segment (a single rack pod).

Symptoms

Our facilities monitoring and service monitoring probes immediately reported the incident. Customers would have experienced anything from slow page load times to a completely inaccessible site.

Resolution

A repeat incident of a malfunctioning aggregation switch appeared to be the source of increased CPU load throughout the network; permanently powering off the device (with subsequent replacement due) resolved the underlying issue.

Prevention

The network architecture and equipment in use is more than adequate, offering extreme levels of availability, with multiple layers of redundancy designed into every tier of the network stack. However, a failed/failing device caused significant network disruption despite never having explicitly failed or demonstrated signs of error or malfunction - other than the successful operation of the network in its absence.

Whilst rare, eventual degradation of the switch silicon is believed to be the cause, and the device (and its paired device) will be replaced with latest generation hardware.

Wednesday 26th June 2019

No incidents reported

Tuesday 25th June 2019

No incidents reported

Monday 24th June 2019

3rd Party Service Interruption

Disruption

  • Update (11:40): Customers are reporting issues with 504 Gateway Timeouts; the common factor between affected customers is the use of Cloudflare. We are recommending that all customers disable Cloudflare completely before raising a ticket for support.
  • Update (12:08): Cloudflare have now updated their status page to reflect an outage.
  • Update (17:51): Cloudflare have reported the issue resolved at 13:02 UTC; we haven't had any further reports of issues, so we are marking this resolved.

Sunday 23rd June 2019

Network Interruption

Disruption

  • Update (13:04): We are currently investigating an incident affecting core routers at MA3.
  • Update (13:37): The issue has been identified and fixed and we are in the process of restoring connectivity on individual customer stacks.
  • Update (14:39): The majority of affected customer stacks are online, and we are working through the remaining affected stacks.
  • Update (16:25): All services are now fully restored. Customers are advised to raise an emergency ticket if they are still experiencing issues.

Post-Mortem

Our report from the incident is as follows.

Issue

Loss of connectivity, high load and periods of unavailability for the entire MA3 facility.

Outage Length

The duration was between 55 and 95 minutes.

Underlying cause

A surge in CPU load on both core routers caused disruption to control plane traffic, although forwarding and routing were operating without issue.

Customer stacks feature two network interfaces, attached to two diverse switching and routing networks; the active interface is selected by performing a reachability check (using ARP) to its respective gateway. As the core routers were not responding to control plane traffic, ARP requests were being dropped, resulting in stacks taking both primary and secondary interfaces offline, ultimately severing all connectivity.

HA stacks were further affected by this issue: with their interfaces shut down, cluster members were unable to reach one another, resulting in a "split brain" scenario. Even when connectivity was restored, manual intervention was required to address the split brain.
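For illustration, the sketch below (hypothetical interface names and gateway address, not our production tooling) shows the ARP-based interface selection described above, and why a gateway that stops answering ARP leads to both the primary and secondary interfaces being taken offline.

```python
# Minimal sketch of ARP-based active-interface selection. Interface
# names and the gateway address are placeholders for illustration only.
import subprocess

INTERFACES = ["eth0", "eth1"]   # primary and secondary NICs (hypothetical)
GATEWAY_IP = "192.0.2.1"        # respective gateway used as the ARP target

def gateway_reachable(interface: str) -> bool:
    """Send a single ARP probe to the gateway via `arping` over the
    given interface and report whether a reply was received."""
    result = subprocess.run(
        ["arping", "-c", "1", "-w", "1", "-I", interface, GATEWAY_IP],
        capture_output=True,
    )
    return result.returncode == 0

def select_active_interface() -> str | None:
    """Return the first interface whose gateway answers ARP, or None
    when every check fails - the failure mode seen in this incident,
    where the stack takes all of its interfaces offline."""
    for iface in INTERFACES:
        if gateway_reachable(iface):
            return iface
    return None
```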

Symptoms

Our facilities monitoring and service monitoring probes immediately reported the incident. Customers would have experienced anything from slow page load times to a completely inaccessible site.

Resolution

A malfunctioning aggregation switch appeared to be the source of increased CPU load throughout the network; rebooting the affected switch was sufficient to allow CPU loads to drop and for stacks to bring their network interfaces back up after a successful ARP check.

Prevention

We have identified several areas in which improvements can be made:

  • In the event of all interfaces failing ARP checks, interface monitoring should fall back to "MII" monitoring (Layer 1, versus Layer 3). Should this incident happen again, all stacks will restore their connectivity within 60 seconds, greatly reducing the impact and subsequent downtime.
    This fix has already been deployed.
  • Only a single ARP target was previously set (the redundant gateway), which led to all ARP traffic being sent to a single core router. This has been expanded to include the second core router and, where applicable, other HA cluster members. This should greatly reduce the likelihood of HA failover and a subsequent split brain occurring.
    This fix has already been deployed. Both changes are illustrated in the sketch below.
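As an illustration of the two changes together, the sketch below (hypothetical addresses and helper names, not the code deployed on customer stacks) probes multiple ARP targets rather than a single gateway and, only when every ARP target fails, falls back to a Layer 1 carrier (MII-style) check, so an interface is only taken down when the link itself is lost.

```python
# Illustrative sketch only: placeholder IPs and helpers, showing the
# "multiple ARP targets + MII fallback" behaviour described above.
import subprocess

ARP_TARGETS = ["192.0.2.1", "192.0.2.2"]   # both core routers (placeholder IPs)

def arp_ok(interface: str, target: str) -> bool:
    """True if `target` answers a single ARP probe sent on `interface`."""
    return subprocess.run(
        ["arping", "-c", "1", "-w", "1", "-I", interface, target],
        capture_output=True,
    ).returncode == 0

def carrier_ok(interface: str) -> bool:
    """Layer 1 (MII-style) check: does the kernel report link/carrier?
    Reading `carrier` raises OSError if the interface is administratively
    down, which we treat as no link."""
    try:
        with open(f"/sys/class/net/{interface}/carrier") as fh:
            return fh.read().strip() == "1"
    except OSError:
        return False

def interface_usable(interface: str) -> bool:
    """Prefer Layer 3 evidence (any ARP target reachable); if every ARP
    target fails, revert to the Layer 1 carrier check rather than
    declaring the interface dead."""
    if any(arp_ok(interface, t) for t in ARP_TARGETS):
        return True
    return carrier_ok(interface)
```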

In addition to the above, deploying newer generation aggregation switches and core routers will go a long way towards addressing control plane capacity. Substantial investment in extremely high capacity, latest generation hardware has already been made, and the timeline for replacement of existing hardware will be brought forward.