Wednesday 14th May 2014

Network connectivity fault

Some ISPs are reporting connectivity issues.

  • Update (11:56): One of our transit providers (Cogent) has suffered a total router failure in Manchester, causing 100% traffic loss for the routes out of that network.
  • Update (11:58): Our routers automatically removed Cogent from the routing pool, and traffic is flowing over other carriers. Failover was automatic and instantaneous; BGP route updates converged within around 3 minutes, as expected.
  • Update (12:51): Cogent confirmed that a line card failed in a core router in Manchester, which led to the subsequent packet loss. The failed line card has been replaced and service is 100% restored. Our routers have once again begun sending traffic over Cogent.

Post-Mortem

Our report from the incident is as follows.

Issue

Minor network outage

Outage Length

3 seconds

Underlying cause

One of our transit providers (Cogent) experienced a router failure within their network. Increasing CPU usage on their core router caused packets to be progressively dropped.

Symptoms

Our external monitoring probes immediately reported the fault. Some customers (those whose traffic was routed over Cogent) experienced an extremely brief window (<1 minute) of slow page load times or server inaccessibility.

Resolution

Once the packet loss threshold was hit, our internal BGP latency and packet loss measuring device automatically de-preferenced Cogent from the available BGP routes. Once Cogent was removed, traffic continued to flow out over our remaining carriers as normal.
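The threshold-based behaviour described above can be illustrated with a minimal sketch. This is not our production tooling: the threshold value, the Carrier class, the update_preferences function and the carrier names other than Cogent are all hypothetical, chosen only to show the idea of de-preferencing a carrier once its measured packet loss crosses a limit.

```python
from dataclasses import dataclass

# Hypothetical threshold; the real value is not stated in this report.
LOSS_THRESHOLD = 0.05


@dataclass
class Carrier:
    name: str
    packet_loss: float  # fraction of probes lost, 0.0 to 1.0
    preferred: bool = True


def update_preferences(carriers, threshold=LOSS_THRESHOLD):
    """De-preference carriers at or above the loss threshold, restore the rest."""
    for carrier in carriers:
        carrier.preferred = carrier.packet_loss < threshold
    # Never withdraw every carrier: if all are failing, keep the least lossy one.
    if carriers and not any(c.preferred for c in carriers):
        min(carriers, key=lambda c: c.packet_loss).preferred = True
    return [c.name for c in carriers if c.preferred]


# "CarrierB" and "CarrierC" are placeholder names for the remaining carriers.
carriers = [
    Carrier("Cogent", packet_loss=1.0),
    Carrier("CarrierB", packet_loss=0.0),
    Carrier("CarrierC", packet_loss=0.01),
]
active = update_preferences(carriers)
print(active)  # ['CarrierB', 'CarrierC']
```

In this sketch, running the same function again after Cogent's measured loss drops back below the threshold returns it to the preferred set, which mirrors the 12:51 update.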

Convergence on our network took <5 seconds, but propagation to other ISPs may have taken a couple of minutes, which is why some customers may have experienced a slightly longer outage.

Our automation and monitoring systems behaved exactly as designed for this disaster scenario and recovered from the carrier failure in less than 5 seconds.