Thursday 5th December 2013

Network connectivity fault

We have become aware of a core network issue with one of our providers. Updates to follow.

  • Update (12:14): Core services restored
  • Update (12:28): Customer services have been restored

Post-Mortem

Our report on the incident is as follows.

Issue

Packet loss affecting worldwide connectivity.

Outage Length

25 minutes

Underlying cause

One of our carriers had notified us that maintenance would be conducted between 12am and 6am. We had made provisions for this and expected our systems to fail over automatically to our other carrier while the first was unavailable.

However, the failover did not occur as planned because our other carrier appears to have been affected as well.

Symptoms

Our internal and external monitoring probes immediately reported a fault.

Resolution

Despite repeated attempts, the BGP sessions could not be re-established. As a last resort, an on-site technician rebooted both routers, one after the other. This re-established the BGP sessions and restored connectivity.

We believe there may be shared infrastructure between the two carriers (fibre, conduits or backhaul). We have launched an investigation to determine why and how both were affected at the same time.

Services are being closely monitored, and we will likely conduct failover tests throughout the week to verify that failover works and to guard against future failures.